Neural Language Generation Head to Toe

NLG Head to Toe @hadyelsahar
NLG head to toe- @Hadyelsahar
Hady Elsahar
@hadyelsahar
Hady.elsahar@naverlabs.com
1
Neural Language
Generation
Head to Toe

PhD.
2019
Research
Scientist
2020
About Me
http://hadyelsahar.io
Intern.
2018
Intern.
Intern.
Research interests:
Controlled NLG
- Distributional Control of Language
generation
- Energy Based Models, MCMC
- Self-supervised NLG
Domain Adaptation
- Domain shift detection
- Model Calibration
Side Gigs I actively participate in wikipedia
research. I build tools to help editors from
under-resourced language. (like Scribe)
Masakhane A grassroots NLP community for
Africa, by Africans
https://www.masakhane.io/
2013
2014
Masters
2015
Side gigs
@hadyelsahar

Great Resources Online
That helped developing this tutorial
Lena Voita - NLP Course | For You
https://lena-voita.github.io/nlp_course.html
CS224n: Natural Language Processing with Deep Learning
Stanford / Winter 2021
http://web.stanford.edu/class/cs224n/
Lisbon Machine Learning School
http://lxmls.it.pt/2020/
Speech and Language Processing [Book]
Dan Jurafsky and James H. Martin
https://web.stanford.edu/~jurafsky/slp3/
Courses Books
You are smart & the Internet is full of great
resources. But might be also confusing, many of the
parts in this tutorials took me hours to grasp. I am
only here to make it easier for you :)
Nothing in this tutorial you cannot ﬁnd online

What are we going to Learn
today?
- Introduction to Language Modeling
- Recurrent Neural Networks (RNN)
- How to generate text from Neural
Networks
- How to Evaluate Language Models
Seq2seq Models
- Conditional Language Model
- Seq2seq with Attention
- Transformers
Open Problems of NLG Stochastic Parrots
Part 1: Language Modeling Part 2: Stuff with Attention

Part 1: Language Modeling
- N-gram Language Models
- Neural Language models
- Generating text from Language Models (fun!)
- Temperature Scaling
- Top-k / Nucleus Sampling
- LM Evaluation:
- Cross Entropy, Perplexity

Language Modeling
What is language modeling?
Assigning probabilities to sequences of words (or tokens).
P(“The cat sits on the mat”) = 0.0001
P(“Imhotep was an Egyptian chancellor to the Pharaoh Djoser”) = 0.004
P(“”Zarathushtra was an ancient spiritual leader who founded Zoroastrianism.) = 0.005
Why on earth? ...

Language Modeling
Language models are used everywhere!
Search Engines
P(“global warming is caused by”) = 0.1
P(“global warming is a phenomenon related to ”) = 0.05
P(“global warming is due to”) = 0.03
...
You can rank sentences (search Queries) by
their probability

Language Modeling
Language models are used everywhere!
Spell checkers
P(“... is Wednesday ….”) = 0.1
P(“... is Wendnesday …. ”) = 0.000005
...
You can recommend rewritings
based on probability of sequences.

Conditional Language Modeling
We can assign probabilities to sequences of words
Conditioned on a Context
What is a context ?
...
Context can be anything:
Image
● Speech
● Text
● Text in another
language
P(“I am hungry” | “J’ai Faim”) = 0.1
P(“I am happy” | “J’ai Faim”) = 0.002
P(“I was hungry” | “J’ai Faim”) = 0.0002
P(“he am hungry” | “J’ai Faim”) = 0.00001

Conditional Language models are even more popular
Machine translation
This is a notation for conditional
probability
Sequences with
correct translations
are given high probability

Abstractive Summarization
P( “COV D numbers are dropping down” | 📄📄📄) = 0.6
P( “COV D stats in France” | 📄📄📄) = 0.005
P( “ COV D the coronavirus pandemic” | 📄📄📄) = 0.0001
Better Summaries
are given high probabilities

Image captioning
P( “A dog eating in the park” | 🐕 🍔🌳 ) = 0.6
P( “A dog in the park” | 🐕 🍔🌳 ) = 0.03
P( “A cat in the tree” | 🐕 🍔🌳 ) = 0.002
Sequences with
correct captions
are given
high probability
We will learn that with
our ﬁrst language model
Still, How to get
those probabilities
? ...

FYI - Break
Now you know why assigning probabilities to sequences of words is
important. Letʼs see how we can do that. But ﬁrst ..
Language Modeling
(Conditional) Language Modeling
We will start by
this one since it
is simpler
We will get to
this one later

🍼 A Very Naive Language Model
Calculate probability of a sentence given in a large amount of text.
P(“Elephants are the largest existing land animals.”) = ??
A dataset of 100k sentences
Count(“Elephants are the largest existing land animals.”) = 15
P(“Elephants are the largest existing land animals.”) = 15/100k = 0.00015
Mmmm.. What could possibly go wrong?

🍼 A Very Naive Language Model
Letʼs use it for spell checking
All sequences are equally wrong?...
Count (“ My Favourite day of the week is Wednesday ….”) = 0.00000
Count (“ My Favourite day of the week is Wendnesday ….”) = 0.00000
Count ( “ asdal;qw;e@@k__+0$%$^%….”) = 0.00000
P(“ My Favourite day of the week is Wednesday ….”) = 0.00000
P(“ My Favourite day of the week is Wendnesday ….”) = 0.00000
P(““ asdal;qw;e@@k__+0$%$^%….”) = 0.00000
Sequences are not in the dataset
A dataset of 100k
sentences

Question Time
How many unique (valid or invalid)
sentences of length 10 words
can we make out of english language?
If the Number of unique words in english = 1Million

Question Time
If the Number of unique words in english = 1Million
Answer: 1M x 1M x 1M … (10 times) = 1000000 10
= 1 x 10 60
Each time we have 1M word to select from.
How many unique (valid or invalid)
sentences of length 10 words

Question Time
How many unique sentences of
MAX length 10 words
Number of unique words in english = 1Million

Question Time
How many unique sentences of
MAX length 10 words
Number of unique words in english = 1Million
Answer: 1M + 1M 2
+ 1M 3
+ 1M 4
+ …. + 1M 10
= 1.00000160
All sequences of
length 1 word
All sequences of
length 3 word

Combinatorial explosion
In Language Generation
Log
scale
Number of possible english
sentences of length 50 words is
~ 1x 10660
Number of atoms in the universe
~ 1x 1082
No dataset can have such
number of sentences. Most of
sentences will have zero
probabilities.

Using the chain rule, as follows:
P(Elephants are smart animals) =
P(Elephants) x P(are |Elephants ) x P(smart | Elephants , are) x P(animals| Elephants , are , smart)
These are quite easy to ﬁnd in a limited
size corpus.
Part of the problem is already solved!
This are still hard to ﬁnd, we will learn
how to deal with those at a later point.
Atomic units of calculating probabilities
Became words instead
of full sentences
From Sequence Modeling
To modeling the probability of the next word

From Sequence Modeling
To modeling the probability of the next word
Using the chain rule, as follows:
w<t
is the notation for all
Words before time step t
W (in bold) is a
sequence of N
words
w1
w2
,.... wN

FYI - Break
There are terms associated to this method of modeling language:
“Left to Right language modeling” , “autoregressive Language models”
Other ways of modeling language (not discussed in this tutorial).
Bidirectional Language Modeling
Words in sentences are Generated
independently
Right context
Non-autoregressive Language Models

Language Modeling
- N-gram Language Models (Recap)
- LM Evaluation:

N-Gram Language Models

sequence unigrams bigrams trigrams
n=1 n=2 n=3
Elephants are smart animals ● Elephants
● are
● smart
● animals
● Elephants are
● are smart
● smart animals
● Elephants are smart
● are smart animals
What is an
N-gram? ...

P(Elephants are smart animals they are quite big) =
P(Elephants) x
P(are |Elephants ) x
P(smart | Elephants , are) x
P(animals| Elephants , are , smart) x
P(they | Elephants , are , smart, animals ) x
P(are | Elephants , are , smart, animals , they) x
P(quite | Elephants , are , smart, animals , they , are) x
P(big | Elephants , are , smart, animals , they , are, quite)
These are quite easy to ﬁnd and count
in a limited size corpus.
We will learn how to deal with those now!
Recall this problem?

P(Elephants) x
Assumption: N-gram language models uses the assumption that a probability of a word only
depends on it N previous tokens.
P(Elephants) x
Tri-gram language model

P(Elephants) x
Assumption: N-gram language models uses the assumption that a probability of a word only
depends on it N previous tokens.
P(Elephants) x
P(they | are , smart, animals ) x
P(are | smart, animals , they) x
P(quite | animals , they , are) x
P(big | they , are, quite)
Tri-gram language model
Now these are easier to
ﬁnd and count in a
corpus

FYI - Break
Markov Property
Born 14 June 1856 N.S.
Ryazan, Russian Empire
Died 20 July 1922 (aged 66)
Petrograd, Russian SFSR
Nationality Russian
https://en.wikipedia.org/wiki/Markov_property
P(Elephants) x

Language Modeling
- LM Evaluation:

Neural Language models
Recurrent Neural Networks (RNN)

Neural Language Modeling
Remember this?
w<t
is the notation for all
Words before time step t
* empty sequence
We are going to learn a neural network for this
Ө are the learnable params of
the Neural network.
Animation from NLP course | For you by Lena Voita
https://lena-voita.github.io/nlp_course/language_modeling.html

Language Modeling using Feed Fwd NN
Ө are the learnable params of
the Neural network.
Өe
Өw
1 hot encoding of
Words
Embedding
layer
Word
embeddings
;
|V| x
hembed
L x |V|
L x hembed
L hembed
x 1
concatenate
L hembed
x |V|
T
|V| x 1 |V| x 1
softmax
Projection
layer
These are
usually called
logits
Prob distribution over words
in vocab
Can we use a Feed Fwd Neural Network?
elephants
are
the
smartest
PӨ
(animals | elephants, are, the, smartest) = ?
animals = 0.1
Fixed width
Of L tokens

Өe
Өw
1 hot encoding of
Words
Embedding
layer
Word
embeddings
;
|V| x
hembed
L x |V|
L x hembed
L hembed
x 1
concatenate
L hembed
x |V|
T
|V| x 1 |V| x 1
softmax
Projection
layer
These are
usually called
logits
in vocab
Can we use a Feed Fwd Neural Network?
elephants
are
the
smartest
PӨ
(animals | elephants, are, the, smartest) = ?
animals = 0.1
Fixed width
Of L tokens
Here L = 4
PӨ
(they| elephants, are, the, smartest, animals) → L > 4 (not possible)
PӨ
(they| elephants, are, the, smartest, animals) → Markov assumption (now possible)
PӨ
(the| elephants, are) → L < 4 (not possible)
PӨ
(they|<MASK> , <MASK> , elephants , are) → adding dummy tokens (now possible)
In practice this isn’t a good idea. The Problem with ﬁxed length input

1. Recurrent (calculated repeatedly)
2. Great with Languages !
3. Have an internal memory called “Hidden state”
4. Can model inﬁnite length sequences
5. No need for markov assumption
Recurrent Neural Networks
1 hot encoding of
Words
1 x |V|
h1
h<BOS>
elephants
RNN
+
P(are | elephants) = 0.1
are
RNN
+
P(smart| elephants , are) = 0.2
h2
h3
…….
…….
Wi
Wo
1 hot encoding of
Words
Embedding
layer
Word
embeddings
|V| x Hembed
1 x |V|
1 x Hembed
1 x |V|
softmax
V
1 x |V|
Output
Embedding
layer
|V| x Hstate
+
Hidden States
ht
1 x Hstate
Previous hidden
sate
ht-1
logits
RNN
U
The core RNN is considered
only this part. As it operates
on sequences of continuous
vectors
elephants
are = 0.1
Probability of each word in the
vocabulary
Select the one corresponds to
the next token
smart
RNN
+
P(animals| elephants , are, smart) = 0.5
logits
xt
x.
t
yt
ŷt

Wi
Wo
1 hot encoding of
Words
Embedding
layer
Word
embeddings
|V| x Hembed
1 x |V|
1 x Hembed
1 x |V|
softmax
ŷt
V
1 x |V|
Output
Embedding
layer
|V| x Hstate
+
Hidden States
ht
1 x Hstate
Previous hidden
sate
ht-1
RNN
For simplicity the ﬁgure doesn’t include the bias
terms b and c
U
Embed 1 hot vectors to
word embeddings.
Wi
could be trained or
kept frozen
Calculation of the hidden
state representations
ht
depends on ht-1
Project the hidden state
into the output space.
Calculate probabilities
out of the logits
logits
xt
x.
t
yt

Wi
Wo
1 hot encoding of
Words
Embedding
layer
Word
embeddings
|V| x Hembed
1 x |V|
1 x Hembed
1 x |V|
softmax
ŷt
V
1 x |V|
Output
Embedding
layer
|V| x Hstate
+
Hidden States
ht
1 x Hstate
Previous hidden
sate
ht-1
logits
RNN
terms b and c
xt
U
Embed 1 hot vectors to word
embeddings.
W
i
could be trained or kept
frozen
yi
t
x.
t
Calculation of the hidden
state representations
ht
depends on ht-1
Project the hidden state into
the output space.
Calculate probabilities
out of the logits
Wi
Wo
V U
are trainable
parameters
Ok but how to
train them ?
yt
Prob. of word i at time step t
given by the model

Training Recurrent Neural Networks
(data preprocessing )
African elephants are the largest land animals on
Earth. They are slightly larger than their Asian
cousins. They can be identiﬁed by their larger ears ….
1) Collect large amount of free text

Earth.
They are slightly larger than their Asian cousins.
They can be identiﬁed by their larger ears.
1) Collect large amount of free text 2) split into chunks ( e.g. sentence level tokenization)

Earth.
<s> African elephants are the largest land animals on Earth . </s>
<s> They are slightly larger than their Asian cousins . </s>
<s> They can be identiﬁed by their larger ears . </s>
3) split into tokens
(e.g. word, char, sub-word units)
Add start of seq and end of seq tokens
<s> </s>

Earth.
<s> </s>
0 <s>
1 </s>
2 They
3 on
4 elephants
5 African
6 larger
..
1200 ears
4) Build a vocabulary V
More on
tokenization in
the next lecture

Earth.
<s> </s>
0 <s>
1 </s>
2 They
3 on
4 elephants
5 African
6 larger
..
1200 ears
4) Build a vocabulary V
More on
tokenization in
the next lecture
- [0, 2 , 4 ,5, 6, 7, 8, 8, 101, 22, 1]
- [0, 22, 45, 65, 78, 9, 3, 4, 2, 1]
- [0, 1, 23, 3, 4, 5, 65, 7, 7, 8, 1]
4) Index training data
Each word (token) can be represented as a
one hot vector now!

[0, 2 , 4 ,5, 6, 10]
<s>
African
cost
<s> African elephants are smart </s>

[0, 2 , 4 ,5, 6, 10]
<s>
African elephants
cost cost

[0, 2 , 4 ,5, 6, 10]
<s> African elephants
African elephants are
cost cost cost

[0, 2 , 4 ,5, 6, 10]
<s> African elephants are
African elephants are smart
cost cost cost cost

[0, 2 , 4 ,5, 6, 10]
<s> African elephants are smart
African elephants are smart </s>
cost cost cost cost cost
Cross
entropy loss

[0, 2 , 4 ,5, 6, 10]
Cross
entropy loss
Loss function
Gradients updates
through
backpropagation

[0, 2 , 4 ,5, 6, 10]
Cross
entropy loss
Loss function
Gradients updates
through
backpropagation
How to calculate
the cross
entropy loss?

Cross Entropy
Claude Shannon Page on wikipedia
The “surprisal” of PӨ
(empirical distribution) for samples generated from D (the true
data distribution).
Cross Entropy can also be seen as a “closeness" measure between two distributions.
Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951.
Cross entropy
direction
matters

Cross Entropy loss (-ve log likelihood)
cost
Output distribution
of tokens in the vocab
By your RNN at time
step t
yn
t i
= 1 if i is the correct token
yn
t i
= 0 if i is not the correct token
yn
t i
Training Example
Position in the vocab , logits,
probability vectors or 1 hot
vectors
Time step t
pn
t i
prob of the model
to the token i in example n and
time step t
Let’s simplify the notation:
yn
t i_correct =
yn
t
= 1
pn
t i_correct =
pn
t
No need
to write
Prob of correct
token by the model
Prob of correct
token given
previous context
Prob of correct
sequence
Negative Log Likelihood
N = number of training examples

Maximum log likelihood loss
Negative Log Likelihood
Minimize loss → minimize Negative log likelihood → Maximum log likelihood Estimation (MLE)
This objective is usually written like that
Probability of the true sequence y given a language
model parameterized by Ө
Find the parameters Ө that
minimizes the -ve log likelihood
Usually done using SGD

Practical Tips (parameter sharing)
Wi
Wo
1 hot encoding of
Words
Embedding
layer
Word
embeddings
|V| x Hembed
1 x |V|
1 x Hembed
1 x |V|
softmax
ŷt
V
1 x |V|
Output
Embedding
layer
|V| x Hstate
+
Hidden States
ht
1 x Hstate
Previous hidden
sate
ht-1
logits
RNN
Embedding layer
|V| x Hembed
Output
Embedding layer
|V| x Hstate
terms b and c
xt
U
yi
t
x.
t
yt
Prob. of word i at time step t
given by the model
Wi
Wo
FYI - Break
It is a common practice to unify the embedding sizes Across the whole
network. Hstate
= Hembed
This makes both input / output embedding layers have the same
dimensionality. You can tie their weights to reduce parameter size of the
RNN . (this is called weight tying / parameter sharing)
Share both as one
matrix

RNNs enjoy great ﬂexibility
FYI - Break
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

The last “hidden state” of a Recurrent Neural Networks could be a
sentence repsententation, that can be used later for many tasks e.g.
Classiﬁcation.
FYI - Break
Feed Fwd
Neural
Network
Sentiment = positive
Classiﬁcation loss

You can Stack Recurrent Neural Networks
FYI - Break
RNN RNN RNN RNN
RNN
RNN
RNN
RNN
Layer 2 RNN runs over a
sequence of “hidden states”
(not softmax output)
of Layer 1 RNNs

Can I use this RNN to
generate text?
Now we know how to
model text probabilities
using Recurrent Neural
Networks.
Decoding | Inference
After all, it is called Neural Language “Generation”

Let’s see a
demo !!
Demo Break
Write With Transformer
https://transformer.huggingface.co/doc/gpt
https://beta.openai.com/playground

Language Modeling
- Generating text from Language Models
- Greedy Decoding
- LM Evaluation:

<s>
RNNs output a categorical distribution over Tokens in the vocab at each step
pӨ
( * | <s>)
Elephants
I
He
They
….
smart
animals
Giraﬀes
Think
</s>
….
….
0.4
0.03
0.02
0.001
….
0.02
0.001
0.016
0.2
0.001
….
….
Auto-regressive decoding

RNNs output a Multinomial distribution over Tokens in the vocab at each step
1) Select next token (we will see later diﬀerent selection methods)
Elephants
He
They
I
….
smart
animals
Giraﬀes
Think
</s>
….
….
0.4
0.03
0.02
0.001
….
0.02
0.001
0.016
0.2
0.001
….
….
<s>

1) Select next token
Elephants
He
They
I
….
smart
animals
Giraﬀes
Think
</s>
….
….
0.4
0.03
0.02
0.001
….
0.02
0.001
0.016
0.2
0.001
….
….
<s>
Elephants

2) Feed selected token (auto-regressively) to the RNN and calculate pӨ
( * | <s>, I)
Elephants
He
They
I
….
are
animals
Giraﬀes
Think
</s>
….
….
0.0002
0.03
0.02
0.001
….
0.2
0.001
0.016
0.2
0.001
….
….
<s> Elephants
Elephants

3) Select next token
Elephants
He
They
I
….
are
animals
Giraﬀes
Think
</s>
….
….
0.0002
0.03
0.02
0.001
….
0.2
0.001
0.016
0.2
0.001
….
….
<s> Elephants
Elephants
are

Repeat the process until end of sequence token </s> or max length is reached.
<s> Elephants
Elephants are

<s> Elephants are
Elephants are smart

<s> Elephants are smart
Elephants are smart animals

<s> Elephants are smart animals
Elephants are smart animals </s>

How to “Select” the next token given a categorical distribution over Tokens in the
vocab.
Elephants
He
They
I
….
are
animals
Giraﬀes
Think
</s>
….
….
0.0002
0.03
0.02
0.001
….
0.2
0.001
0.016
0.2
0.001
….
….
The maximum value?

vocab.
Local Global
Deterministic
Stochastic / Random
MAP
Beam Search
Ancestrall (pure) Sampling
Greedy decoding
Top k
Sampling
Nucleus
Sampling
Sampling
with
Temperature

vocab.
Local Global
Deterministic
Stochastic / Random
Greedy decoding MAP
Beam Search
Top k
Sampling
Nucleus
Sampling
Sampling
with
Temperature

Greedy Decoding
Select the token with max probability
Elephants
He
They
I
….
are
animals
Giraﬀes
Think
</s>
….
….
0.0002
0.03
0.02
0.001
….
0.25
0.001
0.016
0.2
0.001
….
….

Question Time
During Greedy decoding, at each step the most likely token is selected:
Will that generate the highest likely sequence overall?
a) Yes, always
b) No (but it could happen)
c) Never

Answer
“local” vs “Global” likelihood in sequence generation.
a
b
a
b
*
0.7
0.3
0.55
0.45
a
b
a
b
0.45
Imagine vocabulary of two tokens “a” and “b”
Run greedy decoding for 3 time steps.
Selected sequence “a b b”
P(“a b b”) = 0.6 * 0.55 * 0.55 = 0.1815
Other sequences could have globally higher
probability:
P(“b a b”) = 0.4 * 0.8 * 0.9 = 0.288
a
b
0.1
0.2
0.8
0.9
0.55

Question Time
Given a trained Language model pӨ
If we run greedy decoding for the context: “<s>” until “</s>” is obtained.
We repeat this process 1000 times.
How many unique sequences will be obtained?
a) 1000
b) Inﬁnity
c) 1
d) 42
e) 75000

Ancestral (Pure) Sampling
Also called “Pure” Sampling, Standard Sampling, or just Sampling.
Sampling is stochastic (random) ≠ deterministic
Elephants
He
They
I
….
are
animals
Giraﬀes
Think
</s>
….
….
0.0002
0.03
0.02
0.001
….
0.25
0.001
0.016
0.2
0.001
….
….
Pure sampling will obtain “unbiased”
samples.
I.e. distribution of generated sequences
matches the Language model
distribution over sequences.

Question Time
During Pure Sampling, at each step x is sampled:
Will that generate the highest likely sequence overall?
a) Yes, always
b) No (but it could happen)
c) Never

Question Time
Given the following conditional probabilities of a trained language
model, If we run pure sampling 10000 times and Greedy 1000 times.
How many times the sequence “b b” will be obtained:
a) 0 using Greedy & 100 using sampling
b) 100 using Greedy & 10 using sampling
c) 10 times Greedy & 10 using sampling
d) 0 times Greedy & 10000 using sampling
a
b
a
b
*
0.9
0.55
0.45
a
b
0.1
0.9
0.1
Start of
sequence
(empty)
p(b| *)
p(b| b * )
p(a| *)
p(a| a *)
p(b | a *)

Ancestral (Pure) Sampling
Pros
- Diversity in generations
(not always the same sequence)
- Generated samples reﬂect the
Language Model probability
distribution of sequences (Unbiased).
Cons
Pure sampling sometimes lead to incoherent
text
Fig. from THE CURIOUS CASE OF NEURAL TEXT DeGENERATION (hotlzman et al. 2020)

Sampling with Temperature
Lowering (< 1 ) the temperature of the softmax will make the the distribution peakier
I.e. less likely to sample from unlikely candidates
Higher temperature produces a softer probability distribution over tokens,
resulting in more diversity and also more mistakes.
Divide the logits by a
temperature
(Constant) value T
As T decreases
( yi
/ T ) increases
|V| x 1 |V| x 1
softmax
These are
usually called
logits
in vocab
p(xt
| x<t
) = 0.1
➗ T
Good read on the topic https://medium.com/@majid.ghafouri/why-should-we-use-temperature-in-softmax-3709f4e0161

Try it yourself: https://lena-voita.github.io/nlp_course/language_modeling.html#generation_strategies_temperature
Lowering (< 1 ) the temperature of the softmax will make the the distribution peakier
I.e. less likely to sample from unlikely candidates

Try it yourself: https://lena-voita.github.io/nlp_course/language_modeling.html#generation_strategies_temperature
Higher temperature (> 1) produces a softer probability distribution over
tokens, resulting in more diversity and also more mistakes.

Let’s see a demo
!!
Demo

Top-k Sampling
Hierarchical Neural Story Generation (Fan et al. 2018)
Elephants
He
They
I
are
animals
Giraﬀes
Think
</s>
…….
…...
0.3
0.25
0.1
0.1
0.03
0.02
0.01
0.01
0.01
….
….
K = 4
0.3
0.25
0.1
0.1
0.399
0.333
0.133
0.133
normalize Sample
At each timestep, randomly sample from the k most likely candidates from the token distribution
He

Top-k Sampling
Hierarchical Neural Story Generation (Fan et al. 2018)
Elephants
He
They
I
are
animals
Giraﬀes
Think
</s>
…….
…...
0.3
0.25
0.1
0.1
0.03
0.02
0.01
0.01
0.01
….
….
K = 4
A fixed size k in Top-k sampling is not always a good idea:
Elephants
He
They
I
are
animals
Giraﬀes
Think
</s>
…….
…...
0.6
0.3
0.02
0.02
0.02
0.02
0.01
0.01
0.01
….
….
K = 4

Top-p (Nucleus) Sampling
Elephants
He
They
I
are
animals
Giraﬀes
Think
</s>
…….
…...
0.3
0.2
0.1
0.05
0.03
0.02
0.01
0.01
0.01
….
….
p = 0.6
0.3
0.2
0.1
0.5
0.333
0.166
normalize Sample
Elephants
Sample from the top p % of the probability mass
THE CURIOUS CASE OF NEURAL TEXT DeGENERATION (hotlzman et al. 2020)

Top-p (Nucleus) Sampling
Elephants
He
They
I
are
animals
Giraﬀes
Think
</s>
…….
…...
0.3
0.3
0.5
0.1
0.03
0.02
0.01
0.01
0.01
….
….
Top-p = Adaptive top-k
Elephants
He
They
I
are
animals
Giraﬀes
Think
</s>
…….
…...
0.4
0.4
0.02
0.02
0.02
0.02
0.01
0.01
0.01
….
….
p = 0.8
p = 0.8
THE CURIOUS CASE OF NEURAL TEXT DeGENERATION (hotlzman et al. 2020)

vocab.
Local Global
Deterministic
Stochastic / Random
MAP
Beam Search
Greedy decoding
Top k
Sampling
Nucleus
Sampling
Sampling
with
Temperature
Next Lecture!

Demo again!
Demo Break

Language Model Evaluation
& More about “Cross-Entropy”

Language Modeling Evaluation
For classification Higher Accuracy is always better:
● Accuracy = 80% Better than 60%
But for language modeling?
A good read on the topic by Chip Huyen: https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
Language Model Is that
good?
P(x) = 0.001
Test set x

Language Modeling Evaluation
Intrinsic Metrics
● perplexity
● cross entropy
● bits-per-character (BPC)
A good read on the topic by Chip Huyen: https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
Extrinsic Metrics
Language Model
X ~ PӨ
Grammatically correct
Fluent
Coherent
What are
these?

Background: Entropy (information theory)
Imagine a process that generates samples e.g. Language Model PӨ
that generates a
sequence xi
~ PӨ
–log(PӨ
(xi
)) is the “surprisal”
PӨ
(xi
)
–log(PӨ
(xi
))
If a model generates samples with low probability it will
be have high surprisal on them
Samples with higher probability the model is
conﬁdent about them (i.e. low surprisal)

Imagine a process that generates samples e.g. Language Model PӨ
that generates a
sequence xi
Entropy is the Expected level of “surprisal”
Entropy is usually
denoted by H
How on average the model
is surprised by its
samples. I.e. how
unconﬁdent it is about its
samples
This is expectations
= “mean value”
Expectations in theory is calculated using inﬁnite samples or
closed form but could be approximated using large N samples

What does it mean that your model has low Entropy?
Q : Imagine a language model that generates only one sample x= “elephants are
smart” with PӨ
(x) = 1 , what is the entropy of this language model H(PӨ
) ?
a) 1
b) zero
c) inﬁnity
● Low entropy tells you that your model
is not random (i.e. learned something)
● But it could be conﬁdent about the wrong
things.

smart” with PӨ
) ?
a) 1
b) zero
c) inﬁnity
No “surprisal” here the model is
so conﬁdent about the only example it generates!

smart” with PӨ
) ?
a) 1
b) zero
c) inﬁnity
No “surprisal” here the model is
so conﬁdent about the only example it generates!
What did you teach us
about it if it is a bad
metric!

Cross Entropy
The “surprisal” of PӨ
for samples generated from D (the true Language distribution).
Cross Entropy can also be seen as a “closeness" measure between two distributions.
Entropy
Cross entropy
direction
matters

Cross Entropy rate per token
Given that we are interested in sentences (sequences of tokens) of length n, we will
use the the entropy rate per token:
Cross entropy rate per token
Cross entropy
This equality is true on in the limit when n is infinitely long, more
details you can see Shannon-McMillan-Breiman theorem
We can estimate the cross-entropy measuring model log prob on a
random sample of sentences or a very large chunk of text.
How do we know the
probability of the
true prob. Of a
sentence in the
whole language?
Large
number of
random
samples

Which log to use ?
In all the previous theory, the entropy and cross
entropy are defined using log base 2 (with "bit" as
the unit),
“Popular machine learning frameworks, implement
cross entropy loss using natural log. As it is faster
to compute natural log as opposed to log base 2.”
It is often not reported in papers which log they
use, but mostly it is safe to assume the “natural
log”
Source: https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
Natural log
base e
Log base 2 is used in
“information” theory due to
its relation with bits and
bytes

Bits per Character BPC
Cross entropy rate per word /char / token
Bits per (character | Token | word)
BPC BPT BPW
If log base 2 is used, Cross-Entropy per
word becomes BPW

(Cross) Entropy and Compression
Cross entropy rate per token
Example from :
https://www.inf.ed.ac.uk/teaching/courses/fnlp/lectures/04_slides-2x2.pdf
Imagine two language models PӨ
and Pω
of vocab size
10000 and a random Language model R
Using all language model, compute per-word
cross-entropy of a “very large” sentence D
“Elephants are smart”
CE(D,PӨ
) =
CE(D, Pω
) =
CE(D,R) =
What does
that mean?

(Cross) Entropy and Compression
Example from :
Using all language model, compute per-word cross-entropy of a “very large”
sentence D “Elephants are smart”
CE(D,PӨ
) = 5
CE(D, Pω
) = 11
CE(D,R) = 13.287
Language Models encode some text statistics that can be used for
compression.
If we designed an optimal code based on each model, we could encode the
entire sentence in about:
PӨ
→ 5 x 3 = 15 bits
Pω
→ 11 x 3 = 33 bits
R → 13.287 x 3 = 39.861 bits
ASCII uses an average of 24 bits per word → 24 x 3 = 72 bits

Perplexity (PPL)
LM performance is often reported as perplexity rather than
cross-entropy.
Perplexity is simply:
2cross-entropy
(if CE uses log2)
Or
ecross-entropy
(if CE uses natural log)
6 bits cross-entropy means our model perplexity is 26
= 64: equivalent uncertainty to a uniform
distribution over 64 outcomes. I.e. the language model randomly chooses between 64 random
decisions at each time step.
Example from :
Reminder: cross entropy formula

How to interpret Cross-Entropy / Perplexity / BPC ?
Similar to all evaluation:
- The model could be good or the corpus too easy
- Only use it compare diﬀerent models on the same corpus
FYI - Break
Comparison of GPT-2 (Different model sizes) on language model objective Language
Models are Unsupervised Multitask Learners (Radford et al. 2019)

Entropy of English Language:
- Entropy is the average number of bits to encode the information contained in a random
variable
- CrossEntropy(D, P) is the average number of bits to encode the information contained in a
random variable D encoded using P
- The Entropy (amount of information) in English Language has been a popular topic across
linguists & computer scientists.
FYI - Break

Compression of English Language:
“The Hutter Prize is a cash prize funded by Marcus Hutter which rewards data compression
improvements on enwik9 is the ﬁrst 1,000,000,000 characters of a speciﬁc version of English
Wikipedia. he prize awards 5000 euros for each one percent improvement
(with 500,000 euros total funding)” https://en.wikipedia.org/wiki/Hutter_Prize
FYI - Break

Preplexity Perplexity
Short Break

- Seq2seq Encoder-Decoder Models
- Seq2seq Encoder-Decoder with Attention
- Transformers (Self-Attention)
- Open Problems of NLG Models trained on uncurated
Web text
Part 2: Stuff with Attention

Language Modeling
(Conditional) Language Modeling
Model probabilities of sequences of
words Conditioned on a Context
Context can be anything:
● Text
● Text in another language
● Image
● Speech
We know:
- How to calculate probabilities of sequences
- Recurrent Neural Networks
- How decode (generate) sequences

P(“I am hungry” | “J’ai Faim”) =
P(I | J’ai Faim) x
P( am | I , J’ai Faim) x
P( hungry | I am , J’ai Faim)
Chain Rule applies

Training: Conditional Language
Modeling
Context / input
Previous
generated tokens
Maximum likelihood estimation : Cross Entropy loss (-ve log likelihood)
loss also holds as a training objective

Decoding / Inference
All generation techniques can work in theory, some are more preferred
than others.
Local Global
Deterministic
Stochastic / Random
Greedy decoding MAP
Beam Search
Ancestrall (pure)
Sampling
Top k
Sampling
Nucleus
Sampling
Sampling with
Temperature

Decoding / Inference
All generation techniques can work in theory, some are more preferred
than others.
- In many tasks such as machine translation we care more about accuracy than diversity.
- I.e. ﬁnding the globally highest likely sequence for an input.
Only one output translation y will be
outputted for each x input , but that’s
ﬁne if it is correct.

Language Modeling Conditional Language Modeling
Modeling Modeling
Training Objective Training Objective
Decoding / Inference Decoding / Inference
Greedy, Ancestral sampling, Beamsearch,
Top-k, Nuclear sampling.
Greedy, Ancestral sampling, Beamsearch,
Top-k, Nuclear sampling.

Encoder-Decoder
<s> j’ ai faim
RNN RNN RNN RNN
RNN RNN RNN RNN
<s> am hungry
I
Initialize the decoder hidden state
With the encoder ﬁnal hidden state
Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
Encoder
Decoder

Encoder-Decoder
<s> j’ ai faim
RNN RNN RNN RNN
I am hungry </s>
cost cost cost cost
Cross
entropy
loss
Loss
function
Gradients updates
through
backpropagation
RNN RNN RNN RNN
<s> am hungry
I
In this architecture output of the
softmax layer of the encoder RNN
is not used.

You can actually use the same RNN as an encoder and decoder, however this is the implementation in the original
paper (Sutskever et al. 2014)
“because doing so increases the number model parameters at negligible computational cost and makes it natural to train the RNN
on multiple language pairs simultaneously.”
FYI - Break

- Open Problems of NLG Models trained on uncurated
Web text

Encoder-Decoder (limitations)
<s> j’ ai faim
RNN RNN RNN RNN
I am hungry </s>
cost cost cost cost
RNN RNN RNN RNN
<s> am hungry
I
What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties (Conneau et al. ACL 2018)
The last hidden state is A single vector
representing the whole input (Bottle neck)!
The encoder is not able to compress the
whole sentence into one vector .

Attention Mechanism
<s> j’ ai faim
RNN RNN RNN RNN
I
cost
RNN
<s>
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015)
Effective Approaches to Attention-based Neural Machine Translation (luong et al. 2016)
Attention
Feed fwd
Weighted average of all
encoder hidden states
Decoder hidden
state is used as a query

Attention Mechanism
<s> j’ ai faim
RNN RNN RNN RNN
I am
cost cost
RNN RNN
<s> I
Attention
Decoder hidden
Feed fwd

Attention Mechanism
<s> j’ ai faim
RNN RNN RNN RNN
I am hungry
cost cost cost
RNN RNN RNN
<s> am
I
Attention
Decoder hidden
Feed fwd

Attention Mechanism
<s> j’ ai faim
RNN RNN RNN RNN
I am hungry </s>
cost cost cost cost
RNN RNN RNN RNN
<s> am hungry
I
Attention
Decoder hidden
Feed fwd

Attention Mechanism (Luong et al. 2016)
Query (ht
): the hidden state of the decoder at time step t
Keys and Value: the hidden states of the encoder:
Attention Weights ɑt
(s):
Attention output (ct
): the output of the attention mechanism weighted sum of the Values according to the
attention weights.
Is combined with ht
to predict the decoder output

<s> j’ ai faim
RNN RNN RNN RNN
RNN
<s>
Feed fwd
ht
ct

3 type of Attention Weights calculations:
ht
dot
general
concat
x
x x
trainable
x
trainable
x
The output of all score functions above is a ﬂoat number

3 type of Attention Weights calculations:
ht
dot
general
concat
x
x x
trainable
x
trainable
x
The output of all score functions above is a ﬂoat number
global , local , and
location are three
other combinations
read about them in the
paper.

FYI - Break
In Machine Translation Visualizing Attention weightsis a common practice.
This shows words in the source sentence is important for the output of each word in the target
(alignments).
These alignments are learned end-to-end without explicit alignments between tokens in the source x
and target y sentences.

FYI - Break
Seq2Seq models are one of the landmarks of the Deep Learning revolution for NLP.
But soon got taken over by Self-Attention (transformers), which will see later in the next slides.
2018
GPT
Transformers
2020
2019 2021
2017
2016
2015
2014
Bahdanau et al.15
Luong et al. 16
Sutskever et al.
14
Hermann et al. Neurips 2015
Rush et al. EMNLP2015

Attention Is All You Need
Ashish Vaswani∗ Noam Shazeer∗ Niki Parmar∗ Jakob Uszkoreit∗ Llion Jones∗ Aidan N. Gomez∗ † Łukasz Kaiser∗
Neurips2017
Transformers
“Attention is all you need”
Self attention
Residual
(skip)
connections
Positional
encoding
Seq2seq Seq2Seq with
Attention
Transformers
Encoding input RNN RNN Attention
Decoding output RNN RNN Attention
Encoder-Decoder
interaction
Fixed
vector
Attention Attention
Idea from lena-voita.github.io/nlp_course
N layers encoder and N layers
Decoder.
The last layer of encoder is
connected to all decoder layers
with Enc-dec attention
Decoder Self
attention
Encoder
decoder
attention

Attention Is All You Need
Ashish Vaswani∗ Noam Shazeer∗ Niki Parmar∗ Jakob Uszkoreit∗ Llion Jones∗ Aidan N. Gomez∗ † Łukasz Kaiser∗
Neurips2017
Transformers
Lots of great resources to learn about Transformers:
- Original blogpost from google
- Lena Voita’s NLP For you
- Micheal Phi’s Illustrated guide to transformers
- Jay Alammar’s The illustrated Transformer
- The annotated transformer (Alexander Rush, Vincent Nguyen and Guillaume Klein)
- Karpathy’s minGPT
“Attention is all you need”
- Replace RNN with Attention
- Representing Sequence order using positional encoding
- Two types of attention:
- Self Attention (new!)
- Encoder-Decoder Attention
- Multi-heads for attention
- Skip connections allows better stacking to larger number of
layers
Encoder
decoder
attention
Self attention
Residual
(skip)
connections
Positional
encoding
Decoder Self
attention
N layers encoder and N layers
Decoder.
The last layer of encoder is
connected to all decoder layers
with Enc-dec attention

Transformers
Encoder
decoder
attention
Self attention
Decoder Self
attention
Multi-Head
Attention
Multi-Head Attention consists of several attention layers
running in parallel.
Each one is a “scaled dot-product attention”
The holy grail of
the transformers

Query Key Value
“The concepts come from retrieval
systems. The search engine will map your
query against a set of keys associated with
candidate results in the database, then
present you the best matched videos
(values).”
https://stats.stackexchange.com/a/424127/22327
Multi-head Attention

Encoder
Self
attention
Encoder
decoder
attention
Decoder Self
attention
3 instances of Multi-head attention:
1. Encoder self-attention
2. Decoder Self-attention (Masked)
3. Encoder-Decoder Attention

Encoder
Self
attention
Encoder Encoder Encoder
Encoder “Self-Attention”
Represent each token in
the encoder representation
(input) by attending on
other in the encoder (input)
Instead of encoding the
whole input using an RNN
allow tokens to look at each
other.

Decoder Self
attention
Decoder Decoder Decoder
Decoder “Self-Attention”
Represent previously
generated tokens token in
the decoder by attending on
other in the encoder (input)
Previously this was the
RNN Decoder hidden state.

Encoder
decoder
attention
Encoder Encoder Decoder
Decoder “Self-Attention”
Previously this was using
the RNN Decoder
“hidden state” to do
attention over the encoder
input representation

Multi-head Attention (Deeper look)
Still how does
multi-head attention
work?

Step 1: Linear projection of Key Query and Values
Wv
Wk
Wq
V K Q

Step 2: Dot product of Queries and Keys
K
Q

K
Q
x =
Attention scores
elephants
are
smart
elephants
are
smart
Step 2: Dot product of Queries and Keys
2 5 3
5 1 4
3 4 3
Q
K

2 5 3
5 1 4
3 4 3
Step 3 : Scale down Attention scores
divide by the square root of the dimension of query and key
“We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients”

2 5 3
5 1 4
3 4 3
Step 4 : Softmax of the Scaled Scores
Softmax across the Key dimension
(Each row)
Q
K

0.2 0.5 0.3
0.5 0.1 0.4
0.3 0.4 0.3
Step 4 : Softmax of the Scaled Scores
Q
K

0.2 0.5 0.3
0.5 0.1 0.4
0.3 0.4 0.3
Step 5 : Multiply Softmax to values
V
x
K2
K1
K3
Q1
Q2
Q3
V1
V2
V3
In this example we end up with 3 vectors each corresponds to the return of Q1
, Q2
and
Q3
. Each is a weighted average of the values vectors according to the attention
weights
=

What does it mean to be “Multi-head”
- Multiple parallel heads focus on different things each.
- Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention
with full dimensionality.
linear linear linear
Attention Head 1
V K Q
linear linear linear
Attention Head 2
V K Q
Scale down to
h is the number of parallel
heads (=2 here)
Embeddings of each token in
the input size dmodel
Output of each head
is of size
Concatenate outputs
of each attention
head
Wo
Output of a “2 head” multi-head
attention

Transformer in order
Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder Self
attention
Multi head attention is Multi-Head Attention consists of
several attention layers running in parallel.
1
3
4
Residual
(skip)
connections
2
2
Now we know what does it
mean by “self-attention”
and“ multi-head attention”
Still too many
parts ...
5 Output layer

Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder Self
attention
1
3
4
Residual
(skip)
connections
2
2
5 Output layer

Embedding layer
Elephants are smart
Word embeddings
Positional embeddings
Positional embeddings
Since we are not using RNNs positional embeddings keep word
order information
Elephants are smart
0 1 2
smart are elephants
2 1 0
Odd Index: create a vector using the cos function.
Even index: create a vector using the sin function.
- Motivation: would allow the model to easily learn to attend by
relative positions.
- In practice: Indifferent from learned “positional embeddings”
- Allows longer sequences at test time than those seen during
training

Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder Self
attention
1
3
4
Residual
(skip)
connections
2
2

Encoder Self-attention
Multi-Head
Attention
Elephants are smart
The input will be copied
3 times as Q, K and V

Encoder Residual connections
Elephants
Deep residual learning for image recognition CVPR 2015
https://arxiv.org/pdf/1512.03385.pdf
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
“We provide comprehensive
empirical evidence showing
that these residual networks
are easier to optimize, and
can gain accuracy from
considerably increased
depth”
Residual
(skip)
connections
The signal
ﬂows from
bottom to top
of the layer
(skip).
are smart

Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder Self
attention
1
3
4
Residual
(skip)
connections
2
2
5 Output layer

Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder
Masked Self
attention
1
3
4
Residual
(skip)
connections
2
2
5 Output layer

Decoder (Masked) Self-attention
éléphants sont intelligents
Imagine a machine translation example
Source(input) elephants are smart
Target(output) éléphants sont intelligents
sont
We are are at time step 2:
The Transformer generated a probability distribution
corresponding to the correct token “éléphants”
Now it is expected now to generate a probability
distribution corresponding to the token “sont”
<BOS>

Imagine a machine translation example
Source(input) elephants are smart
Target(output) éléphants sont intelligents
sont
We are are at time step 2:
The Transformer generated a probability distribution
corresponding to the correct token “éléphants”
Now it is expected now to generate a probability
distribution corresponding to the token “sont”
Problem:The answer is already given!
The model will have nothing to predict it
will just echo the input to the output.
<BOS>

sont
Masked Self-Attention prevents the decoder from looking ahead.
This is done inside the decoder
Multi-head self-attention
éléphants
sont
intelligents
éléphants
sont
intelligents
1 4 4 1
4 3 2 1
4 2 3 1
1 1 1 2
Q
K
<BOS>
<BOS>
<BOS>
0 -inf -inf -inf
0 0 -inf -inf
0 0 -inf -inf
0 0 0 -inf
x
Timestep 1: When the target
word is “elephant” you can get
values corresponding to
“<BOS>” from the input
word is “sont”
you can only see “<BOS>” and
“éléphants”
Mask

<EOS>
Masked Self-Attention prevents the decoder from looking ahead.
This is done inside the decoder
Multi-head self-attention
éléphants
sont
intelligents
éléphants
sont
intelligents
1 4 4 1
4 3 2 1
4 2 3 1
1 1 1 2
Q
K
<BOS>
<BOS>
<BOS>
0 -inf -inf -inf
0 0 -inf -inf
0 0 0 -inf
0 0 0 0
x
word is “ <EOS>”
you can only see the whole
input.

Decoder-Encoder Attention
Encoder
decoder
attention
<BOS>
Elephants are smart Elephants are smart
Embeddings of the
last layer of the
encoder are used as
keys and values
Embeddings of
the *layer N* of
the Decoder are
used as keys and
values

Encoder
Self
attention
Encoder
decoder
attention
Decoder Self
attention
1
3
5
Residual
(skip)
connections
2
Almost there!
Output layer

<BOS>
Output Layer + Loss Function
Output of each
transformer layer is
similar dimension to the
encoded input
Linear
Transformation
x =
dmodel
dmodel
|Vocab|
|Vocab|
|Vocab|
softmax =
éléphants
sont
intelligents
<EOS>
No RNN here feed all output tokens at once and
calculate loss MASKED self-attention will take care
of illegal connections
cost
cost
cost
cost
Reference sentence
(1 hot encoded)
Reference sent
delayed by 1
time step
Same Cross entropy
loss

Question Time
Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder Self
attention
1
3
4
Residual
(skip)
connections
2
2
5 Output layer

Q: Transformers are seq2seq can we
use them for unconditional LM?

Yes but redundant
Elephants are Elephants are
Smart

GPT
Yes but redundant
Masked
Multi-Head
Attention
Add &Norm
Add & Norm
Feed Forward
Decoder only Transformer
Elephants are Elephants are
Smart
- No Encoder
- No Decoder-Encoder
Attention
Last Technical slide

Q: How such “transform”ative ideas come up?
A: Good teamwork between 8 authors
FYI - Break
“Equal contribution. Listing order is random.
Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea.
Ashish, with Illia, designed and implemented the first Transformer models.
Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position
representation.
Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and
tensor2tensor.
Llion also experimented with novel model variants, our initial codebase, and efficient inference and
visualizations.
Lukasz and Aidan designing various parts of and implementing tensor2tensor library”

- Open Problems of NLG Models trained
on uncurated Web text

Large Language Models
And Their Dangers

177
2018
GPT Gebru et al. FACT2021
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Sutskever et
al. 14
Rush et al. EMNLP2015
GPT
“Our approach is a combination
of two existing ideas:
transformers and unsupervised
pre-training.”
GPT was created
Originally for NLU !

178
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Sources: The Guardian, the next web
“As an experiment in responsible
disclosure, we are instead releasing a
much smaller model for researchers to
experiment with.”

179
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Big claims on
unprecedented
generation
capabilities!

180
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Sources: The Guardian, the next web

181
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Criticisms to Neural Language Generation !
Neural Unicorns
should be put
on a leash

182
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Holtzman et al. ICLR2020
Degeneration

183
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
“
NLG
Trends
2015
2014
Bahdanau et al.15
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Symbolic AI & Semantic Correctness
“for a human or a machine to
learn a language, they must
solve what Harnad (1990) calls
the symbol grounding
problem.”
Form vs Meaning
Fluency ≠ Semantic Correctness
O observes that certain words tend to occur in similar
contexts .. learns to generalize across lexical patterns
by hypothesizing that they can be used
interchangeably.
O has never observed these objects, and thus would not
be able to pick out the referent of a word when
presented with a set of (physical) alternatives.

Petroni et al. EMNLP 2019
184
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Sutskever et
al. 14
Jiang et al. TACL 2020
Seq2Seq
Attention
Abstractive summ.
Factual Correctness
Kassber al. ACL 2020

185
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Discussions around AI Ethics 🚨
Gender Shades: Buolamwini 2017

186
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Microsoft Tay
(racist Chatbot)
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
source: MIT Technologyreview
source: https://twitter.com/minimaxir/

187
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Microsoft Tay
(racist Chatbot)
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
It is not about only single examples. Itʼs also a Distributional Bias
Abubakar abid
keynote at
#muslimsinAI
workshop in
Neurips 2020
https://twitter.com/shakir_za/status/1336335755656929288?lang=en
https://twitter.com/abidlabs/status/1291165311329341440?lang=en

188
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Microsoft Tay
(racist Chatbot)
Sutskever et
al. 14
Prates al. Neural computation 2019
Seq2Seq
Attention
Abstractive summ.
Distributional Bias

189
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Microsoft Tay
(racist Chatbot)
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Distributional Bias
Stanovsky et al. ACL2019

190
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Microsoft Tay
(racist Chatbot)
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Distributional Bias (Open ended NLG )
Sheng et al. EMNLP 2019
Sentiment

191
2018
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Microsoft Tay
(racist Chatbot)
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Distributional Bias (cloze style)
Nadeem et al. 2020

192
2018
GPT Gebru et al. FACcT2021
Stochastic
parrots
Transformers
2020
2019
2021
2017
2016
NLG
Trends
2015
2014
Bahdanau et al.15
Microsoft Tay
(racist Chatbot)
Sutskever et
al. 14
Seq2Seq
Attention
Abstractive summ.
Timnit Gebru [Left] Margaret Mitchell [right], Were fired from google over “The Stochastic parrots” paper
https://www.wired.com/story/second-ai-researcher-says-fired-google/
Read the paper (Bender et al. FAccT21)

Huh .. ?
Hady Elsahar
@hadyelsahar
Hady.elsahar@naverlabs.com
That’s All folks
Great
Help me to make this tutorial better please participate
in this anonymous survey :
https://forms.gle/Xr93EFiY2zStksMK8
Also reach out for
feedback or questions

Neural Language Generation Head to Toe

More Related Content

What's hot

Similar to Neural Language Generation Head to Toe

Recently uploaded

Neural Language Generation Head to Toe