NLG Head to Toe @hadyelsahar
Neural Language Generation: Head to Toe
Hady Elsahar
@hadyelsahar
Hady.elsahar@naverlabs.com
NLG Head to Toe @hadyelsahar
About Me
http://hadyelsahar.io
Timeline: Masters (2013–2015), internships (2018), PhD (2019), Research Scientist (2020).
Research interests:
Controlled NLG
- Distributional control of language generation
- Energy-Based Models, MCMC
- Self-supervised NLG
Domain Adaptation
- Domain shift detection
- Model calibration
Side gigs: I actively participate in Wikipedia research and build tools to help editors of under-resourced languages (like Scribe).
Masakhane: a grassroots NLP community for Africa, by Africans. https://www.masakhane.io/
NLG Head to Toe @hadyelsahar
Great Resources Online
that helped in developing this tutorial
Lena Voita - NLP Course | For You
https://lena-voita.github.io/nlp_course.html
CS224n: Natural Language Processing with Deep Learning
Stanford / Winter 2021
http://web.stanford.edu/class/cs224n/
Lisbon Machine Learning School
http://lxmls.it.pt/2020/
Speech and Language Processing [Book]
Dan Jurafsky and James H. Martin
https://web.stanford.edu/~jurafsky/slp3/
Courses & Books
You are smart & the Internet is full of great resources. But it can also be confusing; many of the parts in this tutorial took me hours to grasp. I am only here to make it easier for you :)
There is nothing in this tutorial that you cannot find online.
NLG Head to Toe @hadyelsahar
What are we going to learn today?
Part 1: Language Modeling
- Introduction to Language Modeling
- Recurrent Neural Networks (RNN)
- How to generate text from Neural Networks
- How to evaluate Language Models
Part 2: Stuff with Attention
- Seq2seq Models
- Conditional Language Models
- Seq2seq with Attention
- Transformers
- Open Problems of NLG: Stochastic Parrots
NLG Head to Toe @hadyelsahar
Part 1: Language Modeling
- Introduction to Language Modeling
- N-gram Language Models
- Neural Language models
- Recurrent Neural Networks (RNN)
- Generating text from Language Models (fun!)
- Temperature Scaling
- Top-k / Nucleus Sampling
- LM Evaluation:
- Cross Entropy, Perplexity
NLG Head to Toe @hadyelsahar
Language Modeling
What is language modeling?
Assigning probabilities to sequences of words (or tokens).
P(“The cat sits on the mat”) = 0.0001
P(“Imhotep was an Egyptian chancellor to the Pharaoh Djoser”) = 0.004
P(“Zarathushtra was an ancient spiritual leader who founded Zoroastrianism.”) = 0.005
Why on earth? ...
NLG Head to Toe @hadyelsahar
Language Modeling
Language models are used everywhere!
Search Engines
P(“global warming is caused by”) = 0.1
P(“global warming is a phenomenon related to ”) = 0.05
P(“global warming is due to”) = 0.03
...
You can rank sentences (search Queries) by
their probability
NLG Head to Toe @hadyelsahar
Language Modeling
Language models are used everywhere!
Spell checkers
P(“... is Wednesday ….”) = 0.1
P(“... is Wendnesday …. ”) = 0.000005
...
You can recommend rewritings
based on probability of sequences.
NLG Head to Toe @hadyelsahar
Conditional Language Modeling
We can assign probabilities to sequences of words
Conditioned on a Context
What is a context? Context can be anything:
● Image
● Speech
● Text
● Text in another language
P(“I am hungry” | “J’ai Faim”) = 0.1
P(“I am happy” | “J’ai Faim”) = 0.002
P(“I was hungry” | “J’ai Faim”) = 0.0002
P(“he am hungry” | “J’ai Faim”) = 0.00001
NLG Head to Toe @hadyelsahar
Conditional Language Modeling
Conditional Language models are even more popular
Machine translation
P(“I am hungry” | “J’ai Faim”) = 0.1
P(“I am happy” | “J’ai Faim”) = 0.002
P(“I was hungry” | “J’ai Faim”) = 0.0002
P(“he am hungry” | “J’ai Faim”) = 0.00001
This is a notation for conditional
probability
Sequences with
correct translations
are given high probability
NLG Head to Toe @hadyelsahar
Conditional Language Modeling
Conditional Language models are even more popular
Abstractive Summarization
P(“COVID numbers are dropping down” | 📄📄📄) = 0.6
P(“COVID stats in France” | 📄📄📄) = 0.005
P(“COVID the coronavirus pandemic” | 📄📄📄) = 0.0001
Better Summaries
are given high probabilities
NLG Head to Toe @hadyelsahar
Conditional Language Modeling
Conditional Language models are even more popular
Image captioning
P( “A dog eating in the park” | 🐕 🍔🌳 ) = 0.6
P( “A dog in the park” | 🐕 🍔🌳 ) = 0.03
P( “A cat in the tree” | 🐕 🍔🌳 ) = 0.002
Sequences with
correct captions
are given
high probability
We will learn that with our first language model.
Still, how do we get those probabilities? ...
NLG Head to Toe @hadyelsahar
FYI - Break
P(“The cat sits on the mat”) = 0.0001
Now you know why assigning probabilities to sequences of words is
important. Letʼs see how we can do that. But first ..
Language Modeling
P(“I am hungry” | “J’ai Faim”) = 0.1
(Conditional) Language Modeling
We will start with this one since it is simpler.
We will get to this one later.
NLG Head to Toe @hadyelsahar
🍼 A Very Naive Language Model
Calculate the probability of a sentence given a large amount of text.
P(“Elephants are the largest existing land animals.”) = ??
A dataset of 100k sentences
Count(“Elephants are the largest existing land animals.”) = 15
P(“Elephants are the largest existing land animals.”) = 15/100k = 0.00015
Mmmm.. What could possibly go wrong?
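As a concrete illustration, here is a minimal sketch of this naive counting model (the variable name `corpus`, a list of sentence strings, is hypothetical):

```python
from collections import Counter

def naive_lm(corpus):
    """Naive LM: probability of a sentence = its count in the corpus / corpus size."""
    counts = Counter(corpus)             # corpus: e.g. a list of 100k sentence strings
    total = len(corpus)
    def prob(sentence):
        return counts[sentence] / total  # e.g. 15 / 100000 = 0.00015
    return prob

# prob = naive_lm(corpus)
# prob("Elephants are the largest existing land animals.")
```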
NLG Head to Toe @hadyelsahar
🍼 A Very Naive Language Model
Letʼs use it for spell checking
All sequences are equally wrong?...
Count(“My favourite day of the week is Wednesday …”) = 0
Count(“My favourite day of the week is Wendnesday …”) = 0
Count(“asdal;qw;e@@k__+0$%$^% …”) = 0
P(“My favourite day of the week is Wednesday …”) = 0.00000
P(“My favourite day of the week is Wendnesday …”) = 0.00000
P(“asdal;qw;e@@k__+0$%$^% …”) = 0.00000
Sequences are not in the dataset
A dataset of 100k
sentences
NLG Head to Toe @hadyelsahar
Question Time
How many unique (valid or invalid)
sentences of length 10 words
can we make out of english language?
If the Number of unique words in english = 1Million
NLG Head to Toe @hadyelsahar
Question Time
If the Number of unique words in english = 1Million
Answer: 1M × 1M × … (10 times) = 1,000,000^10 = 1 × 10^60
Each time we have 1M words to select from.
How many unique (valid or invalid)
sentences of length 10 words
can we make out of english language?
NLG Head to Toe @hadyelsahar
Question Time
How many unique sentences of
MAX length 10 words
can we make out of english language?
Number of unique words in english = 1Million
NLG Head to Toe @hadyelsahar
Question Time
How many unique sentences of
MAX length 10 words
can we make out of english language?
Number of unique words in english = 1Million
Answer: 1M + 1M^2 + 1M^3 + 1M^4 + … + 1M^10 ≈ 1.000001 × 10^60
(1M = all sequences of length 1 word, 1M^3 = all sequences of length 3 words, and so on.)
NLG Head to Toe @hadyelsahar
Combinatorial explosion
In Language Generation
[Plot (log scale): the number of possible sentences grows exponentially with sentence length.]
Number of possible English sentences of length 50 words ≈ 1 × 10^660
Number of atoms in the universe ≈ 1 × 10^82
No dataset can contain such a number of sentences. Most sentences will have zero probability.
NLG Head to Toe @hadyelsahar
From sequence modeling to modeling the probability of the next word
Using the chain rule, as follows:
P(Elephants are smart animals) =
P(Elephants) x P(are | Elephants) x P(smart | Elephants, are) x P(animals | Elephants, are, smart)
The short-context terms are quite easy to find in a limited-size corpus: part of the problem is already solved!
The long-context terms are still hard to find; we will learn how to deal with those at a later point.
The atomic units for calculating probabilities became words instead of full sentences.
NLG Head to Toe @hadyelsahar
From Sequence Modeling
To modeling the probability of the next word
Using the chain rule, as follows:
P(W) = P(w1, w2, …, wN) = ∏_{t=1}^{N} P(wt | w<t)
W (in bold) is a sequence of N words w1, w2, …, wN.
w<t is the notation for all words before time step t.
NLG Head to Toe @hadyelsahar
FYI - Break
There are terms associated with this method of modeling language: “left-to-right language modeling”, “autoregressive language models”.
Other ways of modeling language (not discussed in this tutorial):
- Bidirectional language modeling (also uses the right context)
- Non-autoregressive language models (words in a sentence are generated independently)
NLG Head to Toe @hadyelsahar
Language Modeling
- Introduction to Language Modeling
- N-gram Language Models (Recap)
- Neural Language models
- Recurrent Neural Networks (RNN)
- Generating text from Language Models (fun!)
- Temperature Scaling
- Top-k / Nucleus Sampling
- LM Evaluation:
- Cross Entropy, Perplexity
NLG Head to Toe @hadyelsahar
N-Gram Language Models
NLG Head to Toe @hadyelsahar
N-Gram Language Models
What is an N-gram?
Sequence: “Elephants are smart animals”
● unigrams (n=1): Elephants | are | smart | animals
● bigrams (n=2): Elephants are | are smart | smart animals
● trigrams (n=3): Elephants are smart | are smart animals
NLG Head to Toe @hadyelsahar
N-Gram Language Models
P(Elephants are smart animals they are quite big) =
P(Elephants) x
P(are |Elephants ) x
P(smart | Elephants , are) x
P(animals| Elephants , are , smart) x
P(they | Elephants , are , smart, animals ) x
P(are | Elephants , are , smart, animals , they) x
P(quite | Elephants , are , smart, animals , they , are) x
P(big | Elephants , are , smart, animals , they , are, quite)
These are quite easy to find and count
in a limited size corpus.
We will learn how to deal with those now!
Recall this problem?
NLG Head to Toe @hadyelsahar
N-Gram Language Models
P(Elephants are smart animals they are quite big) =
P(Elephants) x
P(are |Elephants ) x
P(smart | Elephants , are) x
P(animals| Elephants , are , smart) x
P(they | Elephants , are , smart, animals ) x
P(are | Elephants , are , smart, animals , they) x
P(quite | Elephants , are , smart, animals , they , are) x
P(big | Elephants , are , smart, animals , they , are, quite)
Assumption: N-gram language models use the assumption that the probability of a word only depends on its N previous tokens.
P(Elephants are smart animals they are quite big) =
P(Elephants) x
P(are |Elephants ) x
P(smart | Elephants , are) x
P(animals| Elephants , are , smart) x
P(they | Elephants , are , smart, animals ) x
P(are | Elephants , are , smart, animals , they) x
P(quite | Elephants , are , smart, animals , they , are) x
P(big | Elephants , are , smart, animals , they , are, quite)
Tri-gram language model
NLG Head to Toe @hadyelsahar
N-Gram Language Models
P(Elephants are smart animals they are quite big) =
P(Elephants) x
P(are |Elephants ) x
P(smart | Elephants , are) x
P(animals| Elephants , are , smart) x
P(they | Elephants , are , smart, animals ) x
P(are | Elephants , are , smart, animals , they) x
P(quite | Elephants , are , smart, animals , they , are) x
P(big | Elephants , are , smart, animals , they , are, quite)
Assumption: N-gram language models use the assumption that the probability of a word only depends on its N previous tokens.
P(Elephants are smart animals they are quite big) =
P(Elephants) x
P(are |Elephants ) x
P(smart | Elephants , are) x
P(animals| Elephants , are , smart) x
P(they | are , smart, animals ) x
P(are | smart, animals , they) x
P(quite | animals , they , are) x
P(big | they , are, quite)
Tri-gram language model
Now these are easier to
find and count in a
corpus
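A minimal sketch of a count-based trigram model under this assumption (here conditioning on the two previous tokens, the conventional trigram definition; the name `sentences`, a list of token lists, is hypothetical, and real systems also add smoothing):

```python
from collections import Counter

def train_trigram_lm(sentences):
    """Count trigrams and their bigram prefixes; sentences are lists of tokens."""
    tri, bi = Counter(), Counter()
    for toks in sentences:
        toks = ["<s>", "<s>"] + toks + ["</s>"]
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    def prob(word, w1, w2):
        # P(word | w1, w2) = count(w1 w2 word) / count(w1 w2)
        if bi[(w1, w2)] == 0:
            return 0.0  # unseen context; real systems use smoothing / back-off here
        return tri[(w1, w2, word)] / bi[(w1, w2)]
    return prob

# p = train_trigram_lm([["Elephants", "are", "smart", "animals"]])
# p("smart", "Elephants", "are")
```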
NLG Head to Toe @hadyelsahar
FYI - Break
Markov Property
Andrey Markov: born 14 June 1856 (N.S.), Ryazan, Russian Empire; died 20 July 1922 (aged 66), Petrograd, Russian SFSR; nationality: Russian.
https://en.wikipedia.org/wiki/Markov_property
P(Elephants are smart animals they are quite big) =
P(Elephants) x
P(are |Elephants ) x
P(smart | Elephants , are) x
P(animals| Elephants , are , smart) x
P(they | Elephants , are , smart, animals ) x
P(are | Elephants , are , smart, animals , they) x
P(quite | Elephants , are , smart, animals , they , are) x
P(big | Elephants , are , smart, animals , they , are, quite)
NLG Head to Toe @hadyelsahar
Language Modeling
- Introduction to Language Modeling
- N-gram Language Models
- Neural Language models
- Recurrent Neural Networks (RNN)
- Generating text from Language Models (fun!)
- Temperature Scaling
- Top-k / Nucleus Sampling
- LM Evaluation:
- Cross Entropy, Perplexity
NLG Head to Toe @hadyelsahar
Neural Language models
Recurrent Neural Networks (RNN)
NLG Head to Toe @hadyelsahar
Neural Language Modeling
Remember this?
P_Ө(wt | w<t) — we are going to learn a neural network for this.
w<t is the notation for all words before time step t (* denotes the empty sequence).
Ө are the learnable params of the neural network.
Animation from NLP course | For you by Lena Voita
https://lena-voita.github.io/nlp_course/language_modeling.html
NLG Head to Toe @hadyelsahar
Language Modeling using a Feed-Forward NN
Can we use a feed-forward neural network?
Example: P_Ө(animals | elephants, are, the, smartest) = ? → animals = 0.1
[Diagram: a fixed window of L tokens as one-hot word encodings (L x |V|) → embedding layer Өe (|V| x h_embed) → word embeddings (L x h_embed) → concatenated into one vector (L·h_embed x 1) → projection layer Өw (L·h_embed x |V|) → logits (|V| x 1) → softmax → probability distribution over words in the vocab (|V| x 1). The pre-softmax scores are usually called logits.]
Ө are the learnable params of the neural network.
NLG Head to Toe @hadyelsahar
Can we use a feed-forward neural network? The problem with fixed-length input.
[Same diagram as the previous slide, with a fixed width of L tokens; here L = 4.]
P_Ө(they | elephants, are, the, smartest, animals) → L > 4 (not possible)
P_Ө(they | elephants, are, the, smartest, animals) → Markov assumption (now possible)
P_Ө(the | elephants, are) → L < 4 (not possible)
P_Ө(they | <MASK>, <MASK>, elephants, are) → adding dummy tokens (now possible)
In practice this isn’t a good idea.
NLG Head to Toe @hadyelsahar
Recurrent Neural Networks
1. Recurrent (calculated repeatedly)
2. Great with languages!
3. Have an internal memory called the “hidden state”
4. Can model infinite-length sequences
5. No need for the Markov assumption
[Diagram: at each step the RNN takes the one-hot encoding of the current word xt (1 x |V|), embeds it with the embedding layer Wi (|V| x H_embed) into x̃t (1 x H_embed), combines it (via U) with the previous hidden state h_{t-1} to produce the new hidden state ht (1 x H_state), projects ht with the output embedding layer Wo / V into logits yt (1 x |V|), and applies a softmax to obtain ŷt, a probability distribution over the vocabulary. Selecting the entry that corresponds to the next token gives e.g. P(are | elephants) = 0.1, P(smart | elephants, are) = 0.2, P(animals | elephants, are, smart) = 0.5. The core RNN is considered to be only the recurrent part, as it operates on sequences of continuous vectors.]
NLG Head to Toe @hadyelsahar
Recurrent Neural Networks
[Diagram, step by step: (1) embed the one-hot vector xt into a word embedding x̃t using Wi (|V| x H_embed); Wi could be trained or kept frozen. (2) Calculate the hidden state representation ht (1 x H_state) from x̃t and the previous hidden state h_{t-1} (via U); ht depends on h_{t-1}. (3) Project the hidden state into the output space with the output embedding layer Wo / V to obtain the logits yt (1 x |V|). (4) Apply a softmax to the logits to calculate the probabilities ŷt. For simplicity the figure doesn’t include the bias terms b and c.]
NLG Head to Toe @hadyelsahar
Recurrent Neural Networks
[Same diagram: embed the one-hot vector into a word embedding (Wi, trained or kept frozen); calculate the hidden state ht from h_{t-1}; project the hidden state into the output space; calculate the probabilities out of the logits. ŷt_i is the probability of word i at time step t given by the model. For simplicity the figure doesn’t include the bias terms b and c.]
Wi, Wo, V and U are trainable parameters. OK, but how do we train them?
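To make the picture concrete, here is a rough NumPy sketch of one step of such an RNN language model; the matrix names loosely follow the figure (Wi, U, Wo), but the exact shapes and the tanh non-linearity are assumptions, and the bias terms are omitted as in the figure:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_step(x_onehot, h_prev, Wi, U, Wo):
    """One step of an RNN LM.
    x_onehot: (|V|,) one-hot input token; h_prev: (H,) previous hidden state.
    Wi: (|V|, H) input embedding; U: (H, H) recurrence; Wo: (|V|, H) output embedding."""
    x_embed = Wi.T @ x_onehot            # embed the one-hot vector
    h_t = np.tanh(U @ h_prev + x_embed)  # new hidden state depends on h_{t-1}
    logits = Wo @ h_t                    # project the hidden state to the vocab
    y_hat = softmax(logits)              # probability distribution over the vocab
    return h_t, y_hat
```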
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
(data preprocessing )
African elephants are the largest land animals on
Earth. They are slightly larger than their Asian
cousins. They can be identified by their larger ears ….
1) Collect large amount of free text
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
(data preprocessing )
African elephants are the largest land animals on
Earth. They are slightly larger than their Asian
cousins. They can be identified by their larger ears ….
African elephants are the largest land animals on
Earth.
They are slightly larger than their Asian cousins.
They can be identified by their larger ears.
1) Collect large amount of free text 2) split into chunks ( e.g. sentence level tokenization)
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
(data preprocessing )
African elephants are the largest land animals on
Earth. They are slightly larger than their Asian
cousins. They can be identified by their larger ears ….
African elephants are the largest land animals on
Earth.
They are slightly larger than their Asian cousins.
They can be identified by their larger ears.
<s> African elephants are the largest land animals on Earth . </s>
<s> They are slightly larger than their Asian cousins . </s>
<s> They can be identified by their larger ears . </s>
1) Collect large amount of free text 2) split into chunks ( e.g. sentence level tokenization)
3) split into tokens
(e.g. word, char, sub-word units)
Add start of seq and end of seq tokens
<s> </s>
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
(data preprocessing )
African elephants are the largest land animals on
Earth. They are slightly larger than their Asian
cousins. They can be identified by their larger ears ….
African elephants are the largest land animals on
Earth.
They are slightly larger than their Asian cousins.
They can be identified by their larger ears.
<s> African elephants are the largest land animals on Earth . </s>
<s> They are slightly larger than their Asian cousins . </s>
<s> They can be identified by their larger ears . </s>
1) Collect large amount of free text 2) split into chunks ( e.g. sentence level tokenization)
3) split into tokens
(e.g. word, char, sub-word units)
Add start of seq and end of seq tokens
<s> </s>
0 <s>
1 </s>
2 They
3 on
4 elephants
5 African
6 larger
..
1200 ears
4) Build a vocabulary V
More on
tokenization in
the next lecture
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
(data preprocessing )
African elephants are the largest land animals on
Earth. They are slightly larger than their Asian
cousins. They can be identified by their larger ears ….
African elephants are the largest land animals on
Earth.
They are slightly larger than their Asian cousins.
They can be identified by their larger ears.
<s> African elephants are the largest land animals on Earth . </s>
<s> They are slightly larger than their Asian cousins . </s>
<s> They can be identified by their larger ears . </s>
1) Collect large amount of free text 2) split into chunks ( e.g. sentence level tokenization)
3) split into tokens
(e.g. word, char, sub-word units)
Add start of seq and end of seq tokens
<s> </s>
0 <s>
1 </s>
2 They
3 on
4 elephants
5 African
6 larger
..
1200 ears
4) Build a vocabulary V
More on
tokenization in
the next lecture
- [0, 2 , 4 ,5, 6, 7, 8, 8, 101, 22, 1]
- [0, 22, 45, 65, 78, 9, 3, 4, 2, 1]
- [0, 1, 23, 3, 4, 5, 65, 7, 7, 8, 1]
5) Index the training data
Each word (token) can now be represented as a one-hot vector!
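Putting the preprocessing steps together, a minimal sketch (naive whitespace tokenization; the name `raw_sentences` is hypothetical):

```python
def preprocess(raw_sentences):
    """Tokenize, add <s>/</s>, build a vocabulary, and index the training data."""
    tokenized = [["<s>"] + s.split() + ["</s>"] for s in raw_sentences]
    vocab = {"<s>": 0, "</s>": 1}
    for toks in tokenized:
        for tok in toks:
            vocab.setdefault(tok, len(vocab))  # assign the next free index
    indexed = [[vocab[tok] for tok in toks] for toks in tokenized]
    return vocab, indexed

# vocab, data = preprocess(["African elephants are the largest land animals on Earth ."])
# data -> [[0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1]]
```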
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
[0, 2 , 4 ,5, 6, 10]
<s>
African
cost
<s> African elephants are smart </s>
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
[0, 2 , 4 ,5, 6, 10]
<s>
African elephants
cost cost
<s> African elephants are smart </s>
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
[0, 2 , 4 ,5, 6, 10]
<s> African elephants
African elephants are
cost cost cost
<s> African elephants are smart </s>
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
[0, 2 , 4 ,5, 6, 10]
<s> African elephants are
African elephants are smart
cost cost cost cost
<s> African elephants are smart </s>
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
[0, 2 , 4 ,5, 6, 10]
<s> African elephants are smart
African elephants are smart </s>
cost cost cost cost cost
Cross
entropy loss
<s> African elephants are smart </s>
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
[0, 2 , 4 ,5, 6, 10]
<s> African elephants are smart
African elephants are smart </s>
cost cost cost cost cost
Cross
entropy loss
<s> African elephants are smart </s>
Loss function
Gradients updates
through
backpropagation
NLG Head to Toe @hadyelsahar
Training Recurrent Neural Networks
[0, 2 , 4 ,5, 6, 10]
<s> African elephants are smart
African elephants are smart </s>
cost cost cost cost cost
Cross
entropy loss
<s> African elephants are smart </s>
Loss function
Gradients updates
through
backpropagation
How to calculate
the cross
entropy loss?
NLG Head to Toe @hadyelsahar
Cross Entropy
Claude Shannon Page on wikipedia
The “surprisal” of P_Ө (the model distribution) on samples generated from D (the true data distribution):
H(D, P_Ө) = −E_{x∼D}[ log P_Ө(x) ]
Cross-entropy can also be seen as a “closeness” measure between two distributions, and its direction matters: H(D, P_Ө) ≠ H(P_Ө, D).
Claude E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951.
NLG Head to Toe @hadyelsahar
Cross-Entropy loss (negative log likelihood)
Notation: n indexes the training example, t the time step, and i the position in the vocab (in the logits, probability vectors, or one-hot vectors).
- ŷ_t is the output distribution over the tokens in the vocab produced by your RNN at time step t.
- y^n_{t,i} = 1 if i is the correct token, and y^n_{t,i} = 0 if i is not the correct token (one-hot target).
- p^n_{t,i} is the probability the model gives to token i in example n at time step t.
Let’s simplify the notation: write p^n_t for p^n_{t,i_correct}, the probability of the correct token given the previous context; since y^n_{t,i_correct} is always 1 there is no need to write it.
The probability of the correct sequence is the product of p^n_t over time steps, and the cost is the Negative Log Likelihood:
NLL = − Σ_{n=1}^{N} Σ_t log p^n_t,   where N is the number of training examples.
NLG Head to Toe @hadyelsahar
Maximum (log) likelihood loss
Minimize the loss → minimize the Negative Log Likelihood → Maximum Likelihood Estimation (MLE).
This objective is usually written as
Ө* = argmin_Ө − log P_Ө(y)
where P_Ө(y) is the probability of the true sequence y given a language model parameterized by Ө. We find the parameters Ө that minimize the negative log likelihood, usually using SGD.
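A minimal PyTorch-style sketch of one MLE training step with teacher forcing; the layer sizes and the name `batch` are assumptions, not the lecture's exact setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNLM(nn.Module):
    def __init__(self, vocab_size, h=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, h)    # Wi
        self.rnn = nn.RNN(h, h, batch_first=True)   # recurrence (U and input projection)
        self.out = nn.Linear(h, vocab_size)         # Wo
    def forward(self, tokens):                      # tokens: (batch, seq_len) indices
        states, _ = self.rnn(self.embed(tokens))
        return self.out(states)                     # logits: (batch, seq_len, |V|)

def train_step(model, optimizer, batch):
    """batch: LongTensor of indexed sentences, shape (batch, seq_len)."""
    inputs, targets = batch[:, :-1], batch[:, 1:]   # predict the next token at every position
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),  # -log p of the correct token
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                 # gradient updates through backpropagation
    optimizer.step()                                # SGD (or Adam) update of Ө
    return loss.item()
```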
NLG Head to Toe @hadyelsahar
Practical Tips (parameter sharing)
FYI - Break
[Diagram: the input embedding layer Wi (|V| x H_embed) and the output embedding layer Wo (|V| x H_state) from the RNN figure.]
It is a common practice to unify the embedding sizes across the whole network: H_state = H_embed.
This makes both the input and output embedding layers have the same dimensionality. You can tie their weights and share both as one matrix to reduce the parameter size of the RNN (this is called weight tying / parameter sharing).
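A small sketch of weight tying in PyTorch, reusing the RNNLM sketch above; the only requirement is that the output layer and the embedding have the same (|V| × H) shape:

```python
import torch.nn as nn

class TiedRNNLM(nn.Module):
    """Same RNN LM, but the input and output embeddings share one (|V| x H) matrix."""
    def __init__(self, vocab_size, h=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, h)     # requires H_state == H_embed
        self.rnn = nn.RNN(h, h, batch_first=True)
        self.out = nn.Linear(h, vocab_size, bias=False)
        self.out.weight = self.embed.weight          # weight tying / parameter sharing
    def forward(self, tokens):
        states, _ = self.rnn(self.embed(tokens))
        return self.out(states)
```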
NLG Head to Toe @hadyelsahar
RNNs enjoy great flexibility
FYI - Break
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
NLG Head to Toe @hadyelsahar
The last “hidden state” of a Recurrent Neural Network can be used as a sentence representation for many downstream tasks, e.g. classification.
FYI - Break
<s> African elephants are smart
Feed Fwd
Neural
Network
Sentiment = positive
Classification loss
NLG Head to Toe @hadyelsahar
You can Stack Recurrent Neural Networks
FYI - Break
RNN RNN RNN RNN
RNN
RNN
RNN
RNN
Layer 2 RNN runs over a
sequence of “hidden states”
(not softmax output)
of Layer 1 RNNs
NLG Head to Toe @hadyelsahar
Can I use this RNN to
generate text?
Now we know how to
model text probabilities
using Recurrent Neural
Networks.
Decoding | Inference
After all, it is called Neural Language “Generation”
NLG Head to Toe @hadyelsahar
Let’s see a
demo !!
Demo Break
Write With Transformer
https://transformer.huggingface.co/doc/gpt
https://beta.openai.com/playground
NLG Head to Toe @hadyelsahar
Language Modeling
- Introduction to Language Modeling
- N-gram Language Models
- Neural Language models
- Recurrent Neural Networks (RNN)
- Generating text from Language Models
- Greedy Decoding
- Temperature Scaling
- Top-k / Nucleus Sampling
- LM Evaluation:
- Cross Entropy, Perplexity
NLG Head to Toe @hadyelsahar
Decoding | Inference
Auto-regressive decoding
RNNs output a categorical distribution over the tokens in the vocab at each step.
Context: <s>
p_Ө( * | <s>): Elephants 0.4 | I 0.03 | He 0.02 | They 0.001 | … | smart 0.02 | animals 0.001 | Giraffes 0.016 | Think 0.2 | </s> 0.001 | …
NLG Head to Toe @hadyelsahar
Decoding | Inference
RNNs output a Multinomial distribution over Tokens in the vocab at each step
1) Select next token (we will see later different selection methods)
Elephants
He
They
I
….
smart
animals
Giraffes
Think
</s>
….
….
0.4
0.03
0.02
0.001
….
0.02
0.001
0.016
0.2
0.001
….
….
<s>
Auto-regressive decoding
NLG Head to Toe @hadyelsahar
Decoding | Inference
RNNs output a Multinomial distribution over Tokens in the vocab at each step
1) Select next token
Elephants
He
They
I
….
smart
animals
Giraffes
Think
</s>
….
….
0.4
0.03
0.02
0.001
….
0.02
0.001
0.016
0.2
0.001
….
….
<s>
Elephants
Auto-regressive decoding
NLG Head to Toe @hadyelsahar
Decoding | Inference
RNNs output a Multinomial distribution over Tokens in the vocab at each step
2) Feed selected token (auto-regressively) to the RNN and calculate pӨ
( * | <s>, I)
Elephants
He
They
I
….
are
animals
Giraffes
Think
</s>
….
….
0.0002
0.03
0.02
0.001
….
0.2
0.001
0.016
0.2
0.001
….
….
<s> Elephants
Elephants
Auto-regressive decoding
NLG Head to Toe @hadyelsahar
Decoding | Inference
RNNs output a Multinomial distribution over Tokens in the vocab at each step
3) Select next token
Elephants
He
They
I
….
are
animals
Giraffes
Think
</s>
….
….
0.0002
0.03
0.02
0.001
….
0.2
0.001
0.016
0.2
0.001
….
….
<s> Elephants
Elephants
Auto-regressive decoding
are
NLG Head to Toe @hadyelsahar
Decoding | Inference
RNNs output a Multinomial distribution over Tokens in the vocab at each step
Repeat the process until end of sequence token </s> or max length is reached.
Auto-regressive decoding
<s> Elephants
Elephants are
NLG Head to Toe @hadyelsahar
Decoding | Inference
RNNs output a Multinomial distribution over Tokens in the vocab at each step
Repeat the process until end of sequence token </s> or max length is reached.
Auto-regressive decoding
<s> Elephants are
Elephants are smart
NLG Head to Toe @hadyelsahar
Decoding | Inference
RNNs output a Multinomial distribution over Tokens in the vocab at each step
Repeat the process until end of sequence token </s> or max length is reached.
Auto-regressive decoding
<s> Elephants are smart
Elephants are smart animals
NLG Head to Toe @hadyelsahar
Decoding | Inference
RNNs output a Multinomial distribution over Tokens in the vocab at each step
Repeat the process until end of sequence token </s> or max length is reached.
Auto-regressive decoding
<s> Elephants are smart animals
Elephants are smart animals </s>
NLG Head to Toe @hadyelsahar
Decoding | Inference
How to “Select” the next token given a categorical distribution over Tokens in the
vocab.
Auto-regressive decoding
Elephants
He
They
I
….
are
animals
Giraffes
Think
</s>
….
….
0.0002
0.03
0.02
0.001
….
0.2
0.001
0.016
0.2
0.001
….
….
The maximum value?
NLG Head to Toe @hadyelsahar
Decoding | Inference
How to “Select” the next token given a categorical distribution over Tokens in the
vocab.
Auto-regressive decoding
                        Local                                       Global
Deterministic           Greedy decoding                             MAP / Beam Search
Stochastic / Random     Ancestral (pure) sampling, sampling with
                        temperature, top-k sampling, nucleus sampling
NLG Head to Toe @hadyelsahar
Decoding | Inference
How to “Select” the next token given a categorical distribution over Tokens in the
vocab.
Auto-regressive decoding
Local Global
Deterministic
Stochastic / Random
Greedy decoding MAP
Beam Search
Ancestral (pure) Sampling
Top k
Sampling
Nucleus
Sampling
Sampling
with
Temperature
NLG Head to Toe @hadyelsahar
Greedy Decoding
Select the token with max probability
Distribution at the current step: Elephants 0.0002 | He 0.03 | They 0.02 | I 0.001 | … | are 0.25 | animals 0.001 | Giraffes 0.016 | Think 0.2 | </s> 0.001 | … → pick “are” (0.25), the maximum.
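A minimal sketch of greedy decoding, assuming a model with the interface of the RNNLM sketch above (returns logits of shape (batch, seq, |V|)); the token-id argument names are hypothetical:

```python
import torch

def greedy_decode(model, bos_id, eos_id, max_len=50):
    """Deterministic: at each step pick the argmax token and feed it back in."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = model(torch.tensor([tokens]))[0, -1]  # next-token distribution (logits)
        next_id = int(logits.argmax())                 # greedy: token with max probability
        tokens.append(next_id)
        if next_id == eos_id:                          # stop at </s>
            break
    return tokens
```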
NLG Head to Toe @hadyelsahar
Question Time
During Greedy decoding, at each step the most likely token is selected:
Will that generate the highest likely sequence overall?
a) Yes, always
b) No (but it could happen)
c) Never
NLG Head to Toe @hadyelsahar
Answer
“local” vs “Global” likelihood in sequence generation.
Imagine a vocabulary of two tokens “a” and “b”.
[Tree diagram of next-token probabilities over three decoding steps.]
Run greedy decoding for 3 time steps. Selected sequence: “a b b”
P(“a b b”) = 0.6 * 0.55 * 0.55 = 0.1815
Other sequences could have a globally higher probability:
P(“b a b”) = 0.4 * 0.8 * 0.9 = 0.288
NLG Head to Toe @hadyelsahar
Question Time
Given a trained Language model pӨ
If we run greedy decoding for the context: “<s>” until “</s>” is obtained.
We repeat this process 1000 times.
How many unique sequences will be obtained?
a) 1000
b) Infinity
c) 1
d) 42
e) 75000
NLG Head to Toe @hadyelsahar
Decoding | Inference
How to “Select” the next token given a categorical distribution over Tokens in the
vocab.
Auto-regressive decoding
Local Global
Deterministic
Stochastic / Random
MAP
Beam Search
Ancestral (pure) Sampling
Greedy decoding
Top k
Sampling
Nucleus
Sampling
Sampling
with
Temperature
NLG Head to Toe @hadyelsahar
Ancestral (Pure) Sampling
Also called “Pure” Sampling, Standard Sampling, or just Sampling.
Sampling is stochastic (random) ≠ deterministic
Distribution at the current step: Elephants 0.0002 | He 0.03 | They 0.02 | I 0.001 | … | are 0.25 | animals 0.001 | Giraffes 0.016 | Think 0.2 | </s> 0.001 | … → sample the next token from this distribution.
Pure sampling will obtain “unbiased”
samples.
I.e. distribution of generated sequences
matches the Language model
distribution over sequences.
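A minimal sketch of ancestral (pure) sampling, with the same assumed model interface as the greedy sketch above:

```python
import torch

def sample_decode(model, bos_id, eos_id, max_len=50):
    """Stochastic: draw each next token from the model's full distribution."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = model(torch.tensor([tokens]))[0, -1]
        probs = torch.softmax(logits, dim=-1)
        next_id = int(torch.multinomial(probs, num_samples=1))  # unbiased random draw
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```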
NLG Head to Toe @hadyelsahar
Question Time
During Pure Sampling, at each step x is sampled:
Will that generate the highest likely sequence overall?
a) Yes, always
b) No (but it could happen)
c) Never
NLG Head to Toe @hadyelsahar
Question Time
Given the following conditional probabilities of a trained language
model, If we run pure sampling 10000 times and Greedy 1000 times.
How many times the sequence “b b” will be obtained:
a) 0 using Greedy & 100 using sampling
b) 100 using Greedy & 10 using sampling
c) 10 times Greedy & 10 using sampling
d) 0 times Greedy & 10000 using sampling
[Tree diagram of conditional token probabilities from the start of sequence * (empty): p(a | *), p(b | *), p(a | a *), p(b | a *), p(b | b *), with the values 0.9, 0.1, 0.55, 0.45, 0.9.]
NLG Head to Toe @hadyelsahar
Ancestral (Pure) Sampling
Pros
- Diversity in generations
(not always the same sequence)
- Generated samples reflect the
Language Model probability
distribution of sequences (Unbiased).
Cons
Pure sampling sometimes leads to incoherent text.
Fig. from THE CURIOUS CASE OF NEURAL TEXT DEGENERATION (Holtzman et al. 2020)
NLG Head to Toe @hadyelsahar
Decoding | Inference
How to “Select” the next token given a categorical distribution over Tokens in the
vocab.
Auto-regressive decoding
Local Global
Deterministic
Stochastic / Random
MAP
Beam Search
Ancestral (pure) Sampling
Greedy decoding
Top k
Sampling
Nucleus
Sampling
Sampling
with
Temperature
NLG Head to Toe @hadyelsahar
Sampling with Temperature
Lowering the temperature of the softmax (T < 1) will make the distribution peakier, i.e. less likely to sample from unlikely candidates.
A higher temperature produces a softer probability distribution over tokens, resulting in more diversity and also more mistakes.
Divide the logits by a (constant) temperature value T before the softmax: p(xt | x<t) = softmax(y / T). As T decreases, (yi / T) increases and the distribution over the words in the vocab becomes peakier.
Good read on the topic: https://medium.com/@majid.ghafouri/why-should-we-use-temperature-in-softmax-3709f4e0161
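A small sketch of next-token selection with temperature, operating on the logits of one step as in the sketches above:

```python
import torch

def sample_with_temperature(logits, T=0.7):
    """Divide the logits by T before the softmax.
    T < 1 -> peakier distribution; T > 1 -> softer, more diverse (and more mistakes)."""
    probs = torch.softmax(logits / T, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```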
NLG Head to Toe @hadyelsahar
Sampling with Temperature
Try it yourself: https://lena-voita.github.io/nlp_course/language_modeling.html#generation_strategies_temperature
Lowering the temperature of the softmax (T < 1) will make the distribution peakier, i.e. less likely to sample from unlikely candidates.
NLG Head to Toe @hadyelsahar
Sampling with Temperature
Try it yourself: https://lena-voita.github.io/nlp_course/language_modeling.html#generation_strategies_temperature
Higher temperature (> 1) produces a softer probability distribution over
tokens, resulting in more diversity and also more mistakes.
NLG Head to Toe @hadyelsahar
Let’s see a demo
!!
Demo
Write With Transformer
https://transformer.huggingface.co/doc/gpt
https://beta.openai.com/playground
NLG Head to Toe @hadyelsahar
Top-k Sampling
Hierarchical Neural Story Generation (Fan et al. 2018)
At each timestep, randomly sample from the k most likely candidates of the token distribution.
Distribution: Elephants 0.3 | He 0.25 | They 0.1 | I 0.1 | are 0.03 | animals 0.02 | Giraffes 0.01 | Think 0.01 | </s> 0.01 | …
K = 4: keep {Elephants 0.3, He 0.25, They 0.1, I 0.1} → normalize → {0.399, 0.333, 0.133, 0.133} → sample → “He”
NLG Head to Toe @hadyelsahar
Top-k Sampling
Hierarchical Neural Story Generation (Fan et al. 2018)
A fixed size k in top-k sampling is not always a good idea:
Flat distribution (Elephants 0.3 | He 0.25 | They 0.1 | I 0.1 | are 0.03 | animals 0.02 | Giraffes 0.01 | Think 0.01 | </s> 0.01 | …): K = 4 may cut off reasonable candidates.
Peaked distribution (Elephants 0.6 | He 0.3 | They 0.02 | I 0.02 | are 0.02 | animals 0.02 | Giraffes 0.01 | Think 0.01 | </s> 0.01 | …): K = 4 may include unlikely candidates.
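A small sketch of top-k next-token selection, again operating on one step's logits:

```python
import torch

def sample_top_k(logits, k=4):
    """Keep the k most likely tokens, renormalize among them, and sample."""
    top_probs, top_ids = torch.topk(torch.softmax(logits, dim=-1), k)
    top_probs = top_probs / top_probs.sum()      # renormalize the kept candidates
    choice = torch.multinomial(top_probs, num_samples=1)
    return int(top_ids[choice])
```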
NLG Head to Toe @hadyelsahar
Top-p (Nucleus) Sampling
Sample from the top p of the probability mass.
Distribution: Elephants 0.3 | He 0.2 | They 0.1 | I 0.05 | are 0.03 | animals 0.02 | Giraffes 0.01 | Think 0.01 | </s> 0.01 | …
p = 0.6: keep {Elephants 0.3, He 0.2, They 0.1} → normalize → {0.5, 0.333, 0.166} → sample → “Elephants”
THE CURIOUS CASE OF NEURAL TEXT DEGENERATION (Holtzman et al. 2020)
NLG Head to Toe @hadyelsahar
Top-p (Nucleus) Sampling
Top-p = adaptive top-k: the number of kept candidates adapts to the shape of the distribution.
Flatter distribution, p = 0.8: several candidates are needed to cover 80% of the probability mass.
Peaked distribution (Elephants 0.4 | He 0.4 | They 0.02 | I 0.02 | are 0.02 | animals 0.02 | Giraffes 0.01 | Think 0.01 | </s> 0.01 | …), p = 0.8: only two candidates already cover the mass.
THE CURIOUS CASE OF NEURAL TEXT DEGENERATION (Holtzman et al. 2020)
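A small sketch of top-p (nucleus) next-token selection; note that implementations differ slightly in whether the token that crosses the threshold p is kept:

```python
import torch

def sample_top_p(logits, p=0.8):
    """Keep the smallest set of most-likely tokens whose cumulative mass reaches p, then sample."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) <= p
    keep[0] = True                                           # always keep at least the top token
    nucleus = sorted_probs[keep] / sorted_probs[keep].sum()  # renormalize the nucleus
    choice = torch.multinomial(nucleus, num_samples=1)
    return int(sorted_ids[keep][choice])
```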
NLG Head to Toe @hadyelsahar
Decoding | Inference
How to “Select” the next token given a categorical distribution over Tokens in the
vocab.
Auto-regressive decoding
Local Global
Deterministic
Stochastic / Random
MAP
Beam Search
Ancestral (pure) Sampling
Greedy decoding
Top k
Sampling
Nucleus
Sampling
Sampling
with
Temperature
Next Lecture!
NLG Head to Toe @hadyelsahar
Demo again!
Demo Break
Write With Transformer
https://transformer.huggingface.co/doc/gpt
https://beta.openai.com/playground
NLG Head to Toe @hadyelsahar
Language Model Evaluation
& More about “Cross-Entropy”
NLG Head to Toe @hadyelsahar
Language Modeling Evaluation
For classification, higher accuracy is always better:
● Accuracy = 80% is better than 60%
But for language modeling?
Test set x → Language Model → P(x) = 0.001. Is that good?
A good read on the topic by Chip Huyen: https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
NLG Head to Toe @hadyelsahar
Language Modeling Evaluation
Intrinsic Metrics
● perplexity
● cross entropy
● bits-per-character (BPC)
Extrinsic Metrics
● Generate samples X ~ P_Ө and judge them: grammatically correct? fluent? coherent?
What are these?
A good read on the topic by Chip Huyen: https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
NLG Head to Toe @hadyelsahar
Background: Entropy (information theory)
Claude Shannon Page on wikipedia
Imagine a process that generates samples, e.g. a language model P_Ө that generates a sequence xi ~ P_Ө.
−log(P_Ө(xi)) is the “surprisal”.
If a model generates samples that have low probability P_Ө(xi), it will have high surprisal on them; samples with higher probability are ones the model is confident about (i.e. low surprisal).
Claude E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951.
NLG Head to Toe @hadyelsahar
Background: Entropy (information theory)
Claude Shannon Page on wikipedia
Imagine a process that generates samples, e.g. a language model P_Ө that generates a sequence xi.
Entropy is the expected level of “surprisal”:
H(P_Ө) = −E_{x∼P_Ө}[ log P_Ө(x) ]
Entropy is usually denoted by H. It measures how, on average, the model is surprised by its own samples, i.e. how unconfident it is about its samples. The expectation (“mean value”) is in theory calculated using infinite samples or a closed form, but it can be approximated using a large number N of samples.
Claude E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951.
NLG Head to Toe @hadyelsahar
Background: Entropy (information theory)
Claude Shannon Page on wikipedia
What does it mean that your model has low Entropy?
Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951.
Q : Imagine a language model that generates only one sample x= “elephants are
smart” with PӨ
(x) = 1 , what is the entropy of this language model H(PӨ
) ?
a) 1
b) zero
c) infinity
● Low entropy tells you that your model
is not random (i.e. learned something)
● But it could be confident about the wrong
things.
NLG Head to Toe @hadyelsahar
Background: Entropy (information theory)
Claude Shannon Page on wikipedia
What does it mean that your model has low Entropy?
Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951.
Q : Imagine a language model that generates only one sample x= “elephants are
smart” with PӨ
(x) = 1 , what is the entropy of this language model H(PӨ
) ?
a) 1
b) zero
c) infinity
No “surprisal” here: the model is completely confident about the only example it generates!
NLG Head to Toe @hadyelsahar
Background: Entropy (information theory)
Claude Shannon Page on wikipedia
What does it mean that your model has low Entropy?
Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951.
Q : Imagine a language model that generates only one sample x= “elephants are
smart” with PӨ
(x) = 1 , what is the entropy of this language model H(PӨ
) ?
a) 1
b) zero
c) infinity
No “surprisal” here: the model is completely confident about the only example it generates!
Then why teach us about it, if it is a bad metric?!
NLG Head to Toe @hadyelsahar
Cross Entropy
Claude Shannon Page on wikipedia
The “surprisal” of P_Ө on samples generated from D (the true language distribution):
H(D, P_Ө) = −E_{x∼D}[ log P_Ө(x) ]     (compare with the entropy H(D) = −E_{x∼D}[ log D(x) ])
Cross-entropy can also be seen as a “closeness” measure between two distributions, and its direction matters: H(D, P_Ө) ≠ H(P_Ө, D).
Claude E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951.
NLG Head to Toe @hadyelsahar
Cross Entropy rate per token
Given that we are interested in sentences (sequences of tokens) of length n, we will use the cross-entropy rate per token:
H(D, P_Ө) ≈ −(1/n) log P_Ө(w1, …, wn)
How do we know the true probability of a sentence in the whole language? We don't need to: this equality holds only in the limit where n is infinitely long (for more details see the Shannon–McMillan–Breiman theorem). We can estimate the cross-entropy by measuring the model's log probability on a large number of random sample sentences, or on a very large chunk of text.
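A minimal sketch of estimating the per-token cross-entropy (and the corresponding perplexity, discussed below) on held-out text, reusing the assumed RNNLM interface from the training sketch:

```python
import math
import torch
import torch.nn.functional as F

def evaluate(model, batches):
    """Average -log p per token (in nats) and the corresponding perplexity."""
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for batch in batches:                  # indexed sentences, shape (batch, seq_len)
            inputs, targets = batch[:, :-1], batch[:, 1:]
            logits = model(inputs)
            nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                  targets.reshape(-1), reduction="sum")
            total_nll += nll.item()
            total_tokens += targets.numel()
    cross_entropy = total_nll / total_tokens
    return cross_entropy, math.exp(cross_entropy)  # PPL = e^CE when using the natural log
```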
NLG Head to Toe @hadyelsahar
Which log to use ?
In all the previous theory, the entropy and cross
entropy are defined using log base 2 (with "bit" as
the unit),
“Popular machine learning frameworks, implement
cross entropy loss using natural log. As it is faster
to compute natural log as opposed to log base 2.”
It is often not reported in papers which log they
use, but mostly it is safe to assume the “natural
log”
Source: https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
Natural log
base e
Log base 2 is used in
“information” theory due to
its relation with bits and
bytes
NLG Head to Toe @hadyelsahar
Bits per Character BPC
Cross entropy rate per word /char / token
Bits per (character | Token | word)
BPC BPT BPW
If log base 2 is used, Cross-Entropy per
word becomes BPW
NLG Head to Toe @hadyelsahar
(Cross) Entropy and Compression
Cross entropy rate per token
Example from :
https://www.inf.ed.ac.uk/teaching/courses/fnlp/lectures/04_slides-2x2.pdf
Imagine two language models P_Ө and P_ω with vocab size 10,000, and a random language model R.
Using each language model, compute the per-word cross-entropy of a “very large” sentence D, “Elephants are smart”:
CE(D, P_Ө) = ?
CE(D, P_ω) = ?
CE(D, R) = ?
What does that mean?
NLG Head to Toe @hadyelsahar
(Cross) Entropy and Compression
Example from :
https://www.inf.ed.ac.uk/teaching/courses/fnlp/lectures/04_slides-2x2.pdf
Using each language model, compute the per-word cross-entropy of a “very large” sentence D, “Elephants are smart”:
CE(D, P_Ө) = 5
CE(D, P_ω) = 11
CE(D, R) = 13.287
Language Models encode some text statistics that can be used for
compression.
If we designed an optimal code based on each model, we could encode the
entire sentence in about:
PӨ
→ 5 x 3 = 15 bits
Pω
→ 11 x 3 = 33 bits
R → 13.287 x 3 = 39.861 bits
ASCII uses an average of 24 bits per word → 24 x 3 = 72 bits
NLG Head to Toe @hadyelsahar
Perplexity (PPL)
LM performance is often reported as perplexity rather than
cross-entropy.
Perplexity is simply:
PPL = 2^cross-entropy (if the cross-entropy uses log base 2)
or
PPL = e^cross-entropy (if the cross-entropy uses the natural log)
A cross-entropy of 6 bits means our model's perplexity is 2^6 = 64: equivalent uncertainty to a uniform distribution over 64 outcomes, i.e. the language model chooses among 64 equally likely options at each time step.
Example from: https://www.inf.ed.ac.uk/teaching/courses/fnlp/lectures/04_slides-2x2.pdf
Reminder: cross-entropy formula H(D, P_Ө) = −E_{x∼D}[ log P_Ө(x) ]
NLG Head to Toe @hadyelsahar
How to interpret Cross-Entropy / Perplexity / BPC ?
Similar to all evaluation:
- The model could be good, or the corpus too easy.
- Only use it to compare different models on the same corpus.
FYI - Break
Comparison of GPT-2 (different model sizes) on the language modeling objective. Language Models are Unsupervised Multitask Learners (Radford et al. 2019)
NLG Head to Toe @hadyelsahar
Entropy of English Language:
- Entropy is the average number of bits to encode the information contained in a random
variable
- CrossEntropy(D, P) is the average number of bits to encode the information contained in a
random variable D encoded using P
- The Entropy (amount of information) in English Language has been a popular topic across
linguists & computer scientists.
FYI - Break
NLG Head to Toe @hadyelsahar
Compression of English Language:
“The Hutter Prize is a cash prize funded by Marcus Hutter which rewards data compression improvements on enwik9, the first 1,000,000,000 characters of a specific version of English Wikipedia. The prize awards 5000 euros for each one percent improvement (with 500,000 euros total funding).” https://en.wikipedia.org/wiki/Hutter_Prize
FYI - Break
NLG Head to Toe @hadyelsahar
Preplexity Perplexity
Short Break
NLG Head to Toe @hadyelsahar
- Conditional Language Model
- Seq2seq Encoder-Decoder Models
- Seq2seq Encoder-Decoder with Attention
- Transformers (Self-Attention)
- Open Problems of NLG Models trained on uncurated
Web text
Part 2: Stuff with Attention
NLG Head to Toe @hadyelsahar
P(“The cat sits on the mat”) = 0.0001
Language Modeling
P(“I am hungry” | “J’ai Faim”) = 0.1
(Conditional) Language Modeling
Conditional Language Modeling
Model probabilities of sequences of
words Conditioned on a Context
Context can be anything:
● Text
● Text in another language
● Image
● Speech
We know:
- How to calculate probabilities of sequences
- Recurrent Neural Networks
- How to decode (generate) sequences
NLG Head to Toe @hadyelsahar
Conditional Language Modeling
P(“I am hungry” | “J’ai Faim”) =
P(I | J’ai Faim) x
P( am | I , J’ai Faim) x
P( hungry | I am , J’ai Faim)
Chain Rule applies
NLG Head to Toe @hadyelsahar
Training: Conditional Language
Modeling
Context / input
Previous
generated tokens
Maximum likelihood estimation: the cross-entropy (negative log likelihood) loss also holds as a training objective.
NLG Head to Toe @hadyelsahar
Decoding / Inference
All generation techniques can work in theory, some are more preferred
than others.
Local Global
Deterministic
Stochastic / Random
Greedy decoding MAP
Beam Search
Ancestral (pure)
Sampling
Top k
Sampling
Nucleus
Sampling
Sampling with
Temperature
NLG Head to Toe @hadyelsahar
Decoding / Inference
All generation techniques can work in theory, some are more preferred
than others.
- In many tasks such as machine translation we care more about accuracy than diversity,
- i.e. finding the globally most likely sequence for an input.
P(“I am hungry” | “J’ai Faim”) = 0.1
P(“I am happy” | “J’ai Faim”) = 0.002
P(“I was hungry” | “J’ai Faim”) = 0.0002
P(“he am hungry” | “J’ai Faim”) = 0.00001
Only one output translation y will be produced for each input x, but that’s fine if it is correct.
NLG Head to Toe @hadyelsahar
Language Modeling vs Conditional Language Modeling: the same three ingredients.
- Modeling: P_Ө(y) vs P_Ө(y | x)
- Training objective: maximum likelihood (cross-entropy) in both cases
- Decoding / inference: greedy, ancestral sampling, beam search, top-k, nucleus sampling in both cases
NLG Head to Toe @hadyelsahar
- Conditional Language Model
- Seq2seq Encoder-Decoder Models
- Seq2seq Encoder-Decoder with Attention
- Transformers (Self-Attention)
- Open Problems of NLG Models trained on uncurated
Web text
Part 2: Stuff with Attention
NLG Head to Toe @hadyelsahar
Encoder-Decoder
<s> j’ ai faim
RNN RNN RNN RNN
RNN RNN RNN RNN
<s> am hungry
I
Initialize the decoder hidden state
With the encoder final hidden state
Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
Encoder
Decoder
NLG Head to Toe @hadyelsahar
Encoder-Decoder
<s> j’ ai faim
RNN RNN RNN RNN
I am hungry </s>
cost cost cost cost
Cross
entropy
loss
Loss
function
Gradients updates
through
backpropagation
RNN RNN RNN RNN
<s> am hungry
I
In this architecture output of the
softmax layer of the encoder RNN
is not used.
Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
NLG Head to Toe @hadyelsahar
You could actually use the same RNN as both encoder and decoder; however, the original paper (Sutskever et al. 2014) uses two different RNNs, “because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the RNN on multiple language pairs simultaneously.”
FYI - Break
FYI - Break
Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
NLG Head to Toe @hadyelsahar
Part 2: Stuff with Attention
- Conditional Language Model
- Seq2seq Encoder-Decoder Models
- Seq2seq Encoder-Decoder with Attention
- Transformers (Self-Attention)
- Open Problems of NLG Models trained on uncurated
Web text
NLG Head to Toe @hadyelsahar
Encoder-Decoder (limitations)
<s> j’ ai faim
RNN RNN RNN RNN
I am hungry </s>
cost cost cost cost
RNN RNN RNN RNN
<s> am hungry
I
What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties (Conneau et al. ACL 2018)
The last hidden state is a single vector representing the whole input (a bottleneck)!
The encoder is not able to compress the whole sentence into one vector.
NLG Head to Toe @hadyelsahar
Attention Mechanism
<s> j’ ai faim
RNN RNN RNN RNN
I
cost
RNN
<s>
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015)
Effective Approaches to Attention-based Neural Machine Translation (luong et al. 2016)
Attention
Feed fwd
Weighted average of all
encoder hidden states
Decoder hidden
state is used as a query
NLG Head to Toe @hadyelsahar
Attention Mechanism
<s> j’ ai faim
RNN RNN RNN RNN
I am
cost cost
RNN RNN
<s> I
Attention
Weighted average of all
encoder hidden states
Decoder hidden
state is used as a query
Feed fwd
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015)
Effective Approaches to Attention-based Neural Machine Translation (luong et al. 2016)
NLG Head to Toe @hadyelsahar
Attention Mechanism
<s> j’ ai faim
RNN RNN RNN RNN
I am hungry
cost cost cost
RNN RNN RNN
<s> am
I
Attention
Weighted average of all
encoder hidden states
Decoder hidden
state is used as a query
Feed fwd
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015)
Effective Approaches to Attention-based Neural Machine Translation (luong et al. 2016)
NLG Head to Toe @hadyelsahar
Attention Mechanism
<s> j’ ai faim
RNN RNN RNN RNN
I am hungry </s>
cost cost cost cost
RNN RNN RNN RNN
<s> am hungry
I
Attention
Weighted average of all
encoder hidden states
Decoder hidden
state is used as a query
Feed fwd
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015)
Effective Approaches to Attention-based Neural Machine Translation (luong et al. 2016)
NLG Head to Toe @hadyelsahar
Attention Mechanism (Luong et al. 2016)
Query (ht): the hidden state of the decoder at time step t.
Keys and Values (h̄s): the hidden states of the encoder.
Attention weights ɑt(s): ɑt(s) = softmax_s( score(ht, h̄s) )
Attention output (ct): the output of the attention mechanism, a weighted sum of the Values according to the attention weights: ct = Σ_s ɑt(s) h̄s. It is combined with ht to predict the decoder output.
NLG Head to Toe @hadyelsahar
<s> j’ ai faim
RNN RNN RNN RNN
RNN
<s>
Feed fwd
ht
Attention Mechanism (Luong et al. 2016)
ct
NLG Head to Toe @hadyelsahar
Attention Mechanism (Luong et al. 2016)
3 types of attention score calculations (the output of each score function is a single float):
- dot:     score(ht, h̄s) = htᵀ h̄s
- general: score(ht, h̄s) = htᵀ Wa h̄s            (Wa trainable)
- concat:  score(ht, h̄s) = vaᵀ tanh( Wa [ht ; h̄s] )   (Wa, va trainable)
NLG Head to Toe @hadyelsahar
Attention Mechanism (Luong et al. 2016)
3 types of attention score calculations: dot, general and concat (see the previous slide).
Global, local and location-based attention are other variants; read about them in the paper.
Effective Approaches to Attention-based Neural Machine Translation (Luong et al. 2016)
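A minimal sketch of the “dot” variant for a single decoder step (the shapes are assumptions; the other variants only change the score function):

```python
import torch

def luong_dot_attention(h_t, encoder_states):
    """h_t: (H,) decoder hidden state (query); encoder_states: (S, H) keys/values."""
    scores = encoder_states @ h_t           # score(h_t, h_s) = h_t . h_s, one float per source token
    alphas = torch.softmax(scores, dim=-1)  # attention weights a_t(s)
    c_t = alphas @ encoder_states           # weighted average of the encoder hidden states
    return c_t, alphas                      # c_t is combined with h_t to predict the output
```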
NLG Head to Toe @hadyelsahar
FYI - Break
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015)
Effective Approaches to Attention-based Neural Machine Translation (luong et al. 2016)
In Machine Translation, visualizing attention weights is a common practice.
This shows which words in the source sentence are important for the output of each word in the target (alignments).
These alignments are learned end-to-end, without explicit alignments between tokens in the source x and target y sentences.
NLG Head to Toe @hadyelsahar
FYI - Break
Seq2Seq models are one of the landmarks of the Deep Learning revolution for NLP.
But they soon got taken over by self-attention (Transformers), which we will see in the next slides.
[Timeline 2014–2021: Sutskever et al. 2014; Bahdanau et al. 2015; Hermann et al. NeurIPS 2015; Rush et al. EMNLP 2015; Luong et al. 2016; Transformers 2017; GPT 2018; …]
NLG Head to Toe @hadyelsahar
Part 2: Stuff with Attention
- Conditional Language Model
- Seq2seq Encoder-Decoder Models
- Seq2seq Encoder-Decoder with Attention
- Transformers (Self-Attention)
- Open Problems of NLG Models trained on uncurated
Web text
NLG Head to Toe @hadyelsahar
Attention Is All You Need
Ashish Vaswani∗ Noam Shazeer∗ Niki Parmar∗ Jakob Uszkoreit∗ Llion Jones∗ Aidan N. Gomez∗ † Łukasz Kaiser∗
Neurips2017
Transformers
“Attention is all you need”
Components: encoder self-attention, decoder self-attention, encoder-decoder attention, residual (skip) connections, positional encoding.
                              Seq2seq        Seq2seq with Attention   Transformers
Encoding input                RNN            RNN                      Attention
Decoding output               RNN            RNN                      Attention
Encoder-Decoder interaction   Fixed vector   Attention                Attention
(Idea from lena-voita.github.io/nlp_course)
N layers of encoder and N layers of decoder; the last layer of the encoder is connected to all decoder layers with encoder-decoder attention.
NLG Head to Toe @hadyelsahar
Attention Is All You Need
Ashish Vaswani∗ Noam Shazeer∗ Niki Parmar∗ Jakob Uszkoreit∗ Llion Jones∗ Aidan N. Gomez∗ † Łukasz Kaiser∗
Neurips2017
Transformers
Lots of great resources to learn about Transformers:
- Original blogpost from google
- Lena Voita’s NLP For you
- Micheal Phi’s Illustrated guide to transformers
- Jay Alammar’s The illustrated Transformer
- The annotated transformer (Alexander Rush, Vincent Nguyen and Guillaume Klein)
- Karpathy’s minGPT
“Attention is all you need”
- Replace RNN with Attention
- Representing Sequence order using positional encoding
- Two types of attention:
- Self Attention (new!)
- Encoder-Decoder Attention
- Multi-heads for attention
- Skip connections allows better stacking to larger number of
layers
Encoder
decoder
attention
Self attention
Residual
(skip)
connections
Positional
encoding
Decoder Self
attention
N layers encoder and N layers
Decoder.
The last layer of encoder is
connected to all decoder layers
with Enc-dec attention
NLG Head to Toe @hadyelsahar
Transformers
Encoder
decoder
attention
Self attention
Decoder Self
attention
Multi-Head
Attention
Multi-Head Attention consists of several attention layers
running in parallel.
Each one is a “scaled dot-product attention”
The holy grail of
the transformers
NLG Head to Toe @hadyelsahar
Query Key Value
“The concepts come from retrieval
systems. The search engine will map your
query against a set of keys associated with
candidate results in the database, then
present you the best matched videos
(values).”
https://stats.stackexchange.com/a/424127/22327
Multi-head Attention
NLG Head to Toe @hadyelsahar
Multi-head Attention
Encoder
Self
attention
Encoder
decoder
attention
Decoder Self
attention
3 instances of Multi-head attention:
1. Encoder self-attention
2. Decoder Self-attention (Masked)
3. Encoder-Decoder Attention
NLG Head to Toe @hadyelsahar
Multi-head Attention
Encoder
Self
attention
Encoder Encoder Encoder
Encoder “Self-Attention”
Represent each token in the encoder (input) by attending to the other tokens in the encoder (input).
Instead of encoding the whole input using an RNN, allow tokens to look at each other.
NLG Head to Toe @hadyelsahar
Multi-head Attention
Decoder Self
attention
Decoder Decoder Decoder
Decoder “Self-Attention”
Represent the previously generated tokens in the decoder by attending to the other (previous) tokens in the decoder.
Previously this role was played by the RNN decoder hidden state.
NLG Head to Toe @hadyelsahar
Encoder
decoder
attention
Encoder Encoder Decoder
Encoder-Decoder Attention
Multi-head Attention
The queries come from the decoder, while the keys and values come from the encoder.
Previously this was done by using the RNN decoder “hidden state” to attend over the encoder input representation.
NLG Head to Toe @hadyelsahar
Multi-head Attention (Deeper look)
Still how does
multi-head attention
work?
NLG Head to Toe @hadyelsahar
Multi-head Attention (Deeper look)
Step 1: Linear projection of Key Query and Values
Wv
Wk
Wq
V K Q
NLG Head to Toe @hadyelsahar
Multi-head Attention (Deeper look)
Step 2: Dot product of Queries and Keys
K
Q
NLG Head to Toe @hadyelsahar
Multi-head Attention (Deeper look)
K
Q
x =
Attention scores
elephants
are
smart
elephants
are
smart
Step 2: Dot product of Queries and Keys
2 5 3
5 1 4
3 4 3
Q
K
NLG Head to Toe @hadyelsahar
Multi-head Attention (Deeper look)
2 5 3
5 1 4
3 4 3
Step 3 : Scale down Attention scores
divide by the square root of the dimension of query and key
“We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients”
NLG Head to Toe @hadyelsahar
Multi-head Attention (Deeper look)
2 5 3
5 1 4
3 4 3
Step 4 : Softmax of the Scaled Scores
Softmax across the Key dimension
(Each row)
Q
K
NLG Head to Toe @hadyelsahar
Multi-head Attention (Deeper look)
0.2 0.5 0.3
0.5 0.1 0.4
0.3 0.4 0.3
Step 4 : Softmax of the Scaled Scores
Q
K
NLG Head to Toe @hadyelsahar
Multi-head Attention (Deeper look)
0.2 0.5 0.3
0.5 0.1 0.4
0.3 0.4 0.3
Step 5 : Multiply Softmax to values
V
x
K2
K1
K3
Q1
Q2
Q3
V1
V2
V3
In this example we end up with 3 vectors, each corresponding to the return of Q1, Q2 and Q3. Each is a weighted average of the value vectors according to the attention weights.
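The five steps above, for a single head, amount to the following sketch (Q, K, V are the matrices obtained from the linear projections of step 1):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k). Steps 2-5: dot products, scale, softmax over keys, weighted sum."""
    d_k = Q.size(-1)
    scores = Q @ K.T / math.sqrt(d_k)        # (seq_len, seq_len) attention scores, scaled down
    weights = torch.softmax(scores, dim=-1)  # softmax across the key dimension (each row)
    return weights @ V                       # one output per query: a weighted average of the values
```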
NLG Head to Toe @hadyelsahar
Multi-head Attention (Deeper look)
What does it mean to be “Multi-head”
- Multiple parallel heads focus on different things each.
- Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention
with full dimensionality.
Each attention head applies its own linear projections to V, K and Q (here there are 2 heads).
The per-head dimension is scaled down to dmodel / h, where h is the number of parallel heads (= 2 here).
Embeddings of each token in the input have size dmodel; the output of each head has size dmodel / h.
The outputs of the attention heads are concatenated and projected with Wo,
giving the output of a “2 head” multi-head attention
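Continuing the NumPy sketch above (it reuses scaled_dot_product_attention, rng and X from there), a rough illustration of running several heads in parallel and concatenating their outputs; the shapes are assumptions, not the paper's exact configuration.

def multi_head_attention(X, heads, Wo):
    # heads: one (Wq, Wk, Wv) triple per head, each projecting d_model -> d_model / h
    outputs = [scaled_dot_product_attention(X, Wq, Wk, Wv)[0]
               for (Wq, Wk, Wv) in heads]
    # concatenate the per-head outputs back to d_model, then project with Wo
    return np.concatenate(outputs, axis=-1) @ Wo

# 2 heads, d_model = 8, per-head dimension = 8 / 2 = 4
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
Wo = rng.normal(size=(8, 8))
print(multi_head_attention(X, heads, Wo).shape)   # (3, 8)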
NLG Head to Toe @hadyelsahar
Transformer in order
Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder Self
attention
Multi-Head Attention consists of several attention layers running in parallel.
1
3
4
Residual
(skip)
connections
2
2
Now we know what is meant
by “self-attention” and
“multi-head attention”
Still too many
parts ...
5 Output layer
NLG Head to Toe @hadyelsahar
Transformer in order
Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder Self
attention
1
3
4
Residual
(skip)
connections
2
2
5 Output layer
NLG Head to Toe @hadyelsahar
Embedding layer
Elephants are smart
Word embeddings
Positional embeddings
Positional embeddings
Since we are not using RNNs, positional embeddings keep
word-order information
Elephants are smart
0 1 2
smart are elephants
2 1 0
Odd Index: create a vector using the cos function.
Even index: create a vector using the sin function.
- Motivation: would allow the model to easily learn to attend by
relative positions.
- In practice: performs about the same as learned “positional embeddings”
- Allows longer sequences at test time than those seen during
training
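A small sketch of the sinusoidal positional encoding described above (sin on even indices, cos on odd indices); max_len and d_model are arbitrary values picked for illustration.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]     # position in the sequence
    i = np.arange(d_model)[None, :]       # dimension index
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even indices: sin
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd indices: cos
    return pe

# word embeddings and positional embeddings are simply summed:
# input = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(50, 8).shape)   # (50, 8)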
NLG Head to Toe @hadyelsahar
Transformer in order
Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder Self
attention
Multi-Head Attention consists of several attention layers running in parallel.
1
3
4
Residual
(skip)
connections
2
2
NLG Head to Toe @hadyelsahar
Encoder Self-attention
Multi-Head
Attention
Elephants are smart
The input will be copied
3 times as Q, K and V
NLG Head to Toe @hadyelsahar
Encoder Residual connections
Elephants
Deep residual learning for image recognition CVPR 2015
https://arxiv.org/pdf/1512.03385.pdf
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
“We provide comprehensive
empirical evidence showing
that these residual networks
are easier to optimize, and
can gain accuracy from
considerably increased
depth”
Residual
(skip)
connections
The signal flows from
the bottom of the layer
to its top through the
skip connection and is
added to the layer output.
are smart
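A minimal sketch of a residual (skip) connection wrapped around a sublayer, in the spirit of the Transformer's “Add & Norm”; layer_norm here is a simplified stand-in, not the exact implementation.

import numpy as np

def layer_norm(x, eps=1e-6):
    # simplified layer normalization over the feature dimension
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    # the input skips over the sublayer and is added back to its output
    return layer_norm(x + sublayer(x))

# e.g. wrapping the multi-head attention sketch from earlier:
# out = residual_block(X, lambda h: multi_head_attention(h, heads, Wo))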
NLG Head to Toe @hadyelsahar
Transformer in order
Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder Self
attention
Multi-Head Attention consists of several attention layers running in parallel.
1
3
4
Residual
(skip)
connections
2
2
5 Output layer
NLG Head to Toe @hadyelsahar
Transformer in order
Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder
Masked Self
attention
Multi-Head Attention consists of several attention layers running in parallel.
1
3
4
Residual
(skip)
connections
2
2
5 Output layer
NLG Head to Toe @hadyelsahar
Decoder (Masked) Self-attention
éléphants sont intelligents
Imagine a machine translation example
Source(input) elephants are smart
Target(output) éléphants sont intelligents
sont
We are at time step 2:
The Transformer generated a probability distribution
corresponding to the correct token “éléphants”.
Now it is expected to generate a probability
distribution corresponding to the token “sont”
<BOS>
NLG Head to Toe @hadyelsahar
Decoder (Masked) Self-attention
Imagine a machine translation example
Source(input) elephants are smart
Target(output) éléphants sont intelligents
sont
We are at time step 2:
The Transformer generated a probability distribution
corresponding to the correct token “éléphants”.
Now it is expected to generate a probability
distribution corresponding to the token “sont”.
Problem: the answer is already given!
The model has nothing to predict; it will
just echo the input to the output.
éléphants sont intelligents
<BOS>
NLG Head to Toe @hadyelsahar
Decoder (Masked) Self-attention
sont
Masked Self-Attention prevents the decoder from looking ahead.
This is done inside the decoder’s
multi-head self-attention
éléphants
sont
intelligents
éléphants
sont
intelligents
1 4 4 1
4 3 2 1
4 2 3 1
1 1 1 2
Q
K
éléphants sont intelligents
<BOS>
<BOS>
<BOS>
0 -inf -inf -inf
0 0 -inf -inf
0 0 0 -inf
0 0 0 0
x
Timestep 1: When the target
word is “éléphants” you can only
get values corresponding to
“<BOS>” from the input
Timestep 2: When the target
word is “sont”
you can only see “<BOS>” and
“éléphants”
Mask
NLG Head to Toe @hadyelsahar
Decoder (Masked) Self-attention
<EOS>
Masked Self-Attention prevents the decoder from looking ahead.
This is done inside the decoder’s
multi-head self-attention
éléphants
sont
intelligents
éléphants
sont
intelligents
1 4 4 1
4 3 2 1
4 2 3 1
1 1 1 2
Q
K
éléphants sont intelligents
<BOS>
<BOS>
<BOS>
0 -inf -inf -inf
0 0 -inf -inf
0 0 0 -inf
0 0 0 0
x
Timestep 4: When the target
word is “<EOS>”
you can see the whole
decoder input.
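A short sketch of how such a causal (look-ahead) mask can be built and added to the attention scores before the softmax; the scores here are random toy numbers, not the slides' example.

import numpy as np

def causal_mask(seq_len):
    # 0 where attending is allowed (current and previous positions),
    # -inf where the decoder would be looking ahead
    ahead = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal
    return np.where(ahead == 1, -np.inf, 0.0)

scores = np.random.default_rng(0).normal(size=(4, 4))   # raw Q.K^T / sqrt(d_k) scores
masked_scores = scores + causal_mask(4)
# after the softmax, every -inf entry becomes exactly 0 attention weight
print(causal_mask(4))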
NLG Head to Toe @hadyelsahar
Transformer in order
Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder
Masked Self
attention
Multi-Head Attention consists of several attention layers running in parallel.
1
3
4
Residual
(skip)
connections
2
2
5 Output layer
NLG Head to Toe @hadyelsahar
Transformer in order
Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder
Masked Self
attention
Multi-Head Attention consists of several attention layers running in parallel.
1
3
4
Residual
(skip)
connections
2
2
5 Output layer
NLG Head to Toe @hadyelsahar
Decoder-Encoder Attention
Encoder
decoder
attention
éléphants sont intelligents
<BOS>
Elephants are smart Elephants are smart
Embeddings of the
last layer of the
encoder are used as
keys and values
Embeddings of
the *layer N* of
the Decoder are
used as queries
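A minimal sketch of this encoder-decoder attention, reusing the softmax helper from the earlier sketch; the names and shapes are illustrative assumptions.

def encoder_decoder_attention(dec_h, enc_out, Wq, Wk, Wv):
    # queries come from the decoder layer; keys and values
    # come from the last encoder layer's output
    Q = dec_h @ Wq
    K, V = enc_out @ Wk, enc_out @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V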
NLG Head to Toe @hadyelsahar
Transformer in order
Encoder
Self
attention
Encoder
decoder
attention
Decoder Self
attention
1
3
5
Residual
(skip)
connections
2
Almost there!
Output layer
NLG Head to Toe @hadyelsahar
éléphants sont intelligents
<BOS>
Output Layer + Loss Function
The output of each
transformer layer has the
same dimension (dmodel) as
the encoded input
Linear transformation: each dmodel-dimensional output vector is multiplied by a
(dmodel × |Vocab|) matrix, giving |Vocab| logits, followed by a softmax over the vocabulary.
éléphants
sont
intelligents
<EOS>
No RNN here: feed all output tokens at once and
calculate the loss; MASKED self-attention will take care
of illegal connections
cost
cost
cost
cost
Reference sentence
(one-hot encoded) used as the targets
Reference sentence delayed by 1
time step fed as decoder input
Same cross-entropy
loss as before
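A hedged sketch of the output layer and the teacher-forced cross-entropy loss: decoder outputs are projected to vocabulary logits and compared with the reference tokens shifted by one position. The toy vocabulary, shapes and token ids are assumptions for illustration.

import numpy as np

def sequence_cross_entropy(hidden, Wout, target_ids):
    # hidden: (seq_len, d_model) decoder outputs; Wout: (d_model, |Vocab|)
    logits = hidden @ Wout
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # negative log-likelihood of the reference token at every time step
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return nll.mean()

# decoder input:  <BOS> éléphants sont intelligents   (reference delayed by 1)
# targets:        éléphants sont intelligents <EOS>
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))     # pretend decoder outputs, d_model = 8
Wout = rng.normal(size=(8, 6))       # toy vocabulary of 6 tokens
targets = np.array([2, 3, 4, 1])     # ids of the shifted reference tokens
print(sequence_cross_entropy(hidden, Wout, targets))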
NLG Head to Toe @hadyelsahar
Question Time
Encoder
Self
attention
Encoder
decoder
attention
Multi-Head
Attention
Decoder Self
attention
1
3
4
Residual
(skip)
connections
2
2
5 Output layer
NLG Head to Toe @hadyelsahar
Q: Transformers are seq2seq; can we
use them for unconditional LM?
NLG Head to Toe @hadyelsahar
Q: Transformers are seq2seq; can we
use them for unconditional LM?
Yes but redundant
Elephants are Elephants are
Smart
NLG Head to Toe @hadyelsahar
Q: Transformers are seq2seq; can we
use them for unconditional LM?
GPT
Yes but redundant
Masked
Multi-Head
Attention
Add &Norm
Add & Norm
Feed Forward
Decoder only Transformer
Elephants are Elephants are
Smart
- No Encoder
- No Decoder-Encoder
Attention
Last Technical slide
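Since a GPT-style decoder-only Transformer is just the decoder stack without encoder-decoder attention, here is a rough sketch of one block, reusing softmax, causal_mask and residual_block from the earlier sketches (names and shapes are illustrative assumptions, not GPT's actual implementation).

def masked_attention_head(X, Wq, Wk, Wv):
    # scaled dot-product attention with the causal mask added to the scores
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(len(X))
    return softmax(scores, axis=-1) @ V

def gpt_block(x, heads, Wo, W1, W2):
    # masked multi-head self-attention + Add & Norm
    attn = lambda h: np.concatenate(
        [masked_attention_head(h, *head) for head in heads], axis=-1) @ Wo
    h = residual_block(x, attn)
    # position-wise feed-forward (ReLU) + Add & Norm
    return residual_block(h, lambda z: np.maximum(0.0, z @ W1) @ W2)

# no encoder and no encoder-decoder attention: the block only sees the
# (masked) previous target tokens, so it can be trained as an unconditional LM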
NLG Head to Toe @hadyelsahar
Q: How do such “transform”ative ideas come up?
A: Good teamwork between 8 authors
FYI - Break
“Equal contribution. Listing order is random.
Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea.
Ashish, with Illia, designed and implemented the first Transformer models.
Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position
representation.
Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and
tensor2tensor.
Llion also experimented with novel model variants, our initial codebase, and efficient inference and
visualizations.
Lukasz and Aidan designing various parts of and implementing tensor2tensor library”
NLG Head to Toe @hadyelsahar
NLG head to toe- @Hadyelsahar
Part 2: Stuff with Attention
- Conditional Language Model
- Seq2seq Encoder-Decoder Models
- Seq2seq Encoder-Decoder with Attention
- Transformers (Self-Attention)
- Open Problems of NLG Models trained
on uncurated Web text
NLG Head to Toe @hadyelsahar
NLG head to toe- @Hadyelsahar
Large Language Models
And Their Dangers
NLG Head to Toe @hadyelsahar
NLG Trends timeline (2014-2021): Seq2Seq (Sutskever et al. 2014) → Attention (Bahdanau et al. 2015) →
Abstractive summarization (Rush et al. EMNLP 2015; Hermann et al. NeurIPS 2015) → Transformers (2017) →
GPT (2018) → Stochastic Parrots (Gebru et al. FAccT 2021)
GPT
“Our approach is a combination
of two existing ideas:
transformers and unsupervised
pre-training.”
GPT was originally
created for NLU!
NLG Head to Toe @hadyelsahar
Sources: The Guardian, the next web
“As an experiment in responsible
disclosure, we are instead releasing a
much smaller model for researchers to
experiment with.”
NLG Head to Toe @hadyelsahar
Big claims on
unprecedented
generation
capabilities!
NLG Head to Toe @hadyelsahar
Sources: The Guardian, the next web
NLG Head to Toe @hadyelsahar
Criticisms of Neural Language Generation!
Neural Unicorns
should be put
on a leash
NLG Head to Toe @hadyelsahar
Holtzman et al. ICLR2020
Degeneration
NLG Head to Toe @hadyelsahar
Symbolic AI & Semantic Correctness
“for a human or a machine to
learn a language, they must
solve what Harnad (1990) calls
the symbol grounding
problem.”
Form vs Meaning
Fluency ≠ Semantic Correctness
O observes that certain words tend to occur in similar
contexts .. learns to generalize across lexical patterns
by hypothesizing that they can be used
interchangeably.
O has never observed these objects, and thus would not
be able to pick out the referent of a word when
presented with a set of (physical) alternatives.
NLG Head to Toe @hadyelsahar
Petroni et al. EMNLP 2019
Jiang et al. TACL 2020
Factual Correctness
Kassner et al. ACL 2020
NLG Head to Toe @hadyelsahar
Discussions around AI Ethics 🚨
Gender Shades: Buolamwini 2017
NLG Head to Toe @hadyelsahar
Microsoft Tay
(racist Chatbot)
source: MIT Technology Review
source: https://twitter.com/minimaxir/
NLG Head to Toe @hadyelsahar
It is not only about single examples. It’s also a distributional bias.
Abubakar Abid, keynote at the #MuslimsInAI workshop at NeurIPS 2020
https://twitter.com/shakir_za/status/1336335755656929288?lang=en
https://twitter.com/abidlabs/status/1291165311329341440?lang=en
NLG Head to Toe @hadyelsahar
Prates et al. Neural Computation 2019
Distributional Bias
NLG Head to Toe @hadyelsahar
Distributional Bias
Stanovsky et al. ACL2019
NLG Head to Toe @hadyelsahar
Distributional Bias (Open ended NLG )
Sheng et al. EMNLP 2019
Sentiment
NLG Head to Toe @hadyelsahar
Distributional Bias (cloze style)
Nadeem et al. 2020
NLG Head to Toe @hadyelsahar
Timnit Gebru [left] and Margaret Mitchell [right] were fired from Google over the “Stochastic Parrots” paper
https://www.wired.com/story/second-ai-researcher-says-fired-google/
Read the paper (Bender et al. FAccT21)
NLG Head to Toe @hadyelsahar
NLG head to toe- @Hadyelsahar
Huh .. ?
Hady Elsahar
@hadyelsahar
Hady.elsahar@naverlabs.com
That’s All folks
Help me make this tutorial better: please participate
in this anonymous survey:
https://forms.gle/Xr93EFiY2zStksMK8
Also reach out for
feedback or questions
Neural Language Generation Head to Toe

  • 1.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Hady Elsahar @hadyelsahar Hady.elsahar@naverlabs.com 1 Neural Language Generation Head to Toe
  • 2.
    NLG Head toToe @hadyelsahar PhD. 2019 Research Scientist 2020 About Me http://hadyelsahar.io Intern. 2018 Intern. Intern. Research interests: Controlled NLG - Distributional Control of Language generation - Energy Based Models, MCMC - Self-supervised NLG Domain Adaptation - Domain shift detection - Model Calibration Side Gigs I actively participate in wikipedia research. I build tools to help editors from under-resourced language. (like Scribe) Masakhane A grassroots NLP community for Africa, by Africans https://www.masakhane.io/ 2013 2014 Masters 2015 Side gigs @hadyelsahar
  • 3.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Great Resources Online That helped developing this tutorial Lena Voita - NLP Course | For You https://lena-voita.github.io/nlp_course.html CS224n: Natural Language Processing with Deep Learning Stanford / Winter 2021 http://web.stanford.edu/class/cs224n/ Lisbon Machine Learning School http://lxmls.it.pt/2020/ Speech and Language Processing [Book] Dan Jurafsky and James H. Martin https://web.stanford.edu/~jurafsky/slp3/ Courses Books You are smart & the Internet is full of great resources. But might be also confusing, many of the parts in this tutorials took me hours to grasp. I am only here to make it easier for you :) Nothing in this tutorial you cannot find online
  • 4.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar What are we going to Learn today? - Introduction to Language Modeling - Recurrent Neural Networks (RNN) - How to generate text from Neural Networks - How to Evaluate Language Models Seq2seq Models - Conditional Language Model - Seq2seq with Attention - Transformers Open Problems of NLG Stochastic Parrots Part 1: Language Modeling Part 2: Stuff with Attention
  • 5.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Part 1: Language Modeling - Introduction to Language Modeling - N-gram Language Models - Neural Language models - Recurrent Neural Networks (RNN) - Generating text from Language Models (fun!) - Temperature Scaling - Top-k / Nucleus Sampling - LM Evaluation: - Cross Entropy, Perplexity
  • 6.
    NLG Head toToe @hadyelsahar Language Modeling What is language modeling? Assigning probabilities to sequences of words (or tokens). P(“The cat sits on the mat”) = 0.0001 P(“Imhotep was an Egyptian chancellor to the Pharaoh Djoser”) = 0.004 P(“”Zarathushtra was an ancient spiritual leader who founded Zoroastrianism.) = 0.005 Why on earth? ...
  • 7.
    NLG Head toToe @hadyelsahar Language Modeling Language models are used everywhere! Search Engines P(“global warming is caused by”) = 0.1 P(“global warming is a phenomenon related to ”) = 0.05 P(“global warming is due to”) = 0.03 ... You can rank sentences (search Queries) by their probability
  • 8.
    NLG Head toToe @hadyelsahar Language Modeling Language models are used everywhere! Spell checkers P(“... is Wednesday ….”) = 0.1 P(“... is Wendnesday …. ”) = 0.000005 ... You can recommend rewritings based on probability of sequences.
  • 9.
    NLG Head toToe @hadyelsahar Conditional Language Modeling We can assign probabilities to sequences of words Conditioned on a Context What is a context ? ... Context can be anything: Image ● Speech ● Text ● Text in another language P(“I am hungry” | “J’ai Faim”) = 0.1 P(“I am happy” | “J’ai Faim”) = 0.002 P(“I was hungry” | “J’ai Faim”) = 0.0002 P(“he am hungry” | “J’ai Faim”) = 0.00001
  • 10.
    NLG Head toToe @hadyelsahar Conditional Language Modeling Conditional Language models are even more popular Machine translation P(“I am hungry” | “J’ai Faim”) = 0.1 P(“I am happy” | “J’ai Faim”) = 0.002 P(“I was hungry” | “J’ai Faim”) = 0.0002 P(“he am hungry” | “J’ai Faim”) = 0.00001 This is a notation for conditional probability Sequences with correct translations are given high probability
  • 11.
    NLG Head toToe @hadyelsahar Conditional Language Modeling Conditional Language models are even more popular Abstractive Summarization P( “COV D numbers are dropping down” | 📄📄📄) = 0.6 P( “COV D stats in France” | 📄📄📄) = 0.005 P( “ COV D the coronavirus pandemic” | 📄📄📄) = 0.0001 Better Summaries are given high probabilities
  • 12.
    NLG Head toToe @hadyelsahar Conditional Language Modeling Conditional Language models are even more popular Image captioning P( “A dog eating in the park” | 🐕 🍔🌳 ) = 0.6 P( “A dog in the park” | 🐕 🍔🌳 ) = 0.03 P( “A cat in the tree” | 🐕 🍔🌳 ) = 0.002 Sequences with correct captions are given high probability We will learn that with our first language model Still, How to get those probabilities ? ...
  • 13.
    NLG Head toToe @hadyelsahar FYI - Break P(“The cat sits on the mat”) = 0.0001 Now you know why assigning probabilities to sequences of words is important. Letʼs see how we can do that. But first .. Language Modeling P(“I am hungry” | “J’ai Faim”) = 0.1 (Conditional) Language Modeling We will start by this one since it is simpler We will get to this one later
  • 14.
    NLG Head toToe @hadyelsahar 🍼 A Very Naive Language Model Calculate probability of a sentence given in a large amount of text. P(“Elephants are the largest existing land animals.”) = ?? A dataset of 100k sentences Count(“Elephants are the largest existing land animals.”) = 15 P(“Elephants are the largest existing land animals.”) = 15/100k = 0.00015 Mmmm.. What could possibly go wrong?
  • 15.
    NLG Head toToe @hadyelsahar 🍼 A Very Naive Language Model Letʼs use it for spell checking All sequences are equally wrong?... Count (“ My Favourite day of the week is Wednesday ….”) = 0.00000 Count (“ My Favourite day of the week is Wendnesday ….”) = 0.00000 Count ( “ asdal;qw;e@@k__+0$%$^%….”) = 0.00000 P(“ My Favourite day of the week is Wednesday ….”) = 0.00000 P(“ My Favourite day of the week is Wendnesday ….”) = 0.00000 P(““ asdal;qw;e@@k__+0$%$^%….”) = 0.00000 Sequences are not in the dataset A dataset of 100k sentences
  • 16.
    NLG Head toToe @hadyelsahar Question Time How many unique (valid or invalid) sentences of length 10 words can we make out of english language? If the Number of unique words in english = 1Million
  • 17.
    NLG Head toToe @hadyelsahar Question Time If the Number of unique words in english = 1Million Answer: 1M x 1M x 1M … (10 times) = 1000000 10 = 1 x 10 60 Each time we have 1M word to select from. How many unique (valid or invalid) sentences of length 10 words can we make out of english language?
  • 18.
    NLG Head toToe @hadyelsahar Question Time How many unique sentences of MAX length 10 words can we make out of english language? Number of unique words in english = 1Million
  • 19.
    NLG Head toToe @hadyelsahar Question Time How many unique sentences of MAX length 10 words can we make out of english language? Number of unique words in english = 1Million Answer: 1M + 1M 2 + 1M 3 + 1M 4 + …. + 1M 10 = 1.00000160 All sequences of length 1 word All sequences of length 3 word
  • 20.
    NLG Head toToe @hadyelsahar Combinatorial explosion In Language Generation Log scale Number of possible english sentences of length 50 words is ~ 1x 10660 Number of atoms in the universe ~ 1x 1082 No dataset can have such number of sentences. Most of sentences will have zero probabilities.
  • 21.
    NLG Head toToe @hadyelsahar Using the chain rule, as follows: P(Elephants are smart animals) = P(Elephants) x P(are |Elephants ) x P(smart | Elephants , are) x P(animals| Elephants , are , smart) These are quite easy to find in a limited size corpus. Part of the problem is already solved! This are still hard to find, we will learn how to deal with those at a later point. Atomic units of calculating probabilities Became words instead of full sentences From Sequence Modeling To modeling the probability of the next word
  • 22.
    NLG Head toToe @hadyelsahar From Sequence Modeling To modeling the probability of the next word Using the chain rule, as follows: w<t is the notation for all Words before time step t W (in bold) is a sequence of N words w1 w2 ,.... wN
  • 23.
    NLG Head toToe @hadyelsahar FYI - Break There are terms associated to this method of modeling language: “Left to Right language modeling” , “autoregressive Language models” Other ways of modeling language (not discussed in this tutorial). Bidirectional Language Modeling Words in sentences are Generated independently Right context Non-autoregressive Language Models
  • 24.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Language Modeling - Introduction to Language Modeling - N-gram Language Models (Recap) - Neural Language models - Recurrent Neural Networks (RNN) - Generating text from Language Models (fun!) - Temperature Scaling - Top-k / Nucleus Sampling - LM Evaluation: - Cross Entropy, Perplexity
  • 25.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar N-Gram Language Models
  • 26.
    NLG Head toToe @hadyelsahar N-Gram Language Models sequence unigrams bigrams trigrams n=1 n=2 n=3 Elephants are smart animals ● Elephants ● are ● smart ● animals ● Elephants are ● are smart ● smart animals ● Elephants are smart ● are smart animals What is an N-gram? ...
  • 27.
    NLG Head toToe @hadyelsahar N-Gram Language Models P(Elephants are smart animals they are quite big) = P(Elephants) x P(are |Elephants ) x P(smart | Elephants , are) x P(animals| Elephants , are , smart) x P(they | Elephants , are , smart, animals ) x P(are | Elephants , are , smart, animals , they) x P(quite | Elephants , are , smart, animals , they , are) x P(big | Elephants , are , smart, animals , they , are, quite) These are quite easy to find and count in a limited size corpus. We will learn how to deal with those now! Recall this problem?
  • 28.
    NLG Head toToe @hadyelsahar N-Gram Language Models P(Elephants are smart animals they are quite big) = P(Elephants) x P(are |Elephants ) x P(smart | Elephants , are) x P(animals| Elephants , are , smart) x P(they | Elephants , are , smart, animals ) x P(are | Elephants , are , smart, animals , they) x P(quite | Elephants , are , smart, animals , they , are) x P(big | Elephants , are , smart, animals , they , are, quite) Assumption: N-gram language models uses the assumption that a probability of a word only depends on it N previous tokens. P(Elephants are smart animals they are quite big) = P(Elephants) x P(are |Elephants ) x P(smart | Elephants , are) x P(animals| Elephants , are , smart) x P(they | Elephants , are , smart, animals ) x P(are | Elephants , are , smart, animals , they) x P(quite | Elephants , are , smart, animals , they , are) x P(big | Elephants , are , smart, animals , they , are, quite) Tri-gram language model
  • 29.
    NLG Head toToe @hadyelsahar N-Gram Language Models P(Elephants are smart animals they are quite big) = P(Elephants) x P(are |Elephants ) x P(smart | Elephants , are) x P(animals| Elephants , are , smart) x P(they | Elephants , are , smart, animals ) x P(are | Elephants , are , smart, animals , they) x P(quite | Elephants , are , smart, animals , they , are) x P(big | Elephants , are , smart, animals , they , are, quite) Assumption: N-gram language models uses the assumption that a probability of a word only depends on it N previous tokens. P(Elephants are smart animals they are quite big) = P(Elephants) x P(are |Elephants ) x P(smart | Elephants , are) x P(animals| Elephants , are , smart) x P(they | are , smart, animals ) x P(are | smart, animals , they) x P(quite | animals , they , are) x P(big | they , are, quite) Tri-gram language model Now these are easier to find and count in a corpus
  • 30.
    NLG Head toToe @hadyelsahar FYI - Break Markov Property Born 14 June 1856 N.S. Ryazan, Russian Empire Died 20 July 1922 (aged 66) Petrograd, Russian SFSR Nationality Russian https://en.wikipedia.org/wiki/Markov_property P(Elephants are smart animals they are quite big) = P(Elephants) x P(are |Elephants ) x P(smart | Elephants , are) x P(animals| Elephants , are , smart) x P(they | Elephants , are , smart, animals ) x P(are | Elephants , are , smart, animals , they) x P(quite | Elephants , are , smart, animals , they , are) x P(big | Elephants , are , smart, animals , they , are, quite)
  • 31.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Language Modeling - Introduction to Language Modeling - N-gram Language Models - Neural Language models - Recurrent Neural Networks (RNN) - Generating text from Language Models (fun!) - Temperature Scaling - Top-k / Nucleus Sampling - LM Evaluation: - Cross Entropy, Perplexity
  • 32.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Neural Language models Recurrent Neural Networks (RNN)
  • 33.
    NLG Head toToe @hadyelsahar Neural Language Modeling Remember this? w<t is the notation for all Words before time step t * empty sequence We are going to learn a neural network for this Ө are the learnable params of the Neural network. Animation from NLP course | For you by Lena Voita https://lena-voita.github.io/nlp_course/language_modeling.html
  • 34.
    NLG Head toToe @hadyelsahar Language Modeling using Feed Fwd NN Ө are the learnable params of the Neural network. Өe Өw 1 hot encoding of Words Embedding layer Word embeddings ; |V| x hembed L x |V| L x hembed L hembed x 1 concatenate L hembed x |V| T |V| x 1 |V| x 1 softmax Projection layer These are usually called logits Prob distribution over words in vocab Can we use a Feed Fwd Neural Network? elephants are the smartest PӨ (animals | elephants, are, the, smartest) = ? animals = 0.1 Fixed width Of L tokens
  • 35.
    NLG Head toToe @hadyelsahar Өe Өw 1 hot encoding of Words Embedding layer Word embeddings ; |V| x hembed L x |V| L x hembed L hembed x 1 concatenate L hembed x |V| T |V| x 1 |V| x 1 softmax Projection layer These are usually called logits Prob distribution over words in vocab Can we use a Feed Fwd Neural Network? elephants are the smartest PӨ (animals | elephants, are, the, smartest) = ? animals = 0.1 Fixed width Of L tokens Here L = 4 PӨ (they| elephants, are, the, smartest, animals) → L > 4 (not possible) PӨ (they| elephants, are, the, smartest, animals) → Markov assumption (now possible) PӨ (the| elephants, are) → L < 4 (not possible) PӨ (they|<MASK> , <MASK> , elephants , are) → adding dummy tokens (now possible) In practice this isn’t a good idea. The Problem with fixed length input
  • 36.
    NLG Head toToe @hadyelsahar 1. Recurrent (calculated repeatedly) 2. Great with Languages ! 3. Have an internal memory called “Hidden state” 4. Can model infinite length sequences 5. No need for markov assumption Recurrent Neural Networks 1 hot encoding of Words 1 x |V| h1 h<BOS> elephants RNN + P(are | elephants) = 0.1 are RNN + P(smart| elephants , are) = 0.2 h2 h3 ……. ……. Wi Wo 1 hot encoding of Words Embedding layer Word embeddings |V| x Hembed 1 x |V| 1 x Hembed 1 x |V| softmax V 1 x |V| Output Embedding layer |V| x Hstate + Hidden States ht 1 x Hstate Previous hidden sate ht-1 logits RNN U The core RNN is considered only this part. As it operates on sequences of continuous vectors elephants are = 0.1 Probability of each word in the vocabulary Select the one corresponds to the next token smart RNN + P(animals| elephants , are, smart) = 0.5 logits xt x. t yt ŷt
  • 37.
    NLG Head toToe @hadyelsahar Recurrent Neural Networks Wi Wo 1 hot encoding of Words Embedding layer Word embeddings |V| x Hembed 1 x |V| 1 x Hembed 1 x |V| softmax ŷt V 1 x |V| Output Embedding layer |V| x Hstate + Hidden States ht 1 x Hstate Previous hidden sate ht-1 RNN For simplicity the figure doesn’t include the bias terms b and c U Embed 1 hot vectors to word embeddings. Wi could be trained or kept frozen Calculation of the hidden state representations ht depends on ht-1 Project the hidden state into the output space. Calculate probabilities out of the logits logits xt x. t yt
  • 38.
    NLG Head toToe @hadyelsahar Recurrent Neural Networks Wi Wo 1 hot encoding of Words Embedding layer Word embeddings |V| x Hembed 1 x |V| 1 x Hembed 1 x |V| softmax ŷt V 1 x |V| Output Embedding layer |V| x Hstate + Hidden States ht 1 x Hstate Previous hidden sate ht-1 logits RNN For simplicity the figure doesn’t include the bias terms b and c xt U Embed 1 hot vectors to word embeddings. W i could be trained or kept frozen yi t x. t Calculation of the hidden state representations ht depends on ht-1 Project the hidden state into the output space. Calculate probabilities out of the logits Wi Wo V U are trainable parameters Ok but how to train them ? yt Prob. of word i at time step t given by the model
  • 39.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks (data preprocessing ) African elephants are the largest land animals on Earth. They are slightly larger than their Asian cousins. They can be identified by their larger ears …. 1) Collect large amount of free text
  • 40.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks (data preprocessing ) African elephants are the largest land animals on Earth. They are slightly larger than their Asian cousins. They can be identified by their larger ears …. African elephants are the largest land animals on Earth. They are slightly larger than their Asian cousins. They can be identified by their larger ears. 1) Collect large amount of free text 2) split into chunks ( e.g. sentence level tokenization)
  • 41.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks (data preprocessing ) African elephants are the largest land animals on Earth. They are slightly larger than their Asian cousins. They can be identified by their larger ears …. African elephants are the largest land animals on Earth. They are slightly larger than their Asian cousins. They can be identified by their larger ears. <s> African elephants are the largest land animals on Earth . </s> <s> They are slightly larger than their Asian cousins . </s> <s> They can be identified by their larger ears . </s> 1) Collect large amount of free text 2) split into chunks ( e.g. sentence level tokenization) 3) split into tokens (e.g. word, char, sub-word units) Add start of seq and end of seq tokens <s> </s>
  • 42.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks (data preprocessing ) African elephants are the largest land animals on Earth. They are slightly larger than their Asian cousins. They can be identified by their larger ears …. African elephants are the largest land animals on Earth. They are slightly larger than their Asian cousins. They can be identified by their larger ears. <s> African elephants are the largest land animals on Earth . </s> <s> They are slightly larger than their Asian cousins . </s> <s> They can be identified by their larger ears . </s> 1) Collect large amount of free text 2) split into chunks ( e.g. sentence level tokenization) 3) split into tokens (e.g. word, char, sub-word units) Add start of seq and end of seq tokens <s> </s> 0 <s> 1 </s> 2 They 3 on 4 elephants 5 African 6 larger .. 1200 ears 4) Build a vocabulary V More on tokenization in the next lecture
  • 43.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks (data preprocessing ) African elephants are the largest land animals on Earth. They are slightly larger than their Asian cousins. They can be identified by their larger ears …. African elephants are the largest land animals on Earth. They are slightly larger than their Asian cousins. They can be identified by their larger ears. <s> African elephants are the largest land animals on Earth . </s> <s> They are slightly larger than their Asian cousins . </s> <s> They can be identified by their larger ears . </s> 1) Collect large amount of free text 2) split into chunks ( e.g. sentence level tokenization) 3) split into tokens (e.g. word, char, sub-word units) Add start of seq and end of seq tokens <s> </s> 0 <s> 1 </s> 2 They 3 on 4 elephants 5 African 6 larger .. 1200 ears 4) Build a vocabulary V More on tokenization in the next lecture - [0, 2 , 4 ,5, 6, 7, 8, 8, 101, 22, 1] - [0, 22, 45, 65, 78, 9, 3, 4, 2, 1] - [0, 1, 23, 3, 4, 5, 65, 7, 7, 8, 1] 4) Index training data Each word (token) can be represented as a one hot vector now!
  • 44.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks [0, 2 , 4 ,5, 6, 10] <s> African cost <s> African elephants are smart </s>
  • 45.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks [0, 2 , 4 ,5, 6, 10] <s> African elephants cost cost <s> African elephants are smart </s>
  • 46.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks [0, 2 , 4 ,5, 6, 10] <s> African elephants African elephants are cost cost cost <s> African elephants are smart </s>
  • 47.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks [0, 2 , 4 ,5, 6, 10] <s> African elephants are African elephants are smart cost cost cost cost <s> African elephants are smart </s>
  • 48.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks [0, 2 , 4 ,5, 6, 10] <s> African elephants are smart African elephants are smart </s> cost cost cost cost cost Cross entropy loss <s> African elephants are smart </s>
  • 49.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks [0, 2 , 4 ,5, 6, 10] <s> African elephants are smart African elephants are smart </s> cost cost cost cost cost Cross entropy loss <s> African elephants are smart </s> Loss function Gradients updates through backpropagation
  • 50.
    NLG Head toToe @hadyelsahar Training Recurrent Neural Networks [0, 2 , 4 ,5, 6, 10] <s> African elephants are smart African elephants are smart </s> cost cost cost cost cost Cross entropy loss <s> African elephants are smart </s> Loss function Gradients updates through backpropagation How to calculate the cross entropy loss?
  • 51.
    NLG Head toToe @hadyelsahar Cross Entropy Claude Shannon Page on wikipedia The “surprisal” of PӨ (empirical distribution) for samples generated from D (the true data distribution). Cross Entropy can also be seen as a “closeness" measure between two distributions. Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951. Cross entropy direction matters
  • 52.
    NLG Head toToe @hadyelsahar Cross Entropy loss (-ve log likelihood) cost Output distribution of tokens in the vocab By your RNN at time step t yn t i = 1 if i is the correct token yn t i = 0 if i is not the correct token yn t i Training Example Position in the vocab , logits, probability vectors or 1 hot vectors Time step t pn t i prob of the model to the token i in example n and time step t Let’s simplify the notation: yn t i_correct = yn t = 1 pn t i_correct = pn t No need to write Prob of correct token by the model Prob of correct token given previous context Prob of correct sequence Negative Log Likelihood N = number of training examples
  • 53.
    NLG Head toToe @hadyelsahar Maximum log likelihood loss Negative Log Likelihood Minimize loss → minimize Negative log likelihood → Maximum log likelihood Estimation (MLE) This objective is usually written like that Probability of the true sequence y given a language model parameterized by Ө Find the parameters Ө that minimizes the -ve log likelihood Usually done using SGD
  • 54.
    NLG Head toToe @hadyelsahar Practical Tips (parameter sharing) Wi Wo 1 hot encoding of Words Embedding layer Word embeddings |V| x Hembed 1 x |V| 1 x Hembed 1 x |V| softmax ŷt V 1 x |V| Output Embedding layer |V| x Hstate + Hidden States ht 1 x Hstate Previous hidden sate ht-1 logits RNN Embedding layer |V| x Hembed Output Embedding layer |V| x Hstate For simplicity the figure doesn’t include the bias terms b and c xt U yi t x. t yt Prob. of word i at time step t given by the model Wi Wo FYI - Break It is a common practice to unify the embedding sizes Across the whole network. Hstate = Hembed This makes both input / output embedding layers have the same dimensionality. You can tie their weights to reduce parameter size of the RNN . (this is called weight tying / parameter sharing) Share both as one matrix
  • 55.
    NLG Head toToe @hadyelsahar RNNs enjoy great flexibility FYI - Break http://karpathy.github.io/2015/05/21/rnn-effectiveness/
  • 56.
    NLG Head toToe @hadyelsahar The last “hidden state” of a Recurrent Neural Networks could be a sentence repsententation, that can be used later for many tasks e.g. Classification. FYI - Break <s> African elephants are smart Feed Fwd Neural Network Sentiment = positive Classification loss
  • 57.
    NLG Head toToe @hadyelsahar You can Stack Recurrent Neural Networks FYI - Break RNN RNN RNN RNN RNN RNN RNN RNN Layer 2 RNN runs over a sequence of “hidden states” (not softmax output) of Layer 1 RNNs
  • 58.
    NLG Head toToe @hadyelsahar Can I use this RNN to generate text? Now we know how to model text probabilities using Recurrent Neural Networks. Decoding | Inference After all, it is called Neural Language “Generation”
  • 59.
    NLG Head toToe @hadyelsahar Let’s see a demo !! Demo Break Write With Transformer https://transformer.huggingface.co/doc/gpt https://beta.openai.com/playground
  • 60.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Language Modeling - Introduction to Language Modeling - N-gram Language Models - Neural Language models - Recurrent Neural Networks (RNN) - Generating text from Language Models - Greedy Decoding - Temperature Scaling - Top-k / Nucleus Sampling - LM Evaluation: - Cross Entropy, Perplexity
  • 61.
    NLG Head toToe @hadyelsahar Decoding | Inference <s> RNNs output a categorical distribution over Tokens in the vocab at each step pӨ ( * | <s>) Elephants I He They …. smart animals Giraffes Think </s> …. …. 0.4 0.03 0.02 0.001 …. 0.02 0.001 0.016 0.2 0.001 …. …. Auto-regressive decoding
  • 62.
    NLG Head toToe @hadyelsahar Decoding | Inference RNNs output a Multinomial distribution over Tokens in the vocab at each step 1) Select next token (we will see later different selection methods) Elephants He They I …. smart animals Giraffes Think </s> …. …. 0.4 0.03 0.02 0.001 …. 0.02 0.001 0.016 0.2 0.001 …. …. <s> Auto-regressive decoding
  • 63.
    NLG Head toToe @hadyelsahar Decoding | Inference RNNs output a Multinomial distribution over Tokens in the vocab at each step 1) Select next token Elephants He They I …. smart animals Giraffes Think </s> …. …. 0.4 0.03 0.02 0.001 …. 0.02 0.001 0.016 0.2 0.001 …. …. <s> Elephants Auto-regressive decoding
  • 64.
    NLG Head toToe @hadyelsahar Decoding | Inference RNNs output a Multinomial distribution over Tokens in the vocab at each step 2) Feed selected token (auto-regressively) to the RNN and calculate pӨ ( * | <s>, I) Elephants He They I …. are animals Giraffes Think </s> …. …. 0.0002 0.03 0.02 0.001 …. 0.2 0.001 0.016 0.2 0.001 …. …. <s> Elephants Elephants Auto-regressive decoding
  • 65.
    NLG Head toToe @hadyelsahar Decoding | Inference RNNs output a Multinomial distribution over Tokens in the vocab at each step 3) Select next token Elephants He They I …. are animals Giraffes Think </s> …. …. 0.0002 0.03 0.02 0.001 …. 0.2 0.001 0.016 0.2 0.001 …. …. <s> Elephants Elephants Auto-regressive decoding are
  • 66.
    NLG Head toToe @hadyelsahar Decoding | Inference RNNs output a Multinomial distribution over Tokens in the vocab at each step Repeat the process until end of sequence token </s> or max length is reached. Auto-regressive decoding <s> Elephants Elephants are
  • 67.
    NLG Head toToe @hadyelsahar Decoding | Inference RNNs output a Multinomial distribution over Tokens in the vocab at each step Repeat the process until end of sequence token </s> or max length is reached. Auto-regressive decoding <s> Elephants are Elephants are smart
  • 68.
    NLG Head toToe @hadyelsahar Decoding | Inference RNNs output a Multinomial distribution over Tokens in the vocab at each step Repeat the process until end of sequence token </s> or max length is reached. Auto-regressive decoding <s> Elephants are smart Elephants are smart animals
  • 69.
    NLG Head toToe @hadyelsahar Decoding | Inference RNNs output a Multinomial distribution over Tokens in the vocab at each step Repeat the process until end of sequence token </s> or max length is reached. Auto-regressive decoding <s> Elephants are smart animals Elephants are smart animals </s>
  • 70.
    NLG Head toToe @hadyelsahar Decoding | Inference How to “Select” the next token given a categorical distribution over Tokens in the vocab. Auto-regressive decoding Elephants He They I …. are animals Giraffes Think </s> …. …. 0.0002 0.03 0.02 0.001 …. 0.2 0.001 0.016 0.2 0.001 …. …. The maximum value?
  • 71.
    NLG Head toToe @hadyelsahar Decoding | Inference How to “Select” the next token given a categorical distribution over Tokens in the vocab. Auto-regressive decoding Local Global Deterministic Stochastic / Random MAP Beam Search Ancestrall (pure) Sampling Greedy decoding Top k Sampling Nucleus Sampling Sampling with Temperature
  • 72.
    NLG Head toToe @hadyelsahar Decoding | Inference How to “Select” the next token given a categorical distribution over Tokens in the vocab. Auto-regressive decoding Local Global Deterministic Stochastic / Random Greedy decoding MAP Beam Search Ancestrall (pure) Sampling Top k Sampling Nucleus Sampling Sampling with Temperature
  • 73.
    NLG Head toToe @hadyelsahar Greedy Decoding Select the token with max probability Elephants He They I …. are animals Giraffes Think </s> …. …. 0.0002 0.03 0.02 0.001 …. 0.25 0.001 0.016 0.2 0.001 …. ….
  • 74.
    NLG Head toToe @hadyelsahar Question Time During Greedy decoding, at each step the most likely token is selected: Will that generate the highest likely sequence overall? a) Yes, always b) No (but it could happen) c) Never
  • 75.
    NLG Head toToe @hadyelsahar Answer “local” vs “Global” likelihood in sequence generation. a b a b * 0.7 0.3 0.55 0.45 a b a b 0.45 Imagine vocabulary of two tokens “a” and “b” Run greedy decoding for 3 time steps. Selected sequence “a b b” P(“a b b”) = 0.6 * 0.55 * 0.55 = 0.1815 Other sequences could have globally higher probability: P(“b a b”) = 0.4 * 0.8 * 0.9 = 0.288 a b 0.1 0.2 0.8 0.9 0.55
  • 76.
    NLG Head toToe @hadyelsahar Question Time Given a trained Language model pӨ If we run greedy decoding for the context: “<s>” until “</s>” is obtained. We repeat this process 1000 times. How many unique sequences will be obtained? a) 1000 b) Infinity c) 1 d) 42 e) 75000
  • 77.
    NLG Head toToe @hadyelsahar Decoding | Inference How to “Select” the next token given a categorical distribution over Tokens in the vocab. Auto-regressive decoding Local Global Deterministic Stochastic / Random MAP Beam Search Ancestrall (pure) Sampling Greedy decoding Top k Sampling Nucleus Sampling Sampling with Temperature
  • 78.
    NLG Head toToe @hadyelsahar Ancestral (Pure) Sampling Also called “Pure” Sampling, Standard Sampling, or just Sampling. Sampling is stochastic (random) ≠ deterministic Elephants He They I …. are animals Giraffes Think </s> …. …. 0.0002 0.03 0.02 0.001 …. 0.25 0.001 0.016 0.2 0.001 …. …. Pure sampling will obtain “unbiased” samples. I.e. distribution of generated sequences matches the Language model distribution over sequences.
  • 79.
    NLG Head toToe @hadyelsahar Question Time During Pure Sampling, at each step x is sampled: Will that generate the highest likely sequence overall? a) Yes, always b) No (but it could happen) c) Never
  • 80.
    NLG Head toToe @hadyelsahar Question Time Given the following conditional probabilities of a trained language model, If we run pure sampling 10000 times and Greedy 1000 times. How many times the sequence “b b” will be obtained: a) 0 using Greedy & 100 using sampling b) 100 using Greedy & 10 using sampling c) 10 times Greedy & 10 using sampling d) 0 times Greedy & 10000 using sampling a b a b * 0.9 0.55 0.45 a b 0.1 0.9 0.1 Start of sequence (empty) p(b| *) p(b| b * ) p(a| *) p(a| a *) p(b | a *)
  • 81.
    NLG Head toToe @hadyelsahar Ancestral (Pure) Sampling Pros - Diversity in generations (not always the same sequence) - Generated samples reflect the Language Model probability distribution of sequences (Unbiased). Cons Pure sampling sometimes lead to incoherent text Fig. from THE CURIOUS CASE OF NEURAL TEXT DeGENERATION (hotlzman et al. 2020)
  • 82.
    NLG Head toToe @hadyelsahar Decoding | Inference How to “Select” the next token given a categorical distribution over Tokens in the vocab. Auto-regressive decoding Local Global Deterministic Stochastic / Random MAP Beam Search Ancestrall (pure) Sampling Greedy decoding Top k Sampling Nucleus Sampling Sampling with Temperature
  • 83.
    NLG Head toToe @hadyelsahar Sampling with Temperature Lowering (< 1 ) the temperature of the softmax will make the the distribution peakier I.e. less likely to sample from unlikely candidates Higher temperature produces a softer probability distribution over tokens, resulting in more diversity and also more mistakes. Divide the logits by a temperature (Constant) value T As T decreases ( yi / T ) increases |V| x 1 |V| x 1 softmax These are usually called logits Prob distribution over words in vocab p(xt | x<t ) = 0.1 ➗ T Good read on the topic https://medium.com/@majid.ghafouri/why-should-we-use-temperature-in-softmax-3709f4e0161
  • 84.
    NLG Head toToe @hadyelsahar Sampling with Temperature Try it yourself: https://lena-voita.github.io/nlp_course/language_modeling.html#generation_strategies_temperature Lowering (< 1 ) the temperature of the softmax will make the the distribution peakier I.e. less likely to sample from unlikely candidates
  • 85.
    NLG Head toToe @hadyelsahar Sampling with Temperature Try it yourself: https://lena-voita.github.io/nlp_course/language_modeling.html#generation_strategies_temperature Higher temperature (> 1) produces a softer probability distribution over tokens, resulting in more diversity and also more mistakes.
  • 86.
    NLG Head toToe @hadyelsahar Let’s see a demo !! Demo Write With Transformer https://transformer.huggingface.co/doc/gpt https://beta.openai.com/playground
  • 87.
    NLG Head toToe @hadyelsahar Top-k Sampling Hierarchical Neural Story Generation (Fan et al. 2018) Elephants He They I are animals Giraffes Think </s> ……. …... 0.3 0.25 0.1 0.1 0.03 0.02 0.01 0.01 0.01 …. …. K = 4 0.3 0.25 0.1 0.1 0.399 0.333 0.133 0.133 normalize Sample At each timestep, randomly sample from the k most likely candidates from the token distribution He
  • 88.
    NLG Head toToe @hadyelsahar Top-k Sampling Hierarchical Neural Story Generation (Fan et al. 2018) Elephants He They I are animals Giraffes Think </s> ……. …... 0.3 0.25 0.1 0.1 0.03 0.02 0.01 0.01 0.01 …. …. K = 4 A fixed size k in Top-k sampling is not always a good idea: Elephants He They I are animals Giraffes Think </s> ……. …... 0.6 0.3 0.02 0.02 0.02 0.02 0.01 0.01 0.01 …. …. K = 4
  • 89.
    NLG Head toToe @hadyelsahar Top-p (Nucleus) Sampling Elephants He They I are animals Giraffes Think </s> ……. …... 0.3 0.2 0.1 0.05 0.03 0.02 0.01 0.01 0.01 …. …. p = 0.6 0.3 0.2 0.1 0.5 0.333 0.166 normalize Sample Elephants Sample from the top p % of the probability mass THE CURIOUS CASE OF NEURAL TEXT DeGENERATION (hotlzman et al. 2020)
  • 90.
    NLG Head toToe @hadyelsahar Top-p (Nucleus) Sampling Elephants He They I are animals Giraffes Think </s> ……. …... 0.3 0.3 0.5 0.1 0.03 0.02 0.01 0.01 0.01 …. …. Top-p = Adaptive top-k Elephants He They I are animals Giraffes Think </s> ……. …... 0.4 0.4 0.02 0.02 0.02 0.02 0.01 0.01 0.01 …. …. p = 0.8 p = 0.8 THE CURIOUS CASE OF NEURAL TEXT DeGENERATION (hotlzman et al. 2020)
  • 91.
    NLG Head toToe @hadyelsahar Decoding | Inference How to “Select” the next token given a categorical distribution over Tokens in the vocab. Auto-regressive decoding Local Global Deterministic Stochastic / Random MAP Beam Search Ancestrall (pure) Sampling Greedy decoding Top k Sampling Nucleus Sampling Sampling with Temperature Next Lecture!
  • 92.
    NLG Head toToe @hadyelsahar Demo again! Demo Break Write With Transformer https://transformer.huggingface.co/doc/gpt https://beta.openai.com/playground
  • 93.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Language Model Evaluation & More about “Cross-Entropy”
  • 94.
    NLG Head toToe @hadyelsahar Language Modeling Evaluation For classification Higher Accuracy is always better: ● Accuracy = 80% Better than 60% But for language modeling? A good read on the topic by Chip Huyen: https://thegradient.pub/understanding-evaluation-metrics-for-language-models/ Language Model Is that good? P(x) = 0.001 Test set x
  • 95.
    NLG Head toToe @hadyelsahar Language Modeling Evaluation Intrinsic Metrics ● perplexity ● cross entropy ● bits-per-character (BPC) A good read on the topic by Chip Huyen: https://thegradient.pub/understanding-evaluation-metrics-for-language-models/ Extrinsic Metrics Language Model X ~ PӨ Grammatically correct Fluent Coherent What are these?
  • 96.
    NLG Head toToe @hadyelsahar Background: Entropy (information theory) Claude Shannon Page on wikipedia Imagine a process that generates samples e.g. Language Model PӨ that generates a sequence xi ~ PӨ –log(PӨ (xi )) is the “surprisal” Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951. PӨ (xi ) –log(PӨ (xi )) If a model generates samples with low probability it will be have high surprisal on them Samples with higher probability the model is confident about them (i.e. low surprisal)
  • 97.
    NLG Head toToe @hadyelsahar Background: Entropy (information theory) Claude Shannon Page on wikipedia Imagine a process that generates samples e.g. Language Model PӨ that generates a sequence xi Entropy is the Expected level of “surprisal” Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951. Entropy is usually denoted by H How on average the model is surprised by its samples. I.e. how unconfident it is about its samples This is expectations = “mean value” Expectations in theory is calculated using infinite samples or closed form but could be approximated using large N samples
  • 98.
    NLG Head toToe @hadyelsahar Background: Entropy (information theory) Claude Shannon Page on wikipedia What does it mean that your model has low Entropy? Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951. Q : Imagine a language model that generates only one sample x= “elephants are smart” with PӨ (x) = 1 , what is the entropy of this language model H(PӨ ) ? a) 1 b) zero c) infinity ● Low entropy tells you that your model is not random (i.e. learned something) ● But it could be confident about the wrong things.
  • 99.
    NLG Head toToe @hadyelsahar Background: Entropy (information theory) Claude Shannon Page on wikipedia What does it mean that your model has low Entropy? Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951. Q : Imagine a language model that generates only one sample x= “elephants are smart” with PӨ (x) = 1 , what is the entropy of this language model H(PӨ ) ? a) 1 b) zero c) infinity No “surprisal” here the model is so confident about the only example it generates!
  • 100.
    NLG Head toToe @hadyelsahar Background: Entropy (information theory) Claude Shannon Page on wikipedia What does it mean that your model has low Entropy? Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951. Q : Imagine a language model that generates only one sample x= “elephants are smart” with PӨ (x) = 1 , what is the entropy of this language model H(PӨ ) ? a) 1 b) zero c) infinity No “surprisal” here the model is so confident about the only example it generates! What did you teach us about it if it is a bad metric!
  • 101.
    NLG Head toToe @hadyelsahar Cross Entropy Claude Shannon Page on wikipedia The “surprisal” of PӨ for samples generated from D (the true Language distribution). Cross Entropy can also be seen as a “closeness" measure between two distributions. Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951. Entropy Cross entropy direction matters
  • 102.
    NLG Head toToe @hadyelsahar Cross Entropy rate per token Given that we are interested in sentences (sequences of tokens) of length n, we will use the the entropy rate per token: Cross entropy rate per token Cross entropy This equality is true on in the limit when n is infinitely long, more details you can see Shannon-McMillan-Breiman theorem We can estimate the cross-entropy measuring model log prob on a random sample of sentences or a very large chunk of text. How do we know the probability of the true prob. Of a sentence in the whole language? Large number of random samples
  • 103.
    NLG Head toToe @hadyelsahar Which log to use ? In all the previous theory, the entropy and cross entropy are defined using log base 2 (with "bit" as the unit), “Popular machine learning frameworks, implement cross entropy loss using natural log. As it is faster to compute natural log as opposed to log base 2.” It is often not reported in papers which log they use, but mostly it is safe to assume the “natural log” Source: https://thegradient.pub/understanding-evaluation-metrics-for-language-models/ Natural log base e Log base 2 is used in “information” theory due to its relation with bits and bytes
  • 104.
    NLG Head toToe @hadyelsahar Bits per Character BPC Cross entropy rate per word /char / token Bits per (character | Token | word) BPC BPT BPW If log base 2 is used, Cross-Entropy per word becomes BPW
  • 105.
NLG Head to Toe @hadyelsahar (Cross) Entropy and Compression. Cross-entropy rate per token. Example from: https://www.inf.ed.ac.uk/teaching/courses/fnlp/lectures/04_slides-2x2.pdf Imagine two language models P_θ and P_ω with vocabulary size 10000, and a random language model R. Using each language model, compute the per-word cross-entropy of our "very large" text D = "Elephants are smart": CE(D, P_θ) = ?, CE(D, P_ω) = ?, CE(D, R) = ?. What do these numbers mean?
  • 106.
NLG Head to Toe @hadyelsahar (Cross) Entropy and Compression. Example from: https://www.inf.ed.ac.uk/teaching/courses/fnlp/lectures/04_slides-2x2.pdf Using each language model, the per-word cross-entropy of our "very large" text D = "Elephants are smart" is: CE(D, P_θ) = 5, CE(D, P_ω) = 11, CE(D, R) = 13.287. Language models encode text statistics that can be used for compression. If we designed an optimal code based on each model, we could encode the entire sentence in about: P_θ → 5 x 3 = 15 bits; P_ω → 11 x 3 = 33 bits; R → 13.287 x 3 = 39.861 bits. ASCII uses an average of 24 bits per word → 24 x 3 = 72 bits.
  • 107.
NLG Head to Toe @hadyelsahar Perplexity (PPL). LM performance is often reported as perplexity rather than cross-entropy. Perplexity is simply 2^cross-entropy (if the cross-entropy uses log base 2) or e^cross-entropy (if it uses the natural log). A cross-entropy of 6 bits means our model's perplexity is 2^6 = 64: the same uncertainty as a uniform distribution over 64 outcomes, i.e. as if the language model were choosing uniformly among 64 options at each time step. Reminder: cross-entropy per token H(D, P_θ) ≈ -(1/n) Σ_i log P_θ(x_i | x_<i). Example from: https://www.inf.ed.ac.uk/teaching/courses/fnlp/lectures/04_slides-2x2.pdf
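A quick sketch of the relation (the cross-entropy value is illustrative):

```python
import math

ce_bits = 6.0                      # cross-entropy per token in bits (log base 2)
ce_nats = ce_bits * math.log(2)    # the same quantity in nats (natural log)

ppl_from_bits = 2 ** ce_bits
ppl_from_nats = math.exp(ce_nats)
# Both are ~64: perplexity does not depend on the log base, as long as the
# exponentiation matches the base used for the cross-entropy.
print(f"{ppl_from_bits:.2f} {ppl_from_nats:.2f}")
```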
  • 108.
NLG Head to Toe @hadyelsahar How to interpret Cross-Entropy / Perplexity / BPC? As with all evaluation metrics: the model could be good, or the corpus could simply be too easy. Only use these numbers to compare different models on the same corpus. FYI - Break. Comparison of GPT-2 model sizes on the language modeling objective. Language Models are Unsupervised Multitask Learners (Radford et al. 2019)
  • 109.
NLG Head to Toe @hadyelsahar Entropy of the English language: Entropy is the average number of bits needed to encode the information contained in a random variable. CrossEntropy(D, P) is the average number of bits needed to encode the information contained in a random variable D using a code based on P. The entropy (amount of information) of the English language has long been a popular topic among linguists and computer scientists. FYI - Break
  • 110.
NLG Head to Toe @hadyelsahar Compression of the English language: "The Hutter Prize is a cash prize funded by Marcus Hutter which rewards data compression improvements on enwik9, the first 1,000,000,000 characters of a specific version of English Wikipedia. The prize awards 5000 euros for each one percent improvement (with 500,000 euros total funding)." https://en.wikipedia.org/wiki/Hutter_Prize FYI - Break
  • 111.
NLG Head to Toe @hadyelsahar Perplexity. Short Break
  • 112.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar - Conditional Language Model - Seq2seq Encoder-Decoder Models - Seq2seq Encoder-Decoder with Attention - Transformers (Self-Attention) - Open Problems of NLG Models trained on uncurated Web text Part 2: Stuff with Attention
  • 113.
NLG Head to Toe @hadyelsahar P("The cat sits on the mat") = 0.0001 Language Modeling P("I am hungry" | "J'ai Faim") = 0.1 (Conditional) Language Modeling. Conditional Language Modeling: model probabilities of sequences of words conditioned on a context. Context can be anything: ● Text ● Text in another language ● Image ● Speech. We know: - How to calculate probabilities of sequences - Recurrent Neural Networks - How to decode (generate) sequences
  • 114.
    NLG Head toToe @hadyelsahar Conditional Language Modeling P(“I am hungry” | “J’ai Faim”) = P(I | J’ai Faim) x P( am | I , J’ai Faim) x P( hungry | I am , J’ai Faim) Chain Rule applies
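A tiny sketch of the chain rule on this example; the per-step probabilities are made up so that their product matches the 0.1 quoted above.

```python
import math

# Hypothetical per-step conditional probabilities P(y_t | y_<t, x) for x = "J'ai Faim".
step_probs = {
    ("I",):                0.5,   # P(I | x)
    ("I", "am"):           0.5,   # P(am | I, x)
    ("I", "am", "hungry"): 0.4,   # P(hungry | I am, x)
}

# Chain rule: P(y | x) = product over t of P(y_t | y_<t, x)
p_sequence = math.prod(step_probs.values())
print(p_sequence)  # 0.1, matching P("I am hungry" | "J'ai Faim") = 0.1 on the slide
```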
  • 115.
NLG Head to Toe @hadyelsahar Training: Conditional Language Modeling. Context / input x; previously generated tokens y_<t. Maximum likelihood estimation: the cross-entropy loss (negative log-likelihood), L(θ) = -Σ_t log P_θ(y_t | y_<t, x), still holds as the training objective.
  • 116.
NLG Head to Toe @hadyelsahar Decoding / Inference. All generation techniques can work in theory; some are preferred over others. Two axes: local vs. global, and deterministic vs. stochastic / random. Deterministic: greedy decoding (local), MAP via beam search (global). Stochastic: ancestral (pure) sampling, top-k sampling, nucleus sampling, sampling with temperature.
  • 117.
NLG Head to Toe @hadyelsahar Decoding / Inference. All generation techniques can work in theory; some are preferred over others. In many tasks, such as machine translation, we care more about accuracy than diversity, i.e. finding the globally most likely sequence for an input. P("I am hungry" | "J'ai Faim") = 0.1 P("I am happy" | "J'ai Faim") = 0.002 P("I was hungry" | "J'ai Faim") = 0.0002 P("he am hungry" | "J'ai Faim") = 0.00001 Only one output translation y will be produced for each input x, but that is fine if it is correct.
  • 118.
NLG Head to Toe @hadyelsahar Language Modeling vs. Conditional Language Modeling: the modeling is the same (probabilities of token sequences, with or without a context), the training objective is the same (maximum likelihood / cross-entropy), and the decoding / inference techniques are the same: greedy, ancestral sampling, beam search, top-k, nucleus sampling.
  • 119.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar - Conditional Language Model - Seq2seq Encoder-Decoder Models - Seq2seq Encoder-Decoder with Attention - Transformers (Self-Attention) - Open Problems of NLG Models trained on uncurated Web text Part 2: Stuff with Attention
  • 120.
    NLG Head toToe @hadyelsahar Encoder-Decoder <s> j’ ai faim RNN RNN RNN RNN RNN RNN RNN RNN <s> am hungry I Initialize the decoder hidden state With the encoder final hidden state Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014) Encoder Decoder
  • 121.
NLG Head to Toe @hadyelsahar Encoder-Decoder. <s> j' ai faim → encoder RNN; <s> I am hungry → decoder RNN → I am hungry </s>, with a cost (cross-entropy loss) at each decoder step and gradient updates through backpropagation. In this architecture the output (softmax) layer of the encoder RNN is not used. Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
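For readers who like code, here is a minimal PyTorch sketch of this encoder-decoder training setup (not the paper's configuration: sizes, GRU cells instead of LSTMs, and the random token ids are all illustrative).

```python
import torch
import torch.nn as nn

# A minimal encoder-decoder sketch in the spirit of Sutskever et al. 2014.
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, d_emb=64, d_hid=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_emb)
        self.encoder = nn.GRU(d_emb, d_hid, batch_first=True)
        self.decoder = nn.GRU(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, tgt_vocab)   # softmax is folded into the loss below

    def forward(self, src, tgt_in):
        _, h_final = self.encoder(self.src_emb(src))          # keep only the final encoder state
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), h_final)  # init decoder with it
        return self.out(dec_states)                            # logits over the target vocabulary

model = Seq2Seq()
src = torch.randint(0, 1000, (2, 4))       # e.g. "<s> j' ai faim"
tgt_in = torch.randint(0, 1000, (2, 4))    # e.g. "<s> I am hungry"
tgt_out = torch.randint(0, 1000, (2, 4))   # e.g. "I am hungry </s>" (shifted by one step)
logits = model(src, tgt_in)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), tgt_out.reshape(-1))
loss.backward()                             # gradient updates through backpropagation
```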
  • 122.
NLG Head to Toe @hadyelsahar FYI - Break. You could in principle use the same RNN as both encoder and decoder; however, the original paper (Sutskever et al. 2014) uses two separate RNNs "because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the RNN on multiple language pairs simultaneously." Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
  • 123.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Part 2: Stuff with Attention - Conditional Language Model - Seq2seq Encoder-Decoder Models - Seq2seq Encoder-Decoder with Attention - Transformers (Self-Attention) - Open Problems of NLG Models trained on uncurated Web text
  • 124.
NLG Head to Toe @hadyelsahar Encoder-Decoder (limitations). <s> j' ai faim → encoder RNN; <s> I am hungry → decoder RNN → I am hungry </s>. The last encoder hidden state is a single vector representing the whole input: a bottleneck! The encoder cannot compress the whole sentence into one vector. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties (Conneau et al. ACL 2018)
  • 125.
    NLG Head toToe @hadyelsahar Attention Mechanism <s> j’ ai faim RNN RNN RNN RNN I cost RNN <s> NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015) Effective Approaches to Attention-based Neural Machine Translation (luong et al. 2016) Attention Feed fwd Weighted average of all encoder hidden states Decoder hidden state is used as a query
  • 126.
    NLG Head toToe @hadyelsahar Attention Mechanism <s> j’ ai faim RNN RNN RNN RNN I am cost cost RNN RNN <s> I Attention Weighted average of all encoder hidden states Decoder hidden state is used as a query Feed fwd NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015) Effective Approaches to Attention-based Neural Machine Translation (luong et al. 2016)
  • 127.
    NLG Head toToe @hadyelsahar Attention Mechanism <s> j’ ai faim RNN RNN RNN RNN I am hungry cost cost cost RNN RNN RNN <s> am I Attention Weighted average of all encoder hidden states Decoder hidden state is used as a query Feed fwd NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015) Effective Approaches to Attention-based Neural Machine Translation (luong et al. 2016)
  • 128.
    NLG Head toToe @hadyelsahar Attention Mechanism <s> j’ ai faim RNN RNN RNN RNN I am hungry </s> cost cost cost cost RNN RNN RNN RNN <s> am hungry I Attention Weighted average of all encoder hidden states Decoder hidden state is used as a query Feed fwd NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015) Effective Approaches to Attention-based Neural Machine Translation (luong et al. 2016)
  • 129.
NLG Head to Toe @hadyelsahar Attention Mechanism (Luong et al. 2016). Query (h_t): the hidden state of the decoder at time step t. Keys and Values: the hidden states of the encoder, h̄_s. Attention weights: α_t(s) = softmax over s of score(h_t, h̄_s). Attention output (context vector c_t): the weighted sum of the values according to the attention weights, c_t = Σ_s α_t(s) h̄_s, which is combined with h_t to predict the decoder output.
  • 130.
    NLG Head toToe @hadyelsahar <s> j’ ai faim RNN RNN RNN RNN RNN <s> Feed fwd ht Attention Mechanism (Luong et al. 2016) ct
  • 131.
NLG Head to Toe @hadyelsahar Attention Mechanism (Luong et al. 2016). Three types of attention score functions: dot: score(h_t, h̄_s) = h_t⊤ h̄_s; general: score(h_t, h̄_s) = h_t⊤ W_a h̄_s (W_a trainable); concat: score(h_t, h̄_s) = v_a⊤ tanh(W_a [h_t ; h̄_s]) (W_a and v_a trainable). The output of each score function is a single float number.
  • 132.
NLG Head to Toe @hadyelsahar Attention Mechanism (Luong et al. 2016). Three types of attention score functions: dot: score(h_t, h̄_s) = h_t⊤ h̄_s; general: score(h_t, h̄_s) = h_t⊤ W_a h̄_s (W_a trainable); concat: score(h_t, h̄_s) = v_a⊤ tanh(W_a [h_t ; h̄_s]) (W_a and v_a trainable). The output of each score function is a single float number. Global, local, and location-based attention are three other variants; read about them in the paper. Effective Approaches to Attention-based Neural Machine Translation (Luong et al. 2016)
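A small PyTorch sketch of the three score functions; the dimensions and the weight names W_a, W_c, v_a are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

d = 8                                  # hidden size (illustrative)
h_t = torch.randn(d)                   # decoder hidden state at step t (the query)
h_s = torch.randn(d)                   # one encoder hidden state (a key)

# dot:     score = h_t . h_s
dot = h_t @ h_s

# general: score = h_t . (W_a h_s), with W_a trainable
W_a = nn.Linear(d, d, bias=False)
general = h_t @ W_a(h_s)

# concat:  score = v_a . tanh(W_c [h_t ; h_s]), with W_c and v_a trainable
W_c = nn.Linear(2 * d, d, bias=False)
v_a = nn.Linear(d, 1, bias=False)
concat = v_a(torch.tanh(W_c(torch.cat([h_t, h_s])))).squeeze()

# Each score is a single float; a softmax over all source positions s gives the weights a_t(s).
print(dot.item(), general.item(), concat.item())
```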
  • 133.
NLG Head to Toe @hadyelsahar FYI - Break. In machine translation, visualizing attention weights is a common practice: it shows which words in the source sentence are important for the output of each word in the target (alignments). These alignments are learned end-to-end, without explicit alignments between tokens in the source x and target y sentences. NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Bahdanau et al. 2015) Effective Approaches to Attention-based Neural Machine Translation (Luong et al. 2016)
  • 134.
NLG Head to Toe @hadyelsahar FYI - Break. Seq2Seq models are one of the landmarks of the deep learning revolution in NLP, but they were soon taken over by self-attention (Transformers), which we will see in the next slides. Timeline: Sutskever et al. 14 → Bahdanau et al. 15, Hermann et al. NeurIPS 2015, Rush et al. EMNLP 2015 → Luong et al. 16 → Transformers 2017 → GPT 2018 → 2019, 2020, 2021.
  • 135.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Part 2: Stuff with Attention - Conditional Language Model - Seq2seq Encoder-Decoder Models - Seq2seq Encoder-Decoder with Attention - Transformers (Self-Attention) - Open Problems of NLG Models trained on uncurated Web text
  • 136.
NLG Head to Toe @hadyelsahar Transformers: "Attention Is All You Need" (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser; NeurIPS 2017). Key ingredients: self-attention, encoder-decoder attention, residual (skip) connections, positional encoding. Comparison (idea from lena-voita.github.io/nlp_course): Seq2seq encodes the input with an RNN, decodes with an RNN, and the encoder-decoder interaction is a fixed vector; Seq2seq with attention encodes with an RNN, decodes with an RNN, and the interaction is attention; Transformers encode with attention, decode with attention, and the interaction is attention. N encoder layers and N decoder layers; the last encoder layer is connected to all decoder layers with encoder-decoder attention.
  • 137.
NLG Head to Toe @hadyelsahar Transformers: "Attention Is All You Need" (Vaswani et al., NeurIPS 2017). Lots of great resources to learn about Transformers: - The original blog post from Google - Lena Voita's NLP For You - Michael Phi's Illustrated Guide to Transformers - Jay Alammar's The Illustrated Transformer - The Annotated Transformer (Alexander Rush, Vincent Nguyen and Guillaume Klein) - Karpathy's minGPT. Main ideas: - Replace RNNs with attention - Represent sequence order using positional encoding - Two types of attention: self-attention (new!) and encoder-decoder attention - Multiple heads for attention - Skip (residual) connections allow stacking a larger number of layers. N encoder layers and N decoder layers; the last encoder layer is connected to all decoder layers with encoder-decoder attention.
  • 138.
    NLG Head toToe @hadyelsahar Transformers Encoder decoder attention Self attention Decoder Self attention Multi-Head Attention Multi-Head Attention consists of several attention layers running in parallel. Each one is a “scaled dot-product attention” The holy grail of the transformers
  • 139.
    NLG Head toToe @hadyelsahar Query Key Value “The concepts come from retrieval systems. The search engine will map your query against a set of keys associated with candidate results in the database, then present you the best matched videos (values).” https://stats.stackexchange.com/a/424127/22327 Multi-head Attention
  • 140.
    NLG Head toToe @hadyelsahar Multi-head Attention Encoder Self attention Encoder decoder attention Decoder Self attention 3 instances of Multi-head attention: 1. Encoder self-attention 2. Decoder Self-attention (Masked) 3. Encoder-Decoder Attention
  • 141.
NLG Head to Toe @hadyelsahar Multi-head Attention: Encoder Self-attention. "Self-attention" in the encoder: represent each token of the encoder input by attending to the other tokens of the encoder input. Instead of encoding the whole input with an RNN, tokens are allowed to look at each other.
  • 142.
NLG Head to Toe @hadyelsahar Multi-head Attention: Decoder Self-attention. "Self-attention" in the decoder: represent each previously generated token in the decoder by attending to the other tokens in the decoder (the output generated so far). Previously, this role was played by the RNN decoder hidden state.
  • 143.
NLG Head to Toe @hadyelsahar Multi-head Attention: Encoder-decoder attention. Represent each decoder token by attending to the encoder tokens (the input). Previously, this was done by using the RNN decoder hidden state as a query for attention over the encoder's input representation.
  • 144.
    NLG Head toToe @hadyelsahar Multi-head Attention (Deeper look) Still how does multi-head attention work?
  • 145.
NLG Head to Toe @hadyelsahar Multi-head Attention (Deeper look). Step 1: Linear projection of the Queries, Keys and Values with the weight matrices W_q, W_k and W_v, giving Q, K and V.
  • 146.
    NLG Head toToe @hadyelsahar Multi-head Attention (Deeper look) Step 2: Dot product of Queries and Keys K Q
  • 147.
NLG Head to Toe @hadyelsahar Multi-head Attention (Deeper look). Step 2: Dot product of Queries and Keys. Q x K⊤ = attention scores: a matrix with one row per query token and one column per key token (here the tokens "elephants", "are", "smart"), e.g. [[2, 5, 3], [5, 1, 4], [3, 4, 3]].
  • 148.
NLG Head to Toe @hadyelsahar Multi-head Attention (Deeper look). Step 3: Scale down the attention scores: divide by the square root of the dimension d_k of the queries and keys. "We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients."
  • 149.
NLG Head to Toe @hadyelsahar Multi-head Attention (Deeper look). Step 4: Softmax of the scaled scores, applied across the key dimension (i.e. over each row of the score matrix).
  • 150.
NLG Head to Toe @hadyelsahar Multi-head Attention (Deeper look). Step 4 (continued): after the softmax, each row of the attention matrix sums to 1, e.g. [[0.2, 0.5, 0.3], [0.5, 0.1, 0.4], [0.3, 0.4, 0.3]].
  • 151.
NLG Head to Toe @hadyelsahar Multi-head Attention (Deeper look). Step 5: Multiply the softmax weights by the values V. With attention weights [[0.2, 0.5, 0.3], [0.5, 0.1, 0.4], [0.3, 0.4, 0.3]] and value vectors V1, V2, V3, we end up with 3 output vectors, one corresponding to each of Q1, Q2 and Q3; each is a weighted average of the value vectors according to the attention weights.
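Putting steps 1 to 5 together, a minimal PyTorch sketch of scaled dot-product attention on a 3-token toy input (random vectors, illustrative sizes):

```python
import torch
import torch.nn.functional as F

# Scaled dot-product attention on a toy 3-token sentence ("elephants are smart").
torch.manual_seed(0)
d_k = 8
Q = torch.randn(3, d_k)                 # one query vector per token (already projected)
K = torch.randn(3, d_k)                 # one key vector per token
V = torch.randn(3, d_k)                 # one value vector per token

scores = Q @ K.T                        # step 2: dot products -> 3x3 score matrix
scores = scores / (d_k ** 0.5)          # step 3: scale by sqrt(d_k)
weights = F.softmax(scores, dim=-1)     # step 4: softmax across the key dimension (rows sum to 1)
output = weights @ V                    # step 5: weighted average of the value vectors
print(output.shape)                     # 3 output vectors, one per query
```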
  • 152.
NLG Head to Toe @hadyelsahar Multi-head Attention (Deeper look). What does it mean to be "multi-head"? Multiple parallel heads, each of which can focus on different things. The embeddings of each input token (size d_model) are projected by each head's own linear layers for Q, K and V; with h parallel heads (h = 2 here) each head is scaled down to size d_model / h, so the total computational cost is similar to that of single-head attention with full dimensionality. The outputs of all heads are concatenated (back to size d_model) and multiplied by W_o to give the output of the "2-head" multi-head attention.
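And a compact sketch of the multi-head version, a simplification of what libraries actually implement; the sizes and helper name split_heads are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads, seq_len = 16, 2, 3
d_head = d_model // n_heads

x = torch.randn(seq_len, d_model)                  # token embeddings
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
W_o = nn.Linear(d_model, d_model, bias=False)

# Project, then split the model dimension into n_heads smaller heads.
def split_heads(t):  # (seq, d_model) -> (heads, seq, d_head)
    return t.view(seq_len, n_heads, d_head).transpose(0, 1)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))
weights = F.softmax(Q @ K.transpose(-2, -1) / d_head ** 0.5, dim=-1)  # per-head attention
heads = weights @ V                                 # (heads, seq, d_head)

# Concatenate the heads back to d_model and apply the output projection W_o.
out = W_o(heads.transpose(0, 1).reshape(seq_len, d_model))
print(out.shape)                                    # (3, 16): same shape as the input
```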
  • 153.
NLG Head to Toe @hadyelsahar Transformer in order: encoder self-attention, residual (skip) connections (in both encoder and decoder), decoder self-attention, encoder-decoder attention, and the output layer. Multi-Head Attention consists of several attention layers running in parallel. Now we know what "self-attention" and "multi-head attention" mean. Still too many parts ...
  • 154.
    NLG Head toToe @hadyelsahar Transformer in order Encoder Self attention Encoder decoder attention Multi-Head Attention Decoder Self attention 1 3 4 Residual (skip) connections 2 2 5 Output layer
  • 155.
NLG Head to Toe @hadyelsahar Embedding layer. "Elephants are smart" → word embeddings + positional embeddings. Since we are not using RNNs, positional embeddings keep word-order information: "Elephants are smart" (positions 0 1 2) vs. "smart are elephants" (the same words now at positions 2 1 0) get different representations. Even indices of the positional vector are built with the sin function; odd indices with the cos function. Motivation: this should allow the model to easily learn to attend by relative positions. In practice it performs on par with learned positional embeddings, and it allows longer sequences at test time than those seen during training.
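A sketch of the sinusoidal positional encoding described above; the function name is mine, not from the paper, and the sizes are illustrative.

```python
import math
import torch

def sinusoidal_positions(n_positions, d_model):
    """Sinusoidal positional encodings as in 'Attention Is All You Need':
    even dimensions use sin, odd dimensions use cos."""
    pe = torch.zeros(n_positions, d_model)
    pos = torch.arange(n_positions, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)   # even indices
    pe[:, 1::2] = torch.cos(pos * div)   # odd indices
    return pe

pe = sinusoidal_positions(n_positions=6, d_model=8)
word_embeddings = torch.randn(6, 8)      # hypothetical embeddings for a 6-token sentence
x = word_embeddings + pe                 # positional information is simply added
print(x.shape)
```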
  • 156.
NLG Head to Toe @hadyelsahar Transformer in order: encoder self-attention, residual (skip) connections, decoder self-attention, encoder-decoder attention. Multi-Head Attention consists of several attention layers running in parallel.
  • 157.
    NLG Head toToe @hadyelsahar Encoder Self-attention Multi-Head Attention Elephants are smart The input will be copied 3 times as Q, K and V
  • 158.
    NLG Head toToe @hadyelsahar Encoder Residual connections Elephants Deep residual learning for image recognition CVPR 2015 https://arxiv.org/pdf/1512.03385.pdf Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth” Residual (skip) connections The signal flows from bottom to top of the layer (skip). are smart
  • 159.
NLG Head to Toe @hadyelsahar Transformer in order: encoder self-attention, residual (skip) connections, decoder self-attention, encoder-decoder attention, output layer. Multi-Head Attention consists of several attention layers running in parallel.
  • 160.
NLG Head to Toe @hadyelsahar Transformer in order: encoder self-attention, residual (skip) connections, decoder masked self-attention, encoder-decoder attention, output layer. Multi-Head Attention consists of several attention layers running in parallel.
  • 161.
NLG Head to Toe @hadyelsahar Decoder (Masked) Self-attention. Imagine a machine translation example. Source (input): "elephants are smart". Target (output): "éléphants sont intelligents". We are at time step 2: the Transformer has generated a probability distribution corresponding to the correct token "éléphants", and it is now expected to generate a probability distribution corresponding to the token "sont". Decoder input: <BOS> éléphants sont intelligents.
  • 162.
NLG Head to Toe @hadyelsahar Decoder (Masked) Self-attention. Imagine a machine translation example. Source (input): "elephants are smart". Target (output): "éléphants sont intelligents". We are at time step 2: the Transformer has generated a probability distribution corresponding to the correct token "éléphants", and it is now expected to generate a probability distribution corresponding to the token "sont". Problem: the answer is already given! If the decoder can see the whole target "<BOS> éléphants sont intelligents", the model has nothing to predict; it will just echo the input to the output.
  • 163.
NLG Head to Toe @hadyelsahar Decoder (Masked) Self-attention. Masked self-attention prevents the decoder from looking ahead. This is done inside the decoder's multi-head self-attention: the Q x K⊤ score matrix over the decoder tokens "<BOS> éléphants sont intelligents" (e.g. [[1, 4, 4, 1], [4, 3, 2, 1], [4, 2, 3, 1], [1, 1, 1, 2]]) is added to a mask with 0 on and below the diagonal and -inf above it: [[0, -inf, -inf, -inf], [0, 0, -inf, -inf], [0, 0, 0, -inf], [0, 0, 0, 0]]. Time step 1: when the target word is "éléphants" you can only get values corresponding to "<BOS>". Time step 2: when the target word is "sont" you can only see "<BOS>" and "éléphants".
  • 164.
NLG Head to Toe @hadyelsahar Decoder (Masked) Self-attention. With the full causal mask [[0, -inf, -inf, -inf], [0, 0, -inf, -inf], [0, 0, 0, -inf], [0, 0, 0, 0]], at time step 4, when the target word is "<EOS>", you can see the whole decoder input "<BOS> éléphants sont intelligents".
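A minimal sketch of the causal mask in code (toy sizes, random scores):

```python
import torch
import torch.nn.functional as F

# Masked (causal) self-attention scores for the 4 decoder positions
# "<BOS> éléphants sont intelligents".
torch.manual_seed(0)
seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)

scores = Q @ K.T / d_k ** 0.5

# Upper-triangular mask: position t may only attend to positions <= t.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))   # -inf becomes 0 after the softmax

weights = F.softmax(scores, dim=-1)
print(weights)   # row t has non-zero weights only on columns 0..t
```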
  • 165.
NLG Head to Toe @hadyelsahar Transformer in order: encoder self-attention, residual (skip) connections, decoder masked self-attention, encoder-decoder attention, output layer. Multi-Head Attention consists of several attention layers running in parallel.
  • 166.
NLG Head to Toe @hadyelsahar Transformer in order: encoder self-attention, residual (skip) connections, decoder masked self-attention, encoder-decoder attention, output layer. Multi-Head Attention consists of several attention layers running in parallel.
  • 167.
NLG Head to Toe @hadyelsahar Decoder-Encoder Attention. Encoder-decoder attention: decoder side "<BOS> éléphants sont intelligents", encoder side "Elephants are smart". Embeddings from the last layer of the encoder are used as keys and values; embeddings from layer N of the decoder are used as queries.
  • 168.
    NLG Head toToe @hadyelsahar Transformer in order Encoder Self attention Encoder decoder attention Decoder Self attention 1 3 5 Residual (skip) connections 2 Almost there! Output layer
  • 169.
NLG Head to Toe @hadyelsahar Output Layer + Loss Function. Decoder input: <BOS> éléphants sont intelligents. The output of each Transformer layer has the same dimension d_model as the encoded input. A linear transformation of shape d_model x |Vocab| maps each position to |Vocab| logits, and a softmax turns them into a distribution over the vocabulary: éléphants sont intelligents <EOS>. There is no RNN here: feed all output tokens at once and compute the loss; MASKED self-attention takes care of the illegal connections. The targets are the reference sentence (one-hot encoded) delayed by one time step, with the same cross-entropy loss as before.
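A short sketch of the output layer and the shifted-target cross-entropy loss (toy sizes, made-up token ids):

```python
import torch
import torch.nn as nn

# Project decoder outputs to vocabulary logits, then apply cross-entropy against the
# reference shifted by one time step. All sizes are illustrative.
d_model, vocab_size, seq_len = 16, 50, 4

decoder_output = torch.randn(seq_len, d_model)   # one vector per target position
to_vocab = nn.Linear(d_model, vocab_size)        # linear transformation d_model -> |Vocab|
logits = to_vocab(decoder_output)                # softmax is folded into the loss below

# Decoder input would be "<BOS> éléphants sont intelligents"; the targets are the same
# reference delayed by one step, e.g. "éléphants sont intelligents <EOS>" (made-up ids).
reference = torch.tensor([7, 12, 3, 1])
loss = nn.CrossEntropyLoss()(logits, reference)
loss.backward()
```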
  • 170.
    NLG Head toToe @hadyelsahar Question Time Encoder Self attention Encoder decoder attention Multi-Head Attention Decoder Self attention 1 3 4 Residual (skip) connections 2 2 5 Output layer
  • 171.
    NLG Head toToe @hadyelsahar Q: Transformers are seq2seq can we use them for unconditional LM?
  • 172.
    NLG Head toToe @hadyelsahar Q: Transformers are seq2seq can we use them for unconditional LM? Yes but redundant Elephants are Elephants are Smart
  • 173.
NLG Head to Toe @hadyelsahar Q: Transformers are seq2seq; can we use them for unconditional LM? Yes, but the full encoder-decoder is redundant. GPT uses a decoder-only Transformer: Masked Multi-Head Attention, Add & Norm, Feed Forward; no encoder and no decoder-encoder attention. ("Elephants are" → "Elephants are smart"). Last technical slide.
  • 174.
NLG Head to Toe @hadyelsahar Q: How do such "transform"ative ideas come up? A: Good teamwork between 8 authors. FYI - Break. "Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, our initial codebase, and efficient inference and visualizations. Lukasz and Aidan [worked on] designing various parts of and implementing the tensor2tensor library."
  • 175.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Part 2: Stuff with Attention - Conditional Language Model - Seq2seq Encoder-Decoder Models - Seq2seq Encoder-Decoder with Attention - Transformers (Self-Attention) - Open Problems of NLG Models trained on uncurated Web text
  • 176.
    NLG Head toToe @hadyelsahar NLG head to toe- @Hadyelsahar Large Language Models And Their Dangers
  • 177.
NLG Head to Toe @hadyelsahar 177 2018 GPT Gebru et al. FAccT 2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al. 15 Sutskever et al. 14 Hermann et al. NeurIPS 2015 Rush et al. EMNLP 2015 GPT: "Our approach is a combination of two existing ideas: transformers and unsupervised pre-training." GPT was originally created for NLU!
  • 178.
    NLG Head toToe @hadyelsahar 178 2018 GPT Gebru et al. FACT2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al.15 Sutskever et al. 14 Hermann et al. Neurips 2015 Seq2Seq Attention Abstractive summ. Sources: The Guardian, the next web “As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with.”
  • 179.
    NLG Head toToe @hadyelsahar 179 2018 GPT Gebru et al. FACT2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al.15 Sutskever et al. 14 Hermann et al. Neurips 2015 Seq2Seq Attention Abstractive summ. Big claims on unprecedented generation capabilities!
  • 180.
    NLG Head toToe @hadyelsahar 180 2018 GPT Gebru et al. FACT2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al.15 Sutskever et al. 14 Hermann et al. Neurips 2015 Seq2Seq Attention Abstractive summ. Sources: The Guardian, the next web
  • 181.
NLG Head to Toe @hadyelsahar 181 2018 GPT Gebru et al. FAccT 2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al. 15 Sutskever et al. 14 Hermann et al. NeurIPS 2015 Seq2Seq Attention Abstractive summ. Criticisms of Neural Language Generation! Neural unicorns should be put on a leash.
  • 182.
    NLG Head toToe @hadyelsahar 182 2018 GPT Gebru et al. FACT2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al.15 Sutskever et al. 14 Hermann et al. Neurips 2015 Seq2Seq Attention Abstractive summ. Holtzman et al. ICLR2020 Degeneration
  • 183.
    NLG Head toToe @hadyelsahar 183 2018 GPT Gebru et al. FACT2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 “ NLG Trends 2015 2014 Bahdanau et al.15 Sutskever et al. 14 Hermann et al. Neurips 2015 Seq2Seq Attention Abstractive summ. Symbolic AI & Semantic Correctness “for a human or a machine to learn a language, they must solve what Harnad (1990) calls the symbol grounding problem.” Form vs Meaning Fluency ≠ Semantic Correctness O observes that certain words tend to occur in similar contexts .. learns to generalize across lexical patterns by hypothesizing that they can be used interchangeably. O has never observed these objects, and thus would not be able to pick out the referent of a word when presented with a set of (physical) alternatives.
  • 184.
NLG Head to Toe @hadyelsahar Petroni et al. EMNLP 2019 184 2018 GPT Gebru et al. FAccT 2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al. 15 Sutskever et al. 14 Hermann et al. NeurIPS 2015 Jiang et al. TACL 2020 Seq2Seq Attention Abstractive summ. Factual Correctness Kassner et al. ACL 2020
  • 185.
    NLG Head toToe @hadyelsahar 185 2018 GPT Gebru et al. FACT2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al.15 Sutskever et al. 14 Hermann et al. Neurips 2015 Seq2Seq Attention Abstractive summ. Discussions around AI Ethics 🚨 Gender Shades: Buolamwini 2017
  • 186.
    NLG Head toToe @hadyelsahar 186 2018 GPT Gebru et al. FACT2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al.15 Microsoft Tay (racist Chatbot) Sutskever et al. 14 Hermann et al. Neurips 2015 Seq2Seq Attention Abstractive summ. source: MIT Technologyreview source: https://twitter.com/minimaxir/
  • 187.
NLG Head to Toe @hadyelsahar 187 2018 GPT Gebru et al. FAccT 2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al. 15 Microsoft Tay (racist chatbot) Sutskever et al. 14 Hermann et al. NeurIPS 2015 Seq2Seq Attention Abstractive summ. It is not only about single examples; it is also a distributional bias. Abubakar Abid's keynote at the #MuslimsInAI workshop at NeurIPS 2020 https://twitter.com/shakir_za/status/1336335755656929288?lang=en https://twitter.com/abidlabs/status/1291165311329341440?lang=en
  • 188.
NLG Head to Toe @hadyelsahar 188 2018 GPT Gebru et al. FAccT 2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al. 15 Microsoft Tay (racist chatbot) Sutskever et al. 14 Prates et al. Neural Computation 2019 Hermann et al. NeurIPS 2015 Seq2Seq Attention Abstractive summ. Distributional Bias
  • 189.
    NLG Head toToe @hadyelsahar 189 2018 GPT Gebru et al. FACT2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al.15 Microsoft Tay (racist Chatbot) Sutskever et al. 14 Hermann et al. Neurips 2015 Seq2Seq Attention Abstractive summ. Distributional Bias Stanovsky et al. ACL2019
  • 190.
    NLG Head toToe @hadyelsahar 190 2018 GPT Gebru et al. FACT2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al.15 Microsoft Tay (racist Chatbot) Sutskever et al. 14 Hermann et al. Neurips 2015 Seq2Seq Attention Abstractive summ. Distributional Bias (Open ended NLG ) Sheng et al. EMNLP 2019 Sentiment
  • 191.
    NLG Head toToe @hadyelsahar 191 2018 GPT Gebru et al. FACT2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al.15 Microsoft Tay (racist Chatbot) Sutskever et al. 14 Hermann et al. Neurips 2015 Seq2Seq Attention Abstractive summ. Distributional Bias (cloze style) Nadeem et al. 2020
  • 192.
NLG Head to Toe @hadyelsahar 192 2018 GPT Gebru et al. FAccT 2021 Stochastic parrots Transformers 2020 2019 2021 2017 2016 NLG Trends 2015 2014 Bahdanau et al. 15 Microsoft Tay (racist chatbot) Sutskever et al. 14 Hermann et al. NeurIPS 2015 Seq2Seq Attention Abstractive summ. Timnit Gebru [left] and Margaret Mitchell [right] were fired from Google over the "Stochastic Parrots" paper. https://www.wired.com/story/second-ai-researcher-says-fired-google/ Read the paper (Bender et al. FAccT 2021)
  • 193.
NLG Head to Toe @hadyelsahar That's all, folks! Hady Elsahar @hadyelsahar Hady.elsahar@naverlabs.com Help me make this tutorial better: please participate in this anonymous survey: https://forms.gle/Xr93EFiY2zStksMK8 Also reach out with feedback or questions.