What Do Neural Models "Know" About Natural Language?
Ekaterina Vylomova
1943: Artificial Neuron (McCulloch-Pitts)
... or, in other words, $\hat{y} = f\!\left(\sum_{i=1}^{n} w_i x_i + b\right)$,
and the activation function might be the sigmoid: $\mathrm{sig}(x) = \frac{1}{1 + e^{-x}}$
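As a minimal illustration (not from the original slides), the weighted sum and sigmoid activation above take only a few lines of NumPy; the inputs, weights, and bias below are arbitrary example values.

```python
import numpy as np

def sigmoid(x):
    # sig(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def neuron(x, w, b):
    # y_hat = f(sum_i w_i * x_i + b), with f = sigmoid
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # example inputs
w = np.array([0.8, 0.2, -0.5])   # example weights
b = 0.1                          # bias
print(neuron(x, w, b))           # a value in (0, 1)
```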
1957: Simple Perceptron
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
Trained with a trial-and-error method
It can:
– generalize over characters
– discover character-specific features
But:
– it failed to recognize badly written, differently sized, or partially closed characters
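A sketch of the classic perceptron update rule (my illustration of the trial-and-error idea, not Rosenblatt's original hardware procedure): weights are nudged only when the prediction disagrees with the target.

```python
import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    """X: (n_samples, n_features), y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            # update only on mistakes ("trial and error")
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return w, b

# linearly separable toy data: the AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
```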
1960s: Single-Layer Perceptron
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
Perceptrons: An Introduction to Computational Geometry
XOR Problem: a single-layer perceptron cannot represent XOR, since XOR is not linearly separable
1980s: Multi-Layer Perceptrons with Back-Propagation
Learning Internal Representations by Error Propagation
Solves problems that are not linearly separable
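A compact sketch (hyperparameters are my own choices, not from the slides) of a one-hidden-layer network trained with back-propagation on XOR, the case a single-layer perceptron cannot solve; with these settings it usually converges.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)           # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)             # hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)             # output layer
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # forward pass
    h = sig(X @ W1 + b1)
    out = sig(h @ W2 + b2)
    # backward pass (gradients of the squared error)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent updates
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round())  # should approximate [[0], [1], [1], [0]]
```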
1980s: The Past Tense Debate
Rumelhart & McClelland (1985): On learning the past tenses of
English verbs
Pinker & Prince (1988): extremely poor empirical performance!
1990s: RNNs
Finding structure in time
Exploring
– context-dependent learning
– structure in letter sequences
– learning lexical classes from word order
1990s: CNNs
Backpropagation Applied to Handwritten Zip Code Recognition
Training data: 9,298 segmented numerals from U.S. mail
Misclassified: training – 0.14%; test – 5.0%
Meanwhile in NLP: Language Modelling (mostly N-grams with Kneser-Ney smoothing)
OK, Marvin, which word comes next: Two cats are ___
Hmm, let me guess ...
sitting   3.01 × 10⁻⁴
play      2.87 × 10⁻⁴
running   2.53 × 10⁻⁴
nice      2.32 × 10⁻⁴
lost      1.97 × 10⁻⁴
playing   1.66 × 10⁻⁴
sat       1.54 × 10⁻⁴
plays     1.32 × 10⁻⁴
...
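For intuition only: a maximum-likelihood bigram model over a toy corpus. Kneser-Ney smoothing itself is more involved; this sketch just shows how count-based next-word probabilities are estimated and ranked.

```python
from collections import Counter, defaultdict

corpus = "two cats are sitting . two cats are playing . two dogs are running .".split()

# count bigram occurrences
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def next_word_probs(context):
    # P(w | context) by relative frequency (no smoothing)
    counts = bigrams[context]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# which word comes next after "are"?
print(sorted(next_word_probs("are").items(), key=lambda kv: -kv[1]))
```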
2013: Word2Vec Skip-Gram
Distributed Representations of Words and Phrases and their Compositionality
Training Objective
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$
$$p(w_o \mid w_i) = \frac{\exp(v_{w_o}^{\top} v_{w_i})}{\sum_{w=1}^{W} \exp(v_w^{\top} v_{w_i})}$$
For efficiency, the softmax was replaced with Negative Sampling.
Levy et al., 2015 experimented with the positive pointwise mutual information (PMI) matrix and showed that Word2Vec Skip-Gram with negative sampling is an implicit matrix factorization.
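A sketch of the negative-sampling loss for one (centre, context) pair, with made-up random vectors: maximize log σ(v_o·v_i) for the observed pair and log σ(−v_k·v_i) for each sampled negative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_center, v_context, v_negatives):
    """Negative-sampling loss for a single (centre, context) pair."""
    pos = np.log(sigmoid(v_context @ v_center))            # observed pair
    neg = np.sum(np.log(sigmoid(-v_negatives @ v_center)))  # sampled "noise" words
    return -(pos + neg)                                      # minimized during training

d = 100
rng = np.random.default_rng(0)
v_i = rng.normal(scale=0.1, size=d)         # centre word vector
v_o = rng.normal(scale=0.1, size=d)         # observed context vector
v_neg = rng.normal(scale=0.1, size=(5, d))  # 5 negative samples
print(sgns_loss(v_i, v_o, v_neg))
```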
2013: Word2Vec CBOW
Efficient Estimation of Word Representations in Vector Space
Training Objective
$$\frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{[t-c,\,t+c]})$$
$$p(w_o \mid w_i) = \frac{\exp\!\big(v_{w_o}^{\top} \sum_{-c \le j \le c} v_{w_{i+j}}\big)}{\sum_{w=1}^{W} \exp\!\big(v_w^{\top} \sum_{-c \le j \le c} v_{w_{i+j}}\big)}$$
2013: Word2Vec: Word Analogies
Linear Relations and Compositionality: Russia + river =
Volga_river
2013: Word2Vec: Word Analogies
Linear Relations and Compositionality: king − man + woman = queen?
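With pre-trained vectors the analogy can be queried directly; a sketch using gensim's downloader (the `word2vec-google-news-300` checkpoint is assumed available, and the first run downloads a large file).

```python
import gensim.downloader as api

# downloads the pre-trained Google News vectors on first use (~1.6 GB)
wv = api.load("word2vec-google-news-300")

# vec(king) - vec(man) + vec(woman) ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
# 'queen' is typically the top hit
```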
Word Analogies on other embeddings
Word Embeddings, Analogies, and Machine Learning: Beyond King − Man + Woman = Queen
Pre-trained Word2Vec (Google News): Bias and Stereotypes
Man is to Computer Programmer as Woman is to Homemaker?
Word2Vec trained on Reddit data: Bias and Stereotypes
Black is to Criminal as Caucasian is to Police
Data Bias and Stereotypes
Gendered Language
Positive adjectives describing women are often related to their bodies, while positive adjectives
describing men are often related to their behavior.
Word2Vec and similar models
What do the models learn?
Morphology
– Capable of learning inflectional morphology, but not much derivational morphology (less regular and compositional)
Lexical Semantics
– Challenging, especially meronymy, antonymy, and synonymy
Major Difficulties
– Polysemy (all senses of a word collapsed into a single vector)
– Negation
Broader context – back to RNNs!
Neural Machine Translation: Seq2Seq Models (Sutskever et al., 2014)
The resulting LSTM has 384M parameters
– 64M of these are pure recurrent connections
BUT: longer contexts – lower quality (vanishing gradients)
Long Short-Term Memory will solve it!
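A minimal PyTorch sketch of the encoder-decoder idea, with placeholder dimensions and vocabulary sizes; Sutskever et al. used deep 4-layer LSTMs, a reversed source sentence, and beam-search decoding, none of which are shown here.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        # the whole source sentence is compressed into the final state (h, c)
        _, (h, c) = self.encoder(self.src_emb(src))
        # the decoder is conditioned on that single fixed-size state
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.out(dec_out)   # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))   # toy batch of source token ids
tgt = torch.randint(0, 1000, (2, 5))   # shifted target token ids
logits = model(src, tgt)
```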
Neural Machine Translation: Seq2Seq Models (Sutskever et al., 2014)
PCA projection of the LSTM hidden states for the corresponding sequences
We can also use both directions (a bidirectional encoder for the source language)
Neural Machine Translation: Seq2Seq Models w/ Attention (Bahdanau et al., 2014)
A whole sentence shouldn’t be compressed into a single vector! Use Attention!
Neural Machine Translation: Seq2Seq Models w/ Attention (Bahdanau et al., 2014)
It learns an alignment, and the alignment can be visualized!
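A sketch of additive (Bahdanau-style) attention in NumPy with random placeholder parameters: a score is computed between the decoder state and every encoder state, softmax turns the scores into alignment weights, and the context vector is their weighted sum.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s, H, W_s, W_h, v):
    """s: decoder state (d,);  H: encoder states (T, d)."""
    scores = np.tanh(s @ W_s + H @ W_h) @ v   # (T,) alignment scores
    alpha = softmax(scores)                   # attention weights over source positions
    context = alpha @ H                       # weighted sum of encoder states
    return context, alpha

d, T = 8, 5
rng = np.random.default_rng(0)
s, H = rng.normal(size=d), rng.normal(size=(T, d))
W_s, W_h, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
context, alpha = additive_attention(s, H, W_s, W_h, v)
# alpha can be plotted as one row of the alignment heat map
```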
Neural Machine Translation: Seq2Seq Models w/ Attention (Bahdanau et al., 2014)
What do the models learn?
Belinkov et al., 2018a, 2018b
– Higher layers are better at learning semantics, while lower layers tend to be better for part-of-speech tagging
– Lower layers of the network are better at capturing morphology
Linzen et al., 2018, 2020
English subject-verb agreement:
– LSTMs were able to learn the verb-number agreement task in most cases, although their error rate increased on particularly difficult sentences
– The LM objective by itself is not sufficient for learning structure-sensitive dependencies; the authors suggest a joint training objective
Neural Machine Translation: Seq2Seq Models w/ Attention (Bahdanau et al., 2014)
What do the models learn?
Vylomova et al., 2019
– Contextual inflection in 10 languages: Three little kitten were _sit_ on the mat. Predict: sitting
– Agreement: adjective-noun is OK, subject-verb is more challenging
– Morphological complexity matters (Uralic languages are more challenging than Germanic ones)
– Inherent vs. contextual categories: inherent ones (e.g., tense, noun number) cannot be predicted without agreement or an extra signal
Back to the Past Tense Debate: Seq2Seq Models w/ Attention
Kirov & Cotterell, 2018: the model obviates most of Pinker and Prince’s criticisms
SIGMORPHON 2016 Shared Task
Task 1: run + V;PRES;3SG → runs
Languages: Arabic, Finnish, Georgian, German, Hungarian, Maltese, Navajo, Russian, Spanish
Lake et al., 2018: Compositionality of RNNs
SCAN: a simplified version of the CommAI Navigation tasks
– Successful zero-shot generalization when the differences between training and test commands are small
– Trained on "run", "jump", and "run twice", the model fails on "jump twice"
Contextualized Embeddings: Addressing the Polysemy Problem!
Context matters! ELMo: let’s make context-specific embeddings!
Features
– Two independent(!) LSTMs (a forward and a backward language model)
– Pre-trained embeddings
– Task-specific weighted sum of embeddings (two hidden states + the word vector)
Self-Attention (Cheng et al., 2016)
Relate parts of a single sequence to each other to compute its representation
– Shows similarity to other parts of the sequence
– Helpful for coreference resolution!
Contextualized Embeddings
Transformer: Attention Is All You Need
Features
– No recurrence, but a wide context window (somewhat similar to CNNs)
– Positional embeddings (to encode token positions)
– Self-attention with several heads and separate key, query, and value matrices (plus masking)
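A single-head sketch of the Transformer's scaled dot-product self-attention in NumPy, on random toy inputs; real models add multiple heads, masking, and learned projections per layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); one attention head, no mask."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)        # each token attends over all tokens
    return weights @ V                        # contextualized representations

d_model, d_head, seq_len = 16, 8, 6
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (6, 8)
```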
Contextualized Embeddings
BERT: Deep Bidirectional Transformers
Features
– Trained on masked token prediction + next sentence prediction (binary)
– WordPiece (a BPE-style subword) tokenization
– Window: 512 tokens; the [CLS] token is used for classification
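Contextual embeddings can be inspected with the Hugging Face transformers library; a sketch assuming the `bert-base-uncased` checkpoint. The two occurrences of "mouse" below receive different, context-dependent vectors.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The mouse ran under the table.",
             "Click the left mouse button twice."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, seq_len, 768)

print(hidden.shape)  # one 768-dim vector per token, per sentence
```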
Contextualized Embeddings: Word Sense Disambiguation
Word Sense Disambiguation
"A mouse consists of an object held in one’s hand, with one or more buttons."
"Mouse" – an electronic device
Contextualized Embeddings: Coreference Resolution
Coreference resolution task
The secretary called the physician and told _him_ about a new patient.
him → physician
Contextualized Embeddings: Coreference Resolution
Gender Bias in Coreference Resolution
WinoBias: Winograd-schema-style sentences with entities corresponding to people referred to by their occupation
Contextualized Embeddings: Bias, bias, bias
Zhao et al., 2019
– A SOTA coreference system that depends on ELMo inherits its bias and demonstrates significant bias on WinoBias
– The training data for ELMo contains significantly more male than female entities
– The trained ELMo embeddings systematically encode gender information
– ELMo unequally encodes gender information about male and female entities
Contextualized Embeddings: What does BERT know (Rogers et al., 2020)?
Syntax
– Representations are hierarchical rather than linear and encode POS and syntactic roles (Liu et al., 2019a,b)
– Does not “understand” negation and is insensitive to malformed input (Ettinger, 2019)
Semantics
– Has some knowledge of semantic roles (Ettinger, 2019)
– Struggles with representations of numbers (floating point; Wallace et al., 2019b)
World Knowledge
– Cannot reason over its world knowledge (“A dog entered the room” does not let it infer that the room is larger than the dog)
Extra resources
NLP Progress
Hugging Face – Models
"Embeddings in Natural Language Processing" book
"Dive into Deep Learning" interactive book