Deep contextualized word representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
NAACL-HLT 2018
Overview
• Propose a new type of deep contextualised word representation (ELMo) that models:
‣ complex characteristics of word use (e.g., syntax and semantics)
‣ how these uses vary across linguistic contexts (i.e., to model polysemy)
• Show that ELMo can improve existing neural models on various NLP tasks
• Argue that ELMo captures more abstract linguistic characteristics in its higher layers
Example
GloVe mostly learns the sport-related context, whereas ELMo can distinguish the word sense based on the context.
Method
• Embeddings from Language Models: ELMo
• Learn word embeddings through building bidirectional language models (biLMs)
‣ biLMs consist of forward and backward LMs
✦ Forward: $p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$
✦ Backward: $p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)$
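A minimal PyTorch sketch of these two factorisations, assuming a generic token-level LSTM LM (layer sizes, module names, and the helper function are illustrative, not the paper's exact character-CNN configuration):

```python
import torch
import torch.nn as nn

class DirectionalLM(nn.Module):
    """One direction of a biLM: embed tokens, run stacked LSTMs,
    and predict the next token (forward) or previous token (backward)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)                 # logits: (batch, seq_len, vocab)

def bilm_loss(fwd_lm, bwd_lm, tokens):
    """Joint negative log-likelihood of the forward and backward LMs.
    The forward LM predicts t_k from t_1..t_{k-1}; the backward LM predicts
    t_k from t_{k+1}..t_N, implemented here by reversing the sequence."""
    ce = nn.CrossEntropyLoss()
    # Forward direction: inputs t_1..t_{N-1}, targets t_2..t_N
    fwd_logits = fwd_lm(tokens[:, :-1])
    fwd = ce(fwd_logits.reshape(-1, fwd_logits.size(-1)), tokens[:, 1:].reshape(-1))
    # Backward direction: same objective on the reversed sequence
    rev = tokens.flip(dims=[1])
    bwd_logits = bwd_lm(rev[:, :-1])
    bwd = ce(bwd_logits.reshape(-1, bwd_logits.size(-1)), rev[:, 1:].reshape(-1))
    return fwd + bwd
```

In the paper the two directions share the token representation and softmax parameters while keeping separate LSTM parameters; the sketch keeps them fully separate for brevity.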
Method
With a long short-term memory (LSTM) network, the biLMs are built by predicting the next word in both directions.
[Figure: the forward LM architecture, expanded in the forward direction at position $k$: the embedding layer ($x_k$), stacked LSTM hidden layers ($h^{LM}_{k,1}, h^{LM}_{k,2}$), and the output layer ($o_k$) predicting the next token, illustrated on "have a nice one ...".]
Method
ELMo represents a word $t_k$ as a linear combination of the corresponding hidden layers (including its embedding).
[Figure: the forward and backward LMs run over the token $t_k$; their hidden layers are concatenated and combined with the task-specific weights $s^{task}_0, s^{task}_1, s^{task}_2$.]

$\mathrm{ELMo}^{task}_k = \gamma^{task} \sum_{j} s^{task}_j \, \mathbf{h}^{LM}_{k,j}$

where $\mathbf{h}^{LM}_{k,0} = [\mathbf{x}_k ; \mathbf{x}_k]$ is the token embedding layer and $\mathbf{h}^{LM}_{k,j} = [\overrightarrow{\mathbf{h}}^{LM}_{k,j} ; \overleftarrow{\mathbf{h}}^{LM}_{k,j}]$ concatenates the forward and backward LM hidden states at layer $j$.
Unlike usual word embeddings, ELMo is assigned to every token instead of every type.
ELMo is a task-specific representation: the downstream task learns the weighting parameters $s^{task}_j$ and $\gamma^{task}$, while the biLMs stay fixed.
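A minimal sketch of that weighted combination, assuming the per-layer representations $\mathbf{h}^{LM}_{k,j}$ have already been computed and stacked (module and tensor names are illustrative):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific ELMo combination:
    ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}."""
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))   # s^task, learned by the task
        self.gamma = nn.Parameter(torch.ones(1))          # gamma^task, learned by the task

    def forward(self, layer_reps):
        # layer_reps: (num_layers, batch, seq_len, dim), one slice per biLM layer
        weights = torch.softmax(self.s, dim=0)
        mixed = (weights.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)
        return self.gamma * mixed                          # (batch, seq_len, dim)
```

Normalising $s^{task}$ with a softmax keeps the combination a convex mixture over layers, while $\gamma^{task}$ lets the task rescale the whole ELMo vector to suit its own optimisation.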
Method
ELMo can be integrated into almost all neural NLP tasks by simply concatenating it to the embedding layer (see the sketch below).
[Figure: the usual inputs from the corpus ("have a nice ...") are enhanced with ELMo vectors produced by the pre-trained biLMs before the task model is trained.]
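A minimal sketch of that integration for a hypothetical sequence-tagging model: the `elmo` module below stands in for the frozen biLMs plus the scalar mix sketched earlier, and all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ELMoEnhancedTagger(nn.Module):
    """Hypothetical tagger: the task's usual word embeddings are
    concatenated with ELMo vectors before the task encoder."""
    def __init__(self, vocab_size, emb_dim, elmo_dim, hidden_dim, num_tags, elmo):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # usual (static) embeddings
        self.elmo = elmo                                  # biLMs + scalar mix, as above
        self.encoder = nn.LSTM(emb_dim + elmo_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_tags)

    def forward(self, tokens):
        x = self.embed(tokens)                   # (batch, seq_len, emb_dim)
        e = self.elmo(tokens)                    # (batch, seq_len, elmo_dim)
        enhanced = torch.cat([x, e], dim=-1)     # enhanced input [x_k ; ELMo_k]
        hidden, _ = self.encoder(enhanced)
        return self.classifier(hidden)           # per-token tag logits
```

The paper also notes that for some tasks it helps to additionally add ELMo at the output of the task RNN, but the embedding-layer concatenation shown here is the default recipe.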
Evaluation
Many linguistic tasks are improved by using ELMo:
Q&A
Textual entailment
Sentiment analysis
Semantic role labelling
Coreference resolution
Named entity recognition
Analysis
The higher layers seemed to learn semantics, while the lower layers probably captured syntactic features.
[Results by biLM layer: word sense disambiguation and PoS tagging]
Analysis
The higher layers seemed to learn semantics, while the lower layers probably captured syntactic features???
Most models preferred the (probably) "syntactic" features, even in sentiment analysis.
Analysis
ELMo-enhanced models can make use of small
datasets more efficiently
[Figures: textual entailment and semantic role labelling with varying amounts of training data]
Comments
• Pre-trained ELMo models are available at https://allennlp.org/elmo (loading sketch below)
‣ AllenNLP is a deep NLP library on top of PyTorch
‣ AllenNLP is a product of AI2 (the Allen Institute for Artificial Intelligence), which also works on other interesting projects such as Semantic Scholar
• ELMo can process character-level inputs
‣ Japanese (Chinese, Korean, …) ELMo models are therefore likely to be feasible
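A minimal loading sketch using AllenNLP's `Elmo` module (the options/weights file names below are placeholders for the files linked from allennlp.org/elmo):

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: download the actual files linked from https://allennlp.org/elmo
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# One scalar-mixed output representation; no dropout at inference time
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

# Inputs are built from characters, so any token string works (hence the note on
# Japanese/Chinese/Korean models above)
sentences = [["have", "a", "nice", "one"], ["the", "play", "was", "fun"]]
character_ids = batch_to_ids(sentences)

outputs = elmo(character_ids)
embeddings = outputs["elmo_representations"][0]   # (batch, seq_len, dim)
```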
