The document provides an overview of recent advances in natural language processing (NLP), from traditional methods such as bag-of-words models and word2vec to more recent contextualized word-embedding techniques such as ELMo and BERT. It discusses NLP applications such as text classification, language modeling, machine translation, and question answering, and how recurrent neural networks, convolutional neural networks, and transformer models are used for them.
5. Automatic Text Generation
David Gascoyne, http://botpoet.com
The crow crooked on more beautiful and free,
He journeyed off into the quarter sea.
His radiant ribs girdled empty and very
least beautiful as dignified to see.
The smooth plain with its mirrors listens to the cliff
Like a basilisk eating flowers.
And the children, lost in the shadows of the catacombs,
Call to the mirrors for help:
“Strong-bow of salt, cutlass of memory,
Write on my map the name of every river.”
6. Natural Language Understanding
“Alexa, remind me to buy groceries after work”
Intent detection: Create Reminder
Slot filling: What (“buy groceries”), When (“after work”), Where
7. Machine Translation
English input: Sometimes, in the morning, I wonder whether AI bots will kill us all
Japanese output: 時々、午前中に、AIボットが私たち全員を殺すのだろうか?
Text Summarization
A Neural Attention Model for Abstractive Sentence Summarization, Alexander M. Rush et al. 2015
8. Question Answering:
“Who was president when Barack Obama was born?”
John Fitzgerald Kennedy
Part of speech tagging
Sentence similarity
Commonsense Reasoning
Coreference Resolution
…
10. Text Pre-Processing
Original: I’d love to drive again in the mountainous roads of Crete.
Normalization: I would love to drive again in the mountainous roads of crete.
Tokenization: I · would · love · to · drive · again · in · the · mountainous · roads · of · crete · .
Stop-word removal: would · love · drive · again · mountainous · roads · crete · .
Lemmatization: would · love · drive · again · mountain · road · crete · .
Pipeline: Normalization → Tokenization → Stop-word removal → Lemmatization (a runnable sketch follows).
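A minimal sketch of this pipeline, assuming NLTK is available with the punkt, stopwords, and wordnet data packages; the exact output depends on the stop-word list and lemmatizer used:

```python
# Illustrative pre-processing pipeline with NLTK (assumed dependency).
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

sentence = "I'd love to drive again in the mountainous roads of Crete."

# Normalization: lowercase and expand the contraction (hand-rolled for this example)
normalized = sentence.lower().replace("i'd", "i would")

# Tokenization
tokens = nltk.word_tokenize(normalized)

# Stop-word removal
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# Lemmatization (WordNet lemmatizes nouns by default, so "mountainous" may stay as-is)
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]

# e.g. ['would', 'love', 'drive', 'mountainous', 'road', 'crete', '.']
# (which words survive depends on the exact stop-word list)
print(tokens)
```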
12. Grapheme/Token representation: One-Hot encoding
Define each dictionary word as a vector:
I’d love to drive … → (preprocessing) → would · love · drive
Dictionary: {drive, love, would}
drive → [1, 0, 0]
love → [0, 1, 0]
would → [0, 0, 1]
(A sketch of this encoding follows.)
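A minimal sketch of one-hot encoding over the toy three-word dictionary above (NumPy assumed):

```python
import numpy as np

dictionary = ["drive", "love", "would"]               # toy dictionary from the slide
word_to_index = {w: i for i, w in enumerate(dictionary)}

def one_hot(word):
    """|V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(dictionary))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("drive"))  # [1. 0. 0.]
print(one_hot("would"))  # [0. 0. 1.]
```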
13. Sentence representation: Bag of words
Sum of the one-hot encoded word vectors of the sentence:
I’d love to drive … → (preprocessing) → would · love · drive
Dictionary: {drive, love, would}, dictionary size = 3
Bag-of-words vector: [1, 1, 1]
If dictionary size >>> 1, the vector is very sparse: almost all entries are 0, with a few small counts (a sketch follows).
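A minimal sketch of the bag-of-words sum, reusing the one_hot() helper from the previous sketch:

```python
import numpy as np

tokens = ["would", "love", "drive"]     # "I'd love to drive" after pre-processing
bow = np.sum([one_hot(t) for t in tokens], axis=0)
print(bow)  # [1. 1. 1.] with the toy 3-word dictionary; mostly zeros for a real vocabulary
```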
16. Word order matters
• Context-dependent information
• The place of the word in the sentence matters
“My kindle is easy to use, I do not need help”
“I do need help, my kindle is not easy to use”
18. Word2vec: Efficient Estimation of Word Representations in Vector Space, Mikolov et al. 2013
Learn word embeddings:
Skip-gram: predict the context words given the center word
Continuous Bag of Words (CBOW): predict the center word given its context (see the sketch below)
CBOW model example: “… The cake is a lie …”
Context words at t-2 and t-1 (“The”, “cake”) and at t+1 and t+2 (“a”, “lie”); word to predict at t (“is”)
Estimate: P(w_t | w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})
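A minimal sketch of training CBOW and skip-gram embeddings with gensim (assuming gensim 4.x; the toy corpus below is only for illustration, not from the slides):

```python
from gensim.models import Word2Vec

corpus = [["the", "cake", "is", "a", "lie"],
          ["the", "cake", "is", "delicious"]]

# sg=0 -> CBOW: predict the center word from its context
# sg=1 -> skip-gram: predict context words from the center word
cbow = Word2Vec(corpus, vector_size=50, window=2, sg=0, min_count=1)
skipgram = Word2Vec(corpus, vector_size=50, window=2, sg=1, min_count=1)

print(cbow.wv["cake"].shape)         # (50,) -- the learned embedding for "cake"
print(cbow.wv.most_similar("cake"))  # nearest neighbours (meaningless on a toy corpus)
```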
22. Using Word Representations in Neural Networks
“Amazon is amazing” → indexing → 2910 · 79 · 1927 → lookup → W_2910, W_79, W_1927 → neural layers → output layer
The embedding matrix has |V| rows, W_1 … W_|V|, one per vocabulary word, each of dimension N.
{W_i} are the word embeddings. They are parameters that the neural network can update during training, and they can also be initialized from pre-trained embeddings (a lookup sketch follows).
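A minimal sketch of the indexing-and-lookup step with a trainable embedding layer in PyTorch; the vocabulary size and embedding dimension below are assumptions:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50000, 300        # |V| rows, each of dimension N
embedding = nn.Embedding(vocab_size, embed_dim)

# "Amazon is amazing" -> indices from a vocabulary lookup (values from the slide)
indices = torch.tensor([2910, 79, 1927])
vectors = embedding(indices)              # shape (3, 300): W_2910, W_79, W_1927

# The rows of embedding.weight are parameters: they receive gradients during
# training, and can also be initialized from pre-trained embeddings.
print(vectors.shape)
```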
25. Convolutional Neural Network for Text Classification
Source: Character-level Convolutional Networks for Text Classification, Zhang et al. 15
The character embeddings are convolved along the time (sequence) dimension (a minimal sketch follows).
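A minimal PyTorch sketch, not the Zhang et al. architecture, that only illustrates convolving character embeddings along the time dimension and classifying the pooled result:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, n_chars=70, embed_dim=16, n_filters=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=7)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, char_ids):                   # (batch, time)
        x = self.embed(char_ids).transpose(1, 2)   # (batch, embed_dim, time)
        x = torch.relu(self.conv(x))               # convolve over the time dimension
        x = x.max(dim=2).values                    # max-pool over time
        return self.fc(x)                          # class logits

logits = CharCNN()(torch.randint(0, 70, (8, 128)))  # 8 texts of 128 characters
print(logits.shape)                                  # torch.Size([8, 2])
```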
34. Limitations
• Rare words are not well represented, or are simply mapped to <UNK>
Partial solutions (see the character n-gram sketch below):
• fastText and the sum of subword embeddings
• Character n-grams
• Byte Pair Encoding (BPE)
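A minimal sketch of fastText-style character n-grams; the char_ngrams helper and the n-gram range are illustrative choices, not fastText's actual code:

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Decompose a word into character n-grams with '<'/'>' boundary markers.
    A rare word's vector can then be the sum of its n-gram vectors."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ...]
```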
35. Limitations
Polysemy: the meaning of a word such as “Java” or “Python” depends on the context:
• I love travelling. I am going to explore Java.
https://en.wikipedia.org/wiki/Java
36. Limitations
Context can be bidirectional:
I went to the bank, to drop off some money
39. ELMo Embeddings (Peters et al. 18)
Contextualized word embeddings.
Input embeddings come from a character CNN (Char-CNN); the token representations R_1, R_2, R_3, …, R_n are a learnt linear combination of the biLM hidden states.
Fine-tuning: the representations feed a task-specific neural network (a sketch of the layer combination follows).
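A minimal NumPy sketch of the learnt linear combination of hidden states; the layer count, dimensions, and the random values standing in for biLM outputs are placeholders:

```python
import numpy as np

n_layers, n_tokens, dim = 3, 5, 1024        # e.g. char-CNN layer + 2 biLSTM layers
hidden_states = np.random.randn(n_layers, n_tokens, dim)  # stand-in for biLM outputs

w = np.random.randn(n_layers)               # learned scalar weight per layer
s = np.exp(w) / np.exp(w).sum()             # softmax-normalized layer weights
gamma = 1.0                                 # learned task-specific scale

# Per-token ELMo vector: gamma * sum_j s_j * h_{t,j}
elmo = gamma * np.einsum("l,ltd->td", s, hidden_states)
print(elmo.shape)                           # (5, 1024)
```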
40. ELMo Embeddings (Peters et al. 18)
Pre-training on a language-model task: input sentence → ELMo → softmax layer → output (probabilities over V)
Training on your task: input sentence → ELMo → your task-specific NN → output
41. BERT (Devlin et al. 18)
Bidirectional Encoder Representations from Transformers (BERT)
42. BERT (Devlin et al. 18)
Inspired by “Improving Language Understanding by Generative Pre-Training”, Radford et al. 2018 (OpenAI GPT-1)
Based on Transformer and the multi-head self-attention model
Source: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
43. BERT (Devlin et al. 18)
Self-attention: “The apple is red, it is delicious”
Every token attends to every other token of the sentence, so a word like “it” can attend to “The apple” (a sketch follows).
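A minimal NumPy sketch of scaled dot-product self-attention (single head, random weights); this is not BERT's exact multi-head implementation, only the mechanism that lets "it" attend to "The apple":

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n_tokens, n_tokens) attention scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # contextualized vectors + attention map

n_tokens, dim = 8, 64                        # "The apple is red , it is delicious"
X = np.random.randn(n_tokens, dim)
Wq, Wk, Wv = (np.random.randn(dim, dim) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)                 # (8, 64) (8, 8)
```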
44. BERT (Devlin et al. 18)
BERT input representation: the sum of WordPiece embeddings, sentence (segment) embeddings, and position embeddings, all learned during the (pre-)training process.
In pre-training, 15% of the input tokens are masked (the [MASK] token, with embedding E_MASK) for the masked LM task (a sketch of the masking step follows).
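A minimal, purely illustrative sketch of the masked-LM corruption step; note that the real BERT recipe also keeps some selected tokens unchanged or replaces them with random words, which this sketch omits:

```python
import random

tokens = ["[CLS]", "the", "cake", "is", "a", "lie", "[SEP]"]
maskable = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
n_to_mask = max(1, round(0.15 * len(maskable)))   # roughly 15% of the tokens

masked, targets = list(tokens), {}
for i in random.sample(maskable, n_to_mask):
    targets[i] = masked[i]    # the label the model must recover at this position
    masked[i] = "[MASK]"

print(masked, targets)        # e.g. ['[CLS]', 'the', '[MASK]', ...] {2: 'cake'}
```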
47. BERT (Devlin et al. 18)
Training objectives in slightly modified BERT models for downstream tasks. (Image source: original paper)
Fine-tuning
For classification tasks: take the final hidden state of the [CLS] token, h^L_[CLS], and apply a small weight matrix W: softmax(h^L_[CLS] W) (a sketch follows).
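A minimal NumPy sketch of this classification head; the hidden size and number of classes are assumptions:

```python
import numpy as np

hidden_dim, n_classes = 768, 3
h_cls = np.random.randn(hidden_dim)            # h^L_[CLS]: final hidden state of [CLS]
W = np.random.randn(hidden_dim, n_classes)     # small task-specific weight matrix

logits = h_cls @ W
probs = np.exp(logits) / np.exp(logits).sum()  # softmax(h^L_[CLS] W)
print(probs)                                    # class probabilities summing to 1
```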
48. - No need for a custom neural network for fine-tuning
BERT (Devlin et al. 18)
49. XLNet (Yang et al. 19)
XLNet: Generalized Autoregressive Pretraining for Language
Understanding
Problems with BERT:
1. The [MASK] token used in pre-training never appears during fine-tuning
2. BERT predicts the masked tokens independently of one another
I went to [MASK] [MASK] and saw the [MASK] [MASK] [MASK].
50. XLNet (Yang et al. 19)
XLNet: Generalized Autoregressive Pretraining for Language
Understanding
Bidirectional context through prediction of tokens in randomly permuted factorization orders (a purely illustrative sketch follows)
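A minimal, purely illustrative sketch of the permutation idea: sample a random factorization order and predict each token conditioned only on the tokens already seen in that order (no model is trained here):

```python
import random

tokens = ["I", "went", "to", "New", "York"]
order = list(range(len(tokens)))
random.shuffle(order)                    # e.g. [3, 0, 4, 1, 2]

seen = []
for pos in order:
    context = [tokens[i] for i in seen]  # bidirectional context from the original sentence
    print(f"predict {tokens[pos]!r} at position {pos} given {context}")
    seen.append(pos)
```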
51. OpenAI GPT-2 (Radford et al. 19)
Language Models are Unsupervised Multitask Learners
Source: original paper
Trained on language model task
40GB corpus dataset
1.5B parameters
52. OpenAI GPT-2 (Radford et al. 19)
Language Models are Unsupervised Multitask Learners
All the downstream language tasks are framed as predicting conditional
probabilities and there is no task-specific fine-tuning.
Zero-shot learning:
Summarization:
P(w | “text to summarize” + “ TL;DR: <?>”)
Question Answering:
P(w | “text” + “Q: … A: … Q: … A: <?>”)
Machine Translation:
P(w | “I like computers = J’aime les ordinateurs; I live in Vancouver = <?>”)
(A prompting sketch follows.)
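A minimal sketch of zero-shot prompting through the Hugging Face transformers library (an assumption, not something used in the paper); the public 124M-parameter "gpt2" checkpoint will behave much worse than the 1.5B-parameter model described above:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Translation framed as conditional generation, following the slide's prompt format.
prompt = "I like computers = J'aime les ordinateurs; I live in Vancouver ="
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))
```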
56. Conclusion
Each step pairs a language representation with a form of task-specific adaptation:
- Count-based word representation (tf-idf)
- Learnt word representation (word2vec)
- Contextualized embeddings + custom network (ELMo)
- Sentence embeddings + fine-tuning (BERT, XLNet)
- Zero-shot transfer with a large language model (GPT-2)
57. References
Word embeddings:
LSA - Indexing by latent semantic analysis, Deerwester et al. 1990
Word2Vec - Efficient Estimation of Word Representations in Vector Space, Mikolov et al. 2013
GloVe - GloVe: Global Vectors for Word Representation. Pennington et al. 2014
Subword embeddings
CNN character embedding layer - Character-Aware Neural Language Models, Kim et al. 2015
FastText - Enriching Word Vectors with Subword Information, Bojanowski et al. 2017
WordPiece - Google’s NMT System: Bridging the Gap between Human and Machine Translation, Wu et al. 2016
Contextualized embeddings
ELMo - Deep contextualized word representations, Peters et al. 2018
CoVe - Learned in Translation: Contextualized Word Vectors, McCann et al. 2017
Pre-trained deep learning architecture
Transformer - Attention Is All You Need, Vaswani et al. 2017
OpenAI GPT - Improving Language Understanding by Generative Pre-Training, Radford et al. 2018
BERT - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al. 2018
OpenAI GPT-2 - Language Models are Unsupervised Multitask Learners, Radford et al. 2019
58. 8th March 2018
Thank you!
tdelteil@amazon.com
github.com/ThomasDelteil
twitter.com/thdelteil
Word2vec training uses a softmax cross-entropy loss, minimizing the negative log probability of the context word; it is an unsupervised learning process over large corpora.
BERT pre-training Task 1: masked language model (MLM). Task 2: next sentence prediction.
Note that the first token is always forced to be [CLS], a placeholder that is later used for prediction in downstream tasks.