The document provides an overview of recent advances in natural language processing (NLP), from traditional methods such as bag-of-words models and word2vec to more recent contextualized word-embedding techniques such as ELMo and BERT. It discusses NLP applications such as text classification, language modeling, machine translation, and question answering, and how recurrent neural networks, convolutional neural networks, and transformer models are used for them.
5. Automatic Text Generation
David Gascoyne, http://botpoet.com
The crow crooked on more beautiful and free,
He journeyed off into the quarter sea.
His radiant ribs girdled empty and very
least beautiful as dignified to see.
The smooth plain with its mirrors listens to the cliff
Like a basilisk eating flowers.
And the children, lost in the shadows of the catacombs,
Call to the mirrors for help:
“Strong-bow of salt, cutlass of memory,
Write on my map the name of every river.”
6. Natural Language Understanding
“Alexa, remind me to buy groceries after work”
Intent detection: Create Reminder
Slot filling: What (“buy groceries”), When (“after work”), Where
7. Machine Translation
English input: Sometimes, in the morning, I wonder whether AI bots will kill us all
Japanese output: 時々、午前中に、AIボットが私たち全員を殺すのだろうか?
Text Summarization
A Neural Attention Model for Abstractive Sentence Summarization, Alexander M. Rush et al. 2015
8. Question Answering:
“Who was president when Barack Obama was born?”
John Fitzgerald Kennedy
Part of speech tagging
Sentence similarity
Commonsense Reasoning
Coreference Resolution
…
10. Text Pre-Processing
Original: I’d love to drive again in the mountainous roads of Crete.
Normalization: I would love to drive again in the mountainous roads of crete.
Tokenization: I · would · love · to · drive · again · in · the · mountainous · roads · of · crete · .
Stop-word removal: would · love · drive · again · mountainous · roads · crete · .
Lemmatization: would · love · drive · again · mountain · road · crete · .
Pipeline: Normalization → Tokenization → Stop-word removal → Lemmatization (a runnable sketch follows).
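A minimal sketch of this pipeline, assuming NLTK is available with the punkt, stopwords, and wordnet data packages; the exact output depends on the stop-word list and lemmatizer used:

```python
# Illustrative pre-processing pipeline with NLTK (assumed dependency).
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

sentence = "I'd love to drive again in the mountainous roads of Crete."

# Normalization: lowercase and expand the contraction (hand-rolled for this example)
normalized = sentence.lower().replace("i'd", "i would")

# Tokenization
tokens = nltk.word_tokenize(normalized)

# Stop-word removal
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# Lemmatization (WordNet lemmatizes nouns by default, so "mountainous" may stay as-is)
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]

# e.g. ['would', 'love', 'drive', 'mountainous', 'road', 'crete', '.']
# (which words survive depends on the exact stop-word list)
print(tokens)
```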
12. Grapheme/Token representation: One-Hot encoding
Define each dictionary word as a vector:
I’d love to drive … → (preprocessing) → would · love · drive
Dictionary: {drive, love, would}
drive → [1, 0, 0]
love → [0, 1, 0]
would → [0, 0, 1]
(A sketch of this encoding follows.)
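A minimal sketch of one-hot encoding over the toy three-word dictionary above (NumPy assumed):

```python
import numpy as np

dictionary = ["drive", "love", "would"]               # toy dictionary from the slide
word_to_index = {w: i for i, w in enumerate(dictionary)}

def one_hot(word):
    """|V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(dictionary))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("drive"))  # [1. 0. 0.]
print(one_hot("would"))  # [0. 0. 1.]
```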
13. Sentence representation: Bag of words
Sum of the one-hot encoded word vectors of the sentence:
I’d love to drive … → (preprocessing) → would · love · drive
Dictionary: {drive, love, would}, dictionary size = 3
Bag-of-words vector: [1, 1, 1]
If dictionary size >>> 1, the vector is very sparse: almost all entries are 0, with a few small counts (a sketch follows).
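A minimal sketch of the bag-of-words sum, reusing the one_hot() helper from the previous sketch:

```python
import numpy as np

tokens = ["would", "love", "drive"]     # "I'd love to drive" after pre-processing
bow = np.sum([one_hot(t) for t in tokens], axis=0)
print(bow)  # [1. 1. 1.] with the toy 3-word dictionary; mostly zeros for a real vocabulary
```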
16. Word order matters
• Context-dependent information
• The place of the word in the sentence matters
“My kindle is easy to use, I do not need help”
“I do need help, my kindle is not easy to use”
18. Word2vec: Efficient Estimation of Word Representations in Vector Space, Mikolov et al. 2013
Learn word embeddings:
Skip-gram: predict the context words given the center word
Continuous Bag of Words (CBOW): predict the center word given its context (see the sketch below)
CBOW model example: “… The cake is a lie …”
Context words at t-2 and t-1 (“The”, “cake”) and at t+1 and t+2 (“a”, “lie”); word to predict at t (“is”)
Estimate: P(w_t | w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})
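A minimal sketch of training CBOW and skip-gram embeddings with gensim (assuming gensim 4.x; the toy corpus below is only for illustration, not from the slides):

```python
from gensim.models import Word2Vec

corpus = [["the", "cake", "is", "a", "lie"],
          ["the", "cake", "is", "delicious"]]

# sg=0 -> CBOW: predict the center word from its context
# sg=1 -> skip-gram: predict context words from the center word
cbow = Word2Vec(corpus, vector_size=50, window=2, sg=0, min_count=1)
skipgram = Word2Vec(corpus, vector_size=50, window=2, sg=1, min_count=1)

print(cbow.wv["cake"].shape)         # (50,) -- the learned embedding for "cake"
print(cbow.wv.most_similar("cake"))  # nearest neighbours (meaningless on a toy corpus)
```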
22. Using Word Representations in Neural Networks
“Amazon is amazing” → indexing → 2910 · 79 · 1927 → lookup → W_2910, W_79, W_1927 → neural layers → output layer
The embedding matrix has |V| rows, W_1 … W_|V|, one per vocabulary word, each of dimension N.
{W_i} are the word embeddings. They are parameters that the neural network can update during training, and they can also be initialized from pre-trained embeddings (a lookup sketch follows).
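A minimal sketch of the indexing-and-lookup step with a trainable embedding layer in PyTorch; the vocabulary size and embedding dimension below are assumptions:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50000, 300        # |V| rows, each of dimension N
embedding = nn.Embedding(vocab_size, embed_dim)

# "Amazon is amazing" -> indices from a vocabulary lookup (values from the slide)
indices = torch.tensor([2910, 79, 1927])
vectors = embedding(indices)              # shape (3, 300): W_2910, W_79, W_1927

# The rows of embedding.weight are parameters: they receive gradients during
# training, and can also be initialized from pre-trained embeddings.
print(vectors.shape)
```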
25. Convolutional Neural Network for Text Classification
Source: Character-level Convolutional Networks for Text Classification, Zhang et al. 15
The character embeddings are convolved along the time (sequence) dimension (a minimal sketch follows).
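A minimal PyTorch sketch, not the Zhang et al. architecture, that only illustrates convolving character embeddings along the time dimension and classifying the pooled result:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, n_chars=70, embed_dim=16, n_filters=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=7)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, char_ids):                   # (batch, time)
        x = self.embed(char_ids).transpose(1, 2)   # (batch, embed_dim, time)
        x = torch.relu(self.conv(x))               # convolve over the time dimension
        x = x.max(dim=2).values                    # max-pool over time
        return self.fc(x)                          # class logits

logits = CharCNN()(torch.randint(0, 70, (8, 128)))  # 8 texts of 128 characters
print(logits.shape)                                  # torch.Size([8, 2])
```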
34. Limitations
• Rare words are not well represented, or are simply mapped to <UNK>
Partial solutions (see the character n-gram sketch below):
• fastText and the sum of subword embeddings
• Character n-grams
• Byte Pair Encoding (BPE)
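A minimal sketch of fastText-style character n-grams; the char_ngrams helper and the n-gram range are illustrative choices, not fastText's actual code:

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Decompose a word into character n-grams with '<'/'>' boundary markers.
    A rare word's vector can then be the sum of its n-gram vectors."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ...]
```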
35. Limitations
Polysemy: the meaning of a word such as “Java” or “Python” depends on the context:
• I love travelling. I am going to explore Java.
https://en.wikipedia.org/wiki/Java
36. Limitations
Context can be bidirectional:
I went to the bank, to drop off some money
39. ELMo Embeddings (Peters et al. 18)
Contextualized word embeddings.
Input embeddings come from a character CNN (Char-CNN); the token representations R_1, R_2, R_3, …, R_n are a learnt linear combination of the biLM hidden states.
Fine-tuning: the representations feed a task-specific neural network (a sketch of the layer combination follows).
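A minimal NumPy sketch of the learnt linear combination of hidden states; the layer count, dimensions, and the random values standing in for biLM outputs are placeholders:

```python
import numpy as np

n_layers, n_tokens, dim = 3, 5, 1024        # e.g. char-CNN layer + 2 biLSTM layers
hidden_states = np.random.randn(n_layers, n_tokens, dim)  # stand-in for biLM outputs

w = np.random.randn(n_layers)               # learned scalar weight per layer
s = np.exp(w) / np.exp(w).sum()             # softmax-normalized layer weights
gamma = 1.0                                 # learned task-specific scale

# Per-token ELMo vector: gamma * sum_j s_j * h_{t,j}
elmo = gamma * np.einsum("l,ltd->td", s, hidden_states)
print(elmo.shape)                           # (5, 1024)
```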
40. ELMo Embeddings (Peters et al. 18)
Pre-training on a language-model task: input sentence → ELMo → softmax layer → output (probabilities over V)
Training on your task: input sentence → ELMo → your task-specific NN → output
41. BERT (Devlin et al. 18)
Bidirectional Encoder Representations from Transformers (BERT)
42. BERT (Devlin et al. 18)
Inspired by “Improving Language Understanding by Generative Pre-Training”, Radford et al. 2018 (OpenAI GPT-1)
Based on Transformer and the multi-head self-attention model
Source: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
43. BERT (Devlin et al. 18)
Self-attention: “The apple is red, it is delicious”
Every token attends to every other token of the sentence, so a word like “it” can attend to “The apple” (a sketch follows).
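A minimal NumPy sketch of scaled dot-product self-attention (single head, random weights); this is not BERT's exact multi-head implementation, only the mechanism that lets "it" attend to "The apple":

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n_tokens, n_tokens) attention scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # contextualized vectors + attention map

n_tokens, dim = 8, 64                        # "The apple is red , it is delicious"
X = np.random.randn(n_tokens, dim)
Wq, Wk, Wv = (np.random.randn(dim, dim) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)                 # (8, 64) (8, 8)
```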
44. BERT (Devlin et al. 18)
BERT input representation: the sum of WordPiece embeddings, sentence (segment) embeddings, and position embeddings, all learned during the (pre-)training process.
In pre-training, 15% of the input tokens are masked (the [MASK] token, with embedding E_MASK) for the masked LM task (a sketch of the masking step follows).
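A minimal, purely illustrative sketch of the masked-LM corruption step; note that the real BERT recipe also keeps some selected tokens unchanged or replaces them with random words, which this sketch omits:

```python
import random

tokens = ["[CLS]", "the", "cake", "is", "a", "lie", "[SEP]"]
maskable = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
n_to_mask = max(1, round(0.15 * len(maskable)))   # roughly 15% of the tokens

masked, targets = list(tokens), {}
for i in random.sample(maskable, n_to_mask):
    targets[i] = masked[i]    # the label the model must recover at this position
    masked[i] = "[MASK]"

print(masked, targets)        # e.g. ['[CLS]', 'the', '[MASK]', ...] {2: 'cake'}
```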
47. BERT (Devlin et al. 18)
Training objectives in slightly modified BERT models for downstream tasks. (Image source: original paper)
Fine-tuning
For classification tasks: take the final hidden state of the [CLS] token, h^L_[CLS], and apply a small weight matrix W: softmax(h^L_[CLS] W) (a sketch follows).
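A minimal NumPy sketch of this classification head; the hidden size and number of classes are assumptions:

```python
import numpy as np

hidden_dim, n_classes = 768, 3
h_cls = np.random.randn(hidden_dim)            # h^L_[CLS]: final hidden state of [CLS]
W = np.random.randn(hidden_dim, n_classes)     # small task-specific weight matrix

logits = h_cls @ W
probs = np.exp(logits) / np.exp(logits).sum()  # softmax(h^L_[CLS] W)
print(probs)                                    # class probabilities summing to 1
```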
48. - No need for a custom neural network for fine-tuning
BERT (Devlin et al. 18)
49. XLNet (Yang et al. 19)
XLNet: Generalized Autoregressive Pretraining for Language
Understanding
Problems with BERT:
1. The [MASK] token used in pre-training never appears during fine-tuning
2. BERT predicts the masked tokens independently of one another
I went to [MASK] [MASK] and saw the [MASK] [MASK] [MASK].
50. XLNet (Yang et al. 19)
XLNet: Generalized Autoregressive Pretraining for Language
Understanding
Bidirectional context through prediction of tokens in randomly permuted factorization orders (a purely illustrative sketch follows)
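A minimal, purely illustrative sketch of the permutation idea: sample a random factorization order and predict each token conditioned only on the tokens already seen in that order (no model is trained here):

```python
import random

tokens = ["I", "went", "to", "New", "York"]
order = list(range(len(tokens)))
random.shuffle(order)                    # e.g. [3, 0, 4, 1, 2]

seen = []
for pos in order:
    context = [tokens[i] for i in seen]  # bidirectional context from the original sentence
    print(f"predict {tokens[pos]!r} at position {pos} given {context}")
    seen.append(pos)
```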
51. OpenAI GPT-2 (Radford et al. 19)
Language Models are Unsupervised Multitask Learners
Source: original paper
Trained on language model task
40GB corpus dataset
1.5B parameters
52. OpenAI GPT-2 (Radford et al. 19)
Language Models are Unsupervised Multitask Learners
All the downstream language tasks are framed as predicting conditional
probabilities and there is no task-specific fine-tuning.
Zero-shot learning:
Summarization:
P(w | “text to summarize” + “ TL;DR: <?>”)
Question Answering:
P(w | “text” + “Q: … A: … Q: … A: <?>”)
Machine Translation:
P(w | “I like computers = J’aime les ordinateurs; I live in Vancouver = <?>”)
(A prompting sketch follows.)
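A minimal sketch of zero-shot prompting through the Hugging Face transformers library (an assumption, not something used in the paper); the public 124M-parameter "gpt2" checkpoint will behave much worse than the 1.5B-parameter model described above:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Translation framed as conditional generation, following the slide's prompt format.
prompt = "I like computers = J'aime les ordinateurs; I live in Vancouver ="
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))
```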
56. Conclusion
Each step pairs a language representation with a form of task-specific adaptation:
- Count-based word representation (tf-idf)
- Learnt word representation (word2vec)
- Contextualized embeddings + custom network (ELMo)
- Sentence embeddings + fine-tuning (BERT, XLNet)
- Zero-shot transfer with a large language model (GPT-2)
57. References
Word embeddings:
LSA - Indexing by latent semantic analysis, Deerwester et al. 1990
Word2Vec - Efficient Estimation of Word Representations in Vector Space, Mikolov et al. 2013
GloVe - GloVe: Global Vectors for Word Representation. Pennington et al. 2014
Subword embeddings
CNN character embedding layer - Character-Aware Neural Language Models, Kim et al. 2015
FastText - Enriching Word Vectors with Subword Information, Bojanowski et al. 2017
WordPiece - Google’s NMT System: Bridging the Gap between Human and Machine Translation, Wu et al. 2016
Contextualized embeddings
ELMo - Deep contextualized word representations, Peters et al. 2018
CoVe - Learned in Translation: Contextualized Word Vectors, McCann et al. 2017
Pre-trained deep learning architecture
Transformer - Attention Is All You Need, Vaswani et al. 2017
OpenAI GPT - Improving Language Understanding by Generative Pre-Training, Radford et al. 2018
BERT - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al. 2018
OpenAI GPT-2 - Language Models are Unsupervised Multitask Learners, Radford et al. 2019
58. 8th March 2018
Thank you!
tdelteil@amazon.com
github.com/ThomasDelteil
twitter.com/thdelteil
Word2vec training uses a softmax cross-entropy loss, minimizing the negative log probability of the context word; it is an unsupervised learning process over large corpora.
BERT pre-training Task 1: masked language model (MLM). Task 2: next sentence prediction.
Note that the first token is always forced to be [CLS], a placeholder that is later used for prediction in downstream tasks.