Introduction to Transformers for NLP
where we are and how we got here

Olga Petrova
AI Product Manager
DataTalks.Club
Preliminaries

● Who I am:
○ Product Manager for AI PaaS at Scaleway
(Scaleway: European cloud provider, originating from France)
○ Machine Learning engineer at Scaleway
○ Quantum physicist at École Normale Supérieure, the Max Planck Institute, Johns Hopkins University

www.olgapaints.net
Outline
1. NLP 101: how do we represent words in machine learning
🤖 Tokenization 🤖 Word embeddings
2. Order matters: Recurrent Neural Networks
🤖 RNNs 🤖 Seq2Seq models
🤖 What is wrong with RNNs
3. Attention is all you need!
🤖 What is Attention 🤖 Attention, the linear algebra perspective
4. The Transformer architecture
🤖 Mostly Attention
5. BERT: contextualized word embeddings
Representing words

I play with my dog.

● Machine learning: inputs are vectors / matrices

ID     | Age | Gender | Weight | Diagnosis
264264 | 27  | 0      | 72     | 1
908696 | 61  | 1      | 80     | 0

● How do we convert “I play with my dog” into numerical values?
Representing words

I play with my dog.
I play with my cat.

● One hot encoding
● Dimensionality = vocabulary size

Vocabulary (7 tokens): 1 = I, 2 = play, 3 = with, 4 = my, 5 = dog, 6 = ., 7 = cat

I    = [1 0 0 0 0 0 0]
play = [0 1 0 0 0 0 0]
with = [0 0 1 0 0 0 0]
my   = [0 0 0 1 0 0 0]
dog  = [0 0 0 0 1 0 0]
.    = [0 0 0 0 0 1 0]
cat  = [0 0 0 0 0 0 1]
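For concreteness, here is a minimal sketch of building these one-hot vectors (not from the slides; Python and a toy vocabulary indexed in order of first appearance are assumptions):

```python
# A minimal sketch of one-hot encoding over a toy two-sentence corpus.
corpus = ["I play with my dog .".split(), "I play with my cat .".split()]

# Build the vocabulary: one integer index per unique token (0-based here).
vocab = {}
for sentence in corpus:
    for token in sentence:
        if token not in vocab:
            vocab[token] = len(vocab)

def one_hot(token):
    """Return a list of length len(vocab) with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

print(vocab)           # {'I': 0, 'play': 1, 'with': 2, 'my': 3, 'dog': 4, '.': 5, 'cat': 6}
print(one_hot("cat"))  # [0, 0, 0, 0, 0, 0, 1]
```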
Tokenization

I play / played / was playing with my dog.

Word = token: play, played, playing
Subword = token: play, #ed, #ing
E.g.: played = play + #ed

● In practice, subword tokenization is used most often
● For simplicity, we’ll go with word = token for the rest of the talk
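To see subword tokenization in practice, here is a hedged sketch using the HuggingFace WordPiece tokenizer for bert-base-uncased (the library and checkpoint are assumptions, not something the slides prescribe):

```python
# A sketch of subword tokenization with a pre-trained WordPiece tokenizer
# (assumes the HuggingFace `transformers` package and the bert-base-uncased checkpoint).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("I play with my dog."))
# Common words map to single tokens; rarer words are split into pieces prefixed
# with "##" (the exact split depends on the learned vocabulary), along the lines
# of played -> play + ##ed from the slide.
```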
Word embeddings

One hot encodings (N = vocabulary size): cat, dog, kitten and puppy are each N-dimensional vectors with a single 1.

● dog : puppy = cat : kitten. How can we record this?
● Goal: a lower-dimensional representation of a word (M << N)
● Words are forced to “group together” in a meaningful way

What do you mean, “meaningful way”? An example with M = 3, where the dimensions stand for feline (1), canine (1), and adult (1) / baby (0):

cat    = [1 0 1]
dog    = [0 1 1]
kitten = [1 0 0]
puppy  = [0 1 0]

dog - puppy = cat - kitten
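The analogy can be checked with simple vector arithmetic. A small sketch using the toy M = 3 vectors above (numpy assumed):

```python
# The toy M = 3 embeddings from the slide, checked with numpy.
# Dimensions: (feline, canine, adult).
import numpy as np

cat    = np.array([1, 0, 1])
dog    = np.array([0, 1, 1])
kitten = np.array([1, 0, 0])
puppy  = np.array([0, 1, 0])

# dog - puppy = cat - kitten: both differences isolate the "adult" direction.
print(dog - puppy)                                # [0 0 1]
print(cat - kitten)                               # [0 0 1]
print(np.array_equal(dog - puppy, cat - kitten))  # True
```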
Word embeddings

How do we arrive at this representation?
● Pass the one-hot vector through a network with a bottleneck layer
= multiply the N-dim vector by an N×M matrix (the embedding matrix W) to get an M-dim embedded vector
● Some information gets lost in the process
● Where does the embedding matrix come from? Through machine learning!

In practice:
● Embedding can be part of the model that you train for a ML task
● Or: use pre-trained embeddings (e.g. Word2Vec, GloVe, etc)
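A minimal sketch of the “one-hot vector times embedding matrix” view (toy sizes N = 7, M = 3; numpy assumed):

```python
# One-hot vector times embedding matrix = row lookup.
import numpy as np

N, M = 7, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(N, M))         # embedding matrix, normally learned by training

one_hot_dog = np.zeros(N)
one_hot_dog[4] = 1                  # "dog" is token index 4 in our toy vocabulary

embedded = one_hot_dog @ W          # N-dim one-hot times N x M matrix -> M-dim vector
print(np.allclose(embedded, W[4]))  # True: the product just selects row 4 of W
# Frameworks skip the multiplication and do the row lookup directly
# (e.g. torch.nn.Embedding in PyTorch).
```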
Further reading

Tokenization: blog.floydhub.com/tokenization-nlp/
Word embeddings: jalammar.github.io/illustrated-word2vec/
Sequence to Sequence models

● Not all seq2seq models are NLP
● Not all NLP models are seq2seq!
● But: they are common enough
● Language modeling, question answering, machine translation, etc

Sequence 1 → ML model → Sequence 2

Machine Translation: I am a student → je suis étudiant

Takeaways:
1. Multiple elements in a sequence
2. Order of the elements matters
RNNs: Recurrent Neural Networks

[Slide shows the Wikipedia definition of recurrent neural networks]
What does this mean???
RNNs: the seq2seq example

Inputs (word embeddings) → Encoder → Context vector (contains the meaning of the whole input sequence) → Decoder → Outputs (one-hot)

Temporal direction (remember the Wikipedia definition?): each step feeds its hidden state to the next one.
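To make the picture concrete, here is a minimal encoder-decoder sketch in PyTorch (the framework, GRU cells and layer sizes are assumptions; the slides do not prescribe an implementation):

```python
# A minimal encoder-decoder RNN sketch: the encoder compresses the input into a
# single context vector; the decoder unrolls from that context and produces
# scores over the output vocabulary at every step.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)   # inputs: word embeddings
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)       # outputs: scores over vocab

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.src_emb(src_ids))  # context: meaning of the input
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_states)                       # (batch, tgt_len, tgt_vocab)

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (1, 5))   # "I am a student ." as toy token ids
tgt = torch.randint(0, 1200, (1, 4))   # "je suis étudiant ." shifted right (toy)
print(model(src, tgt).shape)           # torch.Size([1, 4, 1200])
```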
Does this really work?

Yes! (For the most part.) Google Translate has been using these since ~2016.

What is the catch?
● Catch #1: Recurrent = non-parallelizable 🐌
● Catch #2: Relying on a fixed-length context vector to capture the full input sequence. What can go wrong?

“I am a student.” ✅
“It was a wrong number that started it, the telephone ringing three times in the dead of night, and the voice on the other end asking for someone he was not.” 😱
The Attention Mechanism

● Recurrent NN from the previous section
● Send all of the hidden states to the Decoder, not just the last one
● The Decoder determines which input elements get attended to via a softmax layer
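A sketch of that softmax step (dot-product scoring is assumed for simplicity; the scoring function varies between attention variants):

```python
# The decoder state scores every encoder hidden state, a softmax turns the
# scores into attention weights, and the weighted sum replaces the single
# fixed-length context vector.
import torch
import torch.nn.functional as F

hidden_dim, src_len = 64, 5
encoder_states = torch.randn(src_len, hidden_dim)  # one hidden state per input token
decoder_state = torch.randn(hidden_dim)            # current decoder hidden state

scores = encoder_states @ decoder_state            # (src_len,) similarity scores
weights = F.softmax(scores, dim=0)                 # which input tokens to attend to
context = weights @ encoder_states                 # (hidden_dim,) weighted sum

print(weights.sum())   # ~1.0: the attention weights form a distribution
```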
Further reading

jalammar.github.io/illustrated-transformer/

My blog post that this talk is based on + the linear algebra perspective on Attention:
towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
Attention is All You Need © Google, 2017

● Keep the Attention mechanism, throw away the sequential RNN
● Reminder: the basic idea of Attention was to pass all of the hidden states as inputs
● Transformer:
○ Decoder and Encoder both have multiple layers
○ The output hidden states of the Encoder are inputs to all the layers of the Decoder
○ The intermediate hidden states of the Encoder are also inputs… to other sequence elements’ Encoder layers! Self-Attention
○ Also Self-Attention in the Decoder
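A sketch of a single self-attention head along these lines (toy sizes and random weights; a simplification of the multi-head version in the paper):

```python
# Single-head self-attention: every position builds a query, key and value from
# its own representation and attends to every other position in the sequence.
import math
import torch

seq_len, d_model = 4, 32                 # toy sizes
x = torch.randn(seq_len, d_model)        # word embeddings + position encodings

W_q = torch.randn(d_model, d_model)      # learned projections (random here)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / math.sqrt(d_model)    # (seq_len, seq_len) pairwise scores
attn = torch.softmax(scores, dim=-1)     # each row: attention over all tokens
out = attn @ V                           # contextualized representations
print(out.shape)                         # torch.Size([4, 32])
```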
Attention all around

● Inputs: word embedding + position encoding
● All of the sequence elements (w1, w2, w3) are being processed in parallel
● Each element’s encoding gets information (through hidden states) from other elements
● Bidirectional by design 😍
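Since all positions are processed in parallel, word order has to be injected through the position encoding. A sketch of the sinusoidal encoding from the Transformer paper (note that BERT, discussed later, learns its position embeddings instead):

```python
# Sinusoidal position encoding: a fixed vector per position, added to the word
# embedding so the parallel model still knows where each token sits.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                         # (seq_len, 1)
    i = np.arange(d_model)[None, :]                           # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)  # frequency per dimension
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])                    # even dims: sine
    enc[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dims: cosine
    return enc

print(positional_encoding(seq_len=4, d_model=8).round(2))
```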
Why Self-Attention?

● “The animal did not cross the road because it was too tired.”
vs.
● “The animal did not cross the road because it was too wide.”

Uni-directional RNN would have missed this! Bi-directional is nice.
(Contextual) word embeddings

● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words
● Meaning often depends on the context (surrounding words)
● Self-Attention in Transformers processes context when encoding a token
● Can we just take the Transformer Encoder and use it to embed words?
(Contextual) word embeddings

BERT: Bidirectional Encoder Representations from Transformers

Instead of Word2Vec, use pre-trained BERT for your NLP tasks ;)
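A hedged sketch of doing exactly that with the HuggingFace Transformers library (the library and checkpoint are assumptions, not something the slides prescribe):

```python
# Contextual token embeddings from a pre-trained BERT.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I play with my dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dim vector per token (including the special [CLS] and [SEP] tokens),
# each conditioned on the whole sentence.
print(outputs.last_hidden_state.shape)   # e.g. torch.Size([1, 8, 768])
```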
What was BERT pre-trained on?

Task #1: Masked Language Modeling
● Take a text corpus, randomly mask 15% of the words
● Put a softmax layer (dim = vocab size) on top of the BERT encodings at the masked positions (which attend to the words that are present) to predict the missing words

Task #2: Next Sentence Prediction
● Input sequence: two sentences separated by a special token
● Binary classification: are these sentences successive or not?
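The masked-language-modeling objective can be seen in action with the HuggingFace fill-mask pipeline (again an assumption; the slides show no code):

```python
# Masked language modeling in action: BERT predicts a distribution over the
# vocabulary for the [MASK] position.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("I play with my [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```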
Further reading and coding

My blog posts:
towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
blog.scaleway.com/understanding-text-with-bert/

The HuggingFace 🤗 Transformers library: huggingface.co/
More Related Content

What's hot

Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
Arvind Devaraj
 
Introduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNNIntroduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNN
Hye-min Ahn
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
VenkateshMurugadas
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Fwdays
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
shaurya uppal
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
Hady Elsahar
 
Attention Is All You Need
Attention Is All You NeedAttention Is All You Need
Attention Is All You Need
Illia Polosukhin
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"
Fwdays
 
Attention Mechanism in Language Understanding and its Applications
Attention Mechanism in Language Understanding and its ApplicationsAttention Mechanism in Language Understanding and its Applications
Attention Mechanism in Language Understanding and its Applications
Artifacia
 
Transformers and BERT with SageMaker
Transformers and BERT with SageMakerTransformers and BERT with SageMaker
Transformers and BERT with SageMaker
Suman Debnath
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
Hanwha System / ICT
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentation
bhavesh_physics
 
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Edureka!
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
Tomer Lieber
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
Jeong-Gwan Lee
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
Robert Lujo
 
BERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersBERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from Transformers
Liangqun Lu
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
Roelof Pieters
 
Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptx
Deep Learning Italia
 
Bert
BertBert

What's hot (20)

Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
 
Introduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNNIntroduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNN
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
Attention Is All You Need
Attention Is All You NeedAttention Is All You Need
Attention Is All You Need
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"
 
Attention Mechanism in Language Understanding and its Applications
Attention Mechanism in Language Understanding and its ApplicationsAttention Mechanism in Language Understanding and its Applications
Attention Mechanism in Language Understanding and its Applications
 
Transformers and BERT with SageMaker
Transformers and BERT with SageMakerTransformers and BERT with SageMaker
Transformers and BERT with SageMaker
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentation
 
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
BERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersBERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from Transformers
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptx
 
Bert
BertBert
Bert
 

Similar to Introduction to Transformers for NLP - Olga Petrova

Kaggle Tweet Sentiment Extraction: 1st place solution
Kaggle Tweet Sentiment Extraction: 1st place solutionKaggle Tweet Sentiment Extraction: 1st place solution
Kaggle Tweet Sentiment Extraction: 1st place solution
ArtsemZhyvalkouski
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLP
Insoo Chung
 
Nlp and transformer (v3s)
Nlp and transformer (v3s)Nlp and transformer (v3s)
Nlp and transformer (v3s)
H K Yoon
 
BERT.pptx
BERT.pptxBERT.pptx
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
ParrotAI
 
Introduction to Transformers
Introduction to TransformersIntroduction to Transformers
Introduction to Transformers
Suman Debnath
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
Ted Xiao
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
Massimo Schenone
 
NeurIPS_2018_ConvAI2_ParticipantSlides.pptx
NeurIPS_2018_ConvAI2_ParticipantSlides.pptxNeurIPS_2018_ConvAI2_ParticipantSlides.pptx
NeurIPS_2018_ConvAI2_ParticipantSlides.pptx
KaiduTester
 
Week 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-OnWeek 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-On
SARCCOM
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
MENGSAYLOEM1
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Databricks
 
NLP Bootcamp
NLP BootcampNLP Bootcamp
NLP Bootcamp
Anuj Gupta
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLP
Anuj Gupta
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Saurabh Kaushik
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysis
odsc
 
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
Plain Concepts
 
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
MobileMonday Estonia
 
Recipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tastyRecipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tasty
PyData
 
Unit3-RNN (recurrent neural network).pptx
Unit3-RNN (recurrent neural network).pptxUnit3-RNN (recurrent neural network).pptx
Unit3-RNN (recurrent neural network).pptx
LikithaAk
 

Similar to Introduction to Transformers for NLP - Olga Petrova (20)

Kaggle Tweet Sentiment Extraction: 1st place solution
Kaggle Tweet Sentiment Extraction: 1st place solutionKaggle Tweet Sentiment Extraction: 1st place solution
Kaggle Tweet Sentiment Extraction: 1st place solution
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLP
 
Nlp and transformer (v3s)
Nlp and transformer (v3s)Nlp and transformer (v3s)
Nlp and transformer (v3s)
 
BERT.pptx
BERT.pptxBERT.pptx
BERT.pptx
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Introduction to Transformers
Introduction to TransformersIntroduction to Transformers
Introduction to Transformers
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 
NeurIPS_2018_ConvAI2_ParticipantSlides.pptx
NeurIPS_2018_ConvAI2_ParticipantSlides.pptxNeurIPS_2018_ConvAI2_ParticipantSlides.pptx
NeurIPS_2018_ConvAI2_ParticipantSlides.pptx
 
Week 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-OnWeek 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-On
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
 
NLP Bootcamp
NLP BootcampNLP Bootcamp
NLP Bootcamp
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLP
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysis
 
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
 
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
Scientists meet Entrepreneurs - AI & Machine Learning, Mark Fishel, Institute...
 
Recipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tastyRecipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tasty
 
Unit3-RNN (recurrent neural network).pptx
Unit3-RNN (recurrent neural network).pptxUnit3-RNN (recurrent neural network).pptx
Unit3-RNN (recurrent neural network).pptx
 

More from Alexey Grigorev

MLOps week 1 intro
MLOps week 1 introMLOps week 1 intro
MLOps week 1 intro
Alexey Grigorev
 
Codementor - Data Science at OLX
Codementor - Data Science at OLX Codementor - Data Science at OLX
Codementor - Data Science at OLX
Alexey Grigorev
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogs
Alexey Grigorev
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
Alexey Grigorev
 
AI in Fashion - Size & Fit - Nour Karessli
 AI in Fashion - Size & Fit - Nour Karessli AI in Fashion - Size & Fit - Nour Karessli
AI in Fashion - Size & Fit - Nour Karessli
Alexey Grigorev
 
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
Alexey Grigorev
 
ML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - Kubernetes
Alexey Grigorev
 
Paradoxes in Data Science
Paradoxes in Data ScienceParadoxes in Data Science
Paradoxes in Data Science
Alexey Grigorev
 
ML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learning
Alexey Grigorev
 
Algorithmic fairness
Algorithmic fairnessAlgorithmic fairness
Algorithmic fairness
Alexey Grigorev
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
Alexey Grigorev
 
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
Alexey Grigorev
 
ML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deployment
Alexey Grigorev
 
ML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for Classification
Alexey Grigorev
 
ML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for Classification
Alexey Grigorev
 
ML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office Hours
Alexey Grigorev
 
AMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplaces
Alexey Grigorev
 
ML Zoomcamp 2 - Slides
ML Zoomcamp 2 - SlidesML Zoomcamp 2 - Slides
ML Zoomcamp 2 - Slides
Alexey Grigorev
 
ML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction Project
Alexey Grigorev
 
ML Zoomcamp - Course Overview and Logistics
ML Zoomcamp - Course Overview and LogisticsML Zoomcamp - Course Overview and Logistics
ML Zoomcamp - Course Overview and Logistics
Alexey Grigorev
 

More from Alexey Grigorev (20)

MLOps week 1 intro
MLOps week 1 introMLOps week 1 intro
MLOps week 1 intro
 
Codementor - Data Science at OLX
Codementor - Data Science at OLX Codementor - Data Science at OLX
Codementor - Data Science at OLX
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogs
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
 
AI in Fashion - Size & Fit - Nour Karessli
 AI in Fashion - Size & Fit - Nour Karessli AI in Fashion - Size & Fit - Nour Karessli
AI in Fashion - Size & Fit - Nour Karessli
 
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
 
ML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - Kubernetes
 
Paradoxes in Data Science
Paradoxes in Data ScienceParadoxes in Data Science
Paradoxes in Data Science
 
ML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learning
 
Algorithmic fairness
Algorithmic fairnessAlgorithmic fairness
Algorithmic fairness
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
 
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
 
ML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deployment
 
ML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for Classification
 
ML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for Classification
 
ML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office Hours
 
AMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplaces
 
ML Zoomcamp 2 - Slides
ML Zoomcamp 2 - SlidesML Zoomcamp 2 - Slides
ML Zoomcamp 2 - Slides
 
ML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction Project
 
ML Zoomcamp - Course Overview and Logistics
ML Zoomcamp - Course Overview and LogisticsML Zoomcamp - Course Overview and Logistics
ML Zoomcamp - Course Overview and Logistics
 

Recently uploaded

How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
Col Mukteshwar Prasad
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
RaedMohamed3
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
Fundacja Rozwoju Społeczeństwa Przedsiębiorczego
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
kaushalkr1407
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
Celine George
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
Steve Thomason
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Fish and Chips - have they had their chips
Fish and Chips - have they had their chipsFish and Chips - have they had their chips
Fish and Chips - have they had their chips
GeoBlogs
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
EduSkills OECD
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
Nguyen Thanh Tu Collection
 

Recently uploaded (20)

How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Fish and Chips - have they had their chips
Fish and Chips - have they had their chipsFish and Chips - have they had their chips
Fish and Chips - have they had their chips
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 

Introduction to Transformers for NLP - Olga Petrova

  • 1. Introduction to Transformers for NLP where we are and how we got here Olga Petrova AI Product Manager DataTalks.Club
  • 2. Preliminaries ● Who I am: ○ Product Manager for AI PaaS at Scaleway Scaleway: European cloud provider, originating from France
  • 3. Preliminaries ● Who I am: ○ Product Manager for AI PaaS at Scaleway Scaleway: European cloud provider, originating from France ○ Machine Learning engineer at Scaleway
  • 4. Preliminaries ● Who I am: ○ Product Manager for AI PaaS at Scaleway Scaleway: European cloud provider, originating from France ○ Machine Learning engineer at Scaleway ○ Quantum physicist at École Normale Supérieure, the Max Planck Institute, Johns Hopkins University
  • 5. Preliminaries ● Who I am: ○ Product Manager for AI PaaS at Scaleway Scaleway: European cloud provider, originating from France ○ Machine Learning engineer at Scaleway ○ Quantum physicist at École Normale Supérieure, the Max Planck Institute, Johns Hopkins University www.olgapaints.net
  • 6. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 7. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 8. Representing words NLP 101 RNNs Attention Transformers BERT
  • 9. Representing words I play with my dog. NLP 101 RNNs Attention Transformers BERT
  • 10. Representing words I play with my dog. ● Machine learning: inputs are vectors / matrices NLP 101 RNNs Attention Transformers BERT
  • 11. Representing words I play with my dog. ● Machine learning: inputs are vectors / matrices ID Age Gender Weight Diagnosis 264264 27 0 72 1 908696 61 1 80 0 NLP 101 RNNs Attention Transformers BERT
  • 12. Representing words ID Age Gender Weight Diagnosis 264264 27 0 72 1 908696 61 1 80 0 I play with my dog. ● Machine learning: inputs are vectors / matrices ● How do we convert “I play with my dog” into numerical values? NLP 101 RNNs Attention Transformers BERT
  • 13. Representing words I play with my dog. I play with my dog . NLP 101 RNNs Attention Transformers BERT
  • 14. Representing words I play with my dog. I play with my dog . 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 ● One hot encoding ● Dimensionality = vocabulary size 1 0 0 0 0 0 1 2 3 4 5 6 NLP 101 RNNs Attention Transformers BERT
  • 15. Representing words I play with my dog. I play with my cat. I play with my dog . 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 ● One hot encoding ● Dimensionality = vocabulary size 1 2 3 4 5 6 NLP 101 RNNs Attention Transformers BERT
  • 16. Representing words I play with my dog. I play with my cat. I play with my dog . 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 ● One hot encoding ● Dimensionality = vocabulary size cat 0 0 0 0 0 0 1 1 2 3 4 5 6 7 NLP 101 RNNs Attention Transformers BERT
  • 17. play played playing I play / played / was playing with my dog. Word = token: , , Subword = token: , , E.g.: played = ; ● In practice, subword tokenization is used most often ● For simplicity, we’ll go with word = token for the rest of the talk Tokenization play #ed #ing play #ed NLP 101 RNNs Attention Transformers BERT
  • 18–23. Word embeddings
  One hot encodings: cat, dog, kitten, puppy are each N-dimensional vectors (N = vocabulary size) with a single 1.
  ● dog : puppy = cat : kitten. How can we record this?
  ● Goal: a lower-dimensional representation of a word (M << N)
  ● Words are forced to “group together” in a meaningful way
  What do you mean, “meaningful way”?
  Toy example with M = 3 and dimensions (1 for feline, 1 for canine, 1 for adult / 0 for baby):
  cat = [1, 0, 1], dog = [0, 1, 1], kitten = [1, 0, 0], puppy = [0, 1, 0]
  dog - puppy = cat - kitten: within each pair, the only difference is the adult/baby dimension.
  NLP 101 RNNs Attention Transformers BERT
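A quick check of that analogy in code, using the toy 3-dimensional vectors from the slide (these are illustrative values, not learned embeddings):

```python
import numpy as np

# Toy "embeddings" with dimensions (feline, canine, adult),
# matching the illustrative vectors above.
emb = {
    "cat":    np.array([1, 0, 1]),
    "dog":    np.array([0, 1, 1]),
    "kitten": np.array([1, 0, 0]),
    "puppy":  np.array([0, 1, 0]),
}

# dog : puppy = cat : kitten  <=>  dog - puppy == cat - kitten
print(emb["dog"] - emb["puppy"])   # [0 0 1]  (the "adult" direction)
print(emb["cat"] - emb["kitten"])  # [0 0 1]
```

Real embeddings (Word2Vec, GloVe, ...) are learned rather than hand-crafted, but the same kind of vector arithmetic is what “grouping together in a meaningful way” buys you.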
  • 28. Word embeddings
  How do we arrive at this representation?
  ● Pass the one-hot vector through a network with a bottleneck layer = multiply the N-dim vector by an N×M matrix to get an M-dim vector
  ● Some information gets lost in the process
  ● The embedding matrix itself is obtained through machine learning
  [Figure: one-hot vector × embedding matrix W → embedded vector]
  In practice:
  ● Embedding can be part of the model that you train for an ML task
  ● Or: use pre-trained embeddings (e.g. Word2Vec, GloVe, etc.)
  NLP 101 RNNs Attention Transformers BERT
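A small sketch of the “bottleneck = matrix multiply” point; the sizes are toy assumptions, and the PyTorch layer is shown only as one standard way this lookup is implemented.

```python
import torch

N, M = 7, 3          # vocabulary size and embedding dimension (toy values)
torch.manual_seed(0)

W = torch.randn(N, M)          # the N x M embedding matrix (learned in practice)
one_hot = torch.zeros(N)
one_hot[4] = 1.0               # one-hot vector for token id 4 ("dog" in the toy vocab)

# Multiplying the one-hot vector by W just selects row 4 of W:
print(one_hot @ W)
print(W[4])                    # the same M-dimensional embedded vector

# torch.nn.Embedding implements exactly this row lookup, skipping the multiply:
emb = torch.nn.Embedding(N, M)
print(emb(torch.tensor([4])))  # different values (its own random init), same idea
```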
  • 30. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 34. Sequence to Sequence models
  ● Not all seq2seq models are NLP
  ● Not all NLP models are seq2seq!
  ● But: they are common enough
  [ML model: Sequence 1 → Sequence 2]
  ● Language modeling, question answering, machine translation, etc.
  Machine Translation: “I am a student” → “je suis étudiant”
  Takeaways:
  1. Multiple elements in a sequence
  2. Order of the elements matters
  NLP 101 RNNs Attention Transformers BERT
  • 36. RNNs: Recurrent Neural Networks
  What does this mean? (The slide shows Wikipedia’s definition, roughly: a neural network in which connections between nodes form a directed graph along a temporal sequence.)
  NLP 101 RNNs Attention Transformers BERT
  • 37–57. RNNs: the seq2seq example
  [Animated diagram: an Encoder RNN reads the input sequence and a Decoder RNN produces the output sequence, one element at a time.]
  ● Inputs (word embeddings), Outputs (one-hot)
  ● Hidden states are passed along the temporal direction (remember the Wikipedia definition?)
  ● Context vector: the Encoder’s final hidden state, which contains the meaning of the whole input sequence and is handed to the Decoder
  NLP 101 RNNs Attention Transformers BERT
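A minimal sketch of this encoder–decoder pattern in PyTorch. The dimensions, the GRU cells, and the greedy decoding loop are illustrative assumptions, not the exact model from the slides; the point is that the decoder only ever sees the single fixed-length context vector.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 100, 120, 32, 64  # toy sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src_ids):
        _, h = self.rnn(self.embed(src_ids))
        return h  # final hidden state = fixed-length context vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)  # scores over the target vocabulary

    def forward(self, prev_ids, h):
        out, h = self.rnn(self.embed(prev_ids), h)
        return self.out(out), h

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (1, 5))    # "I am a student ." as toy token ids
context = encoder(src)                       # everything the decoder will ever see

token = torch.zeros(1, 1, dtype=torch.long)  # assume id 0 is a <start> token
hidden = context
for _ in range(6):                           # greedy decoding, one step at a time
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1)            # feed the prediction back in
    print(token.item())
```

Note that both loops over the sequence (inside the GRU and in the decoding loop) are inherently sequential, which is exactly Catch #1 on the next slide.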
  • 66. Does this really work?
  Yes! (For the most part.) Google Translate has been using these since ~2016.
  What is the catch?
  ● Catch #1: Recurrent = non-parallelizable 🐌
  ● Catch #2: Relying on a fixed-length context vector to capture the full input sequence. What can go wrong?
  “I am a student.” ✅
  “It was a wrong number that started it, the telephone ringing three times in the dead of night, and the voice on the other end asking for someone he was not.” 😱
  NLP 101 RNNs Attention Transformers BERT
  • 67. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 72. The Attention Mechanism
  [Recurrent NN from the previous section]
  ● Send all of the hidden states to the Decoder, not just the last one
  ● The Decoder determines which input elements get attended to via a softmax layer
  NLP 101 RNNs Attention Transformers BERT
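A hedged sketch of that attention step: the decoder scores every encoder hidden state, the softmax turns the scores into weights, and the weighted sum replaces the single fixed context vector. Dot-product scoring is one common choice (other scoring functions exist); the tensors here are random stand-ins.

```python
import torch
import torch.nn.functional as F

T, H = 5, 64                              # source length and hidden size (toy)
encoder_states = torch.randn(T, H)        # all hidden states, not just the last
decoder_state = torch.randn(H)            # current decoder hidden state

scores = encoder_states @ decoder_state   # one score per input position, shape (T,)
weights = F.softmax(scores, dim=0)        # "how much to attend" to each position
context = weights @ encoder_states        # weighted sum, shape (H,)

print(weights)        # non-negative, sums to 1.0
print(context.shape)  # torch.Size([64])
```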
  • 73. Further reading
  jalammar.github.io/illustrated-transformer/
  My blog post that this talk is based on + the linear algebra perspective on Attention:
  towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
  NLP 101 RNNs Attention Transformers BERT
  • 74. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 80. Attention is All You Need © Google, 2017
  ● Keep the Attention mechanism, throw away the sequential RNN
  ● Reminder: the basic idea of Attention was to pass all of the hidden states as inputs
  ● Transformer:
  ○ Decoder and Encoder both have multiple layers
  ○ The output hidden states of the Encoder are inputs to all the layers of the Decoder
  ○ The intermediate hidden states of the Encoder are also inputs… to other sequence elements’ Encoder layers! Self-Attention
  ○ Also Self-Attention in the Decoder
  NLP 101 RNNs Attention Transformers BERT
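A minimal sketch of single-head scaled dot-product self-attention, the core operation of the paper. The sizes and random inputs are toy assumptions; a real Transformer layer adds multiple heads, residual connections, layer norm and a feed-forward block on top of this.

```python
import math
import torch
import torch.nn.functional as F

T, D = 4, 16                          # sequence length, model dimension (toy)
x = torch.randn(T, D)                 # one embedding (+ position) per token

Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))  # learned in practice
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / math.sqrt(D)       # every token scores every other token
weights = F.softmax(scores, dim=-1)   # each row sums to 1
out = weights @ V                     # new, context-aware vector for each token

print(weights.shape)  # torch.Size([4, 4]) -- all pairs, computed in parallel
print(out.shape)      # torch.Size([4, 16])
```

There is no loop over positions here, which is what makes the Transformer parallelizable where the RNN was not.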
  • 82. Attention all around NLP 101 RNNs Attention Transformers BERT Word embedding + position encoding
  • 83. Attention all around
  ● All of the sequence elements (w1, w2, w3) are being processed in parallel
  ● Each element’s encoding gets information (through hidden states) from other elements
  ● Bidirectional by design 😍
  NLP 101 RNNs Attention Transformers BERT
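Because self-attention itself is order-agnostic, the “word embedding + position encoding” on the previous slide is what tells the model where each token sits. A sketch of the sinusoidal scheme from the original paper is below; learned position embeddings (as BERT uses) are a common alternative, and the sizes here are toy values.

```python
import torch

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (T, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even dims
    angle = pos / (10000 ** (i / d_model))                        # (T, d/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

embeddings = torch.randn(4, 16)                  # toy token embeddings
x = embeddings + positional_encoding(4, 16)      # what the encoder actually sees
print(x.shape)                                   # torch.Size([4, 16])
```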
  • 85. Why Self-Attention?
  ● “The animal did not cross the road because it was too tired.”
  vs.
  ● “The animal did not cross the road because it was too wide.”
  A uni-directional RNN would have missed this! Bi-directional is nice.
  NLP 101 RNNs Attention Transformers BERT
  • 87. Outline 1. NLP 101: how do we represent words in machine learning 🤖 Tokenization 🤖 Word embeddings 2. Order matters: Recurrent Neural Networks 🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs 3. Attention is all you need! 🤖 What is Attention 🤖 Attention, the linear algebra perspective 4. The Transformer architecture 🤖 Mostly Attention 5. BERT: contextualized word embeddings
  • 88–92. (Contextual) word embeddings
  ● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words
  ● Meaning often depends on the context (surrounding words)
  ● Self-Attention in Transformers processes context when encoding a token
  ● Can we just take the Transformer Encoder and use it to embed words?
  NLP 101 RNNs Attention Transformers BERT
  • 94. (Contextual) word embeddings
  BERT: Bidirectional Encoder Representations from Transformers
  Instead of Word2Vec, use pre-trained BERT for your NLP tasks ;)
  NLP 101 RNNs Attention Transformers BERT
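A hedged sketch of using pre-trained BERT as a contextual embedder with the Hugging Face Transformers library; the checkpoint name and example sentences are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited money at the bank.",
             "We had a picnic on the river bank."]

with torch.no_grad():
    inputs = tokenizer(sentences, return_tensors="pt", padding=True)
    hidden = model(**inputs).last_hidden_state   # (batch, tokens, hidden size)

# Unlike Word2Vec, the vector for "bank" differs between the two sentences,
# because self-attention folds the surrounding context into each token.
print(hidden.shape)
```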
  • 96. What was BERT pre-trained on?
  Task #1: Masked Language Modeling
  ● Take a text corpus, randomly mask 15% of words
  ● Put a softmax layer (dim = vocab size) on top of BERT encodings for the words that are present to predict which word is missing
  Task #2: Next Sentence Prediction
  ● Input sequence: two sentences separated by a special token
  ● Binary classification: are these sentences successive or not?
  NLP 101 RNNs Attention Transformers BERT
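To see the masked-language-modelling objective in action, the Hugging Face `fill-mask` pipeline wraps a pre-trained BERT plus the softmax head over the vocabulary described above; the checkpoint name and example sentence are assumptions for illustration.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT must predict the hidden word from both left and right context.
for prediction in fill_mask("I play with my [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```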
  • 97. Further reading and coding
  My blog posts:
  towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
  blog.scaleway.com/understanding-text-with-bert/
  The HuggingFace 🤗 Transformers library: huggingface.co/
  NLP 101 RNNs Attention Transformers BERT