Demystifying NLP Transformers: Understanding the Power and Architecture behind Natural Language Processing

Transformers
State Of The Art Natural Language Processing
Nilesh Verma
Full Stack Data Scientist
(Amlgo Labs India)

Agenda
• Recent Development on NLP
• Short History of NLP
• Why Transformers
• Transformer & Their Architecture
• Attention Mechanism
• Workings
• Types of Transformers
• Explain BERT
• Popular State of the Art language Model

Who I am
• I am Nilesh Verma
• Full Stack Data Scientist at Amlgo Labs India. Ex- Xceedance, Samsung AI
• Having 2+ Years of Industry Experience.
• AutoWave for Audio Classification, DeepImageSearch, and DeepTextSearch are some of the
interesting python libraries (open source contributions) that I developed and maintain.
• More than 30K-40K+ downloads.
• Secured 1st rank in The Great Indian Hiring Hackathon (Nov-20), based on Foretelling the Retail
Price, hosted by MachineHack.
• Placed 3rd in AppScript, a 48-hour hackathon conducted by IEEE APSIT on
6-7 Feb 2021.
• Cleared the NTA-NET and GATE exams on the first attempt.
• B.Sc. and M.Sc. in Computer Science (Gold Medalist)
• Covered by various state-level news outlets for developing software for real-time COVID-19
detection from CT scans.

Short History of NLP
• 1954 - Bag of Words (BoW)
• 1972 - TF-IDF
• 2001 - Neural language models (RNN, B-RNN, LSTM)
• 2008 - Multi-Task learning
• 2013 - Word embeddings (Word2Vec)
• 2013 - Neural networks for NLP
• 2014 - Sequence-to-sequence models (Encoder-Decoder)
• 2015 - Attention (for images, but found useful for text too)
• 2017 - Transformer
• 2018 - Pretrained language models (BERT, GPT, T5, etc.)

Why Transformer
• Improve Contextual Understanding
• Parallelization (Faster Processing/Utilization of GPU/TPU Power)

What is Transformer
The Transformer in NLP is a novel architecture that aims to solve sequence-to-
sequence tasks while handling long-range dependencies with ease. It relies entirely
on self-attention to compute representations of its input and
output WITHOUT using sequence-aligned RNNs or convolution.

Transformer Architecture Breakdown
• We see an encoding component, a decoding component, and
connections between them.

• The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s
nothing magical about the number six, one can definitely experiment with other arrangements). The
decoding component is a stack of decoders of the same number.

• The encoder’s inputs first flow through a self-attention layer
• The outputs of the self-attention layer are fed to a feed-forward neural network.
• The decoder has both those layers, but between them is an attention layer that helps the decoder focus on
relevant parts of the input sentence

Input Preprocessing
• Each word is embedded into a vector of size 512. We'll represent
those vectors with these simple boxes. The embedding only happens
in the bottom-most encoder.

Input Preprocessing
• To give the model a sense of the order of the words, we add positional encoding
vectors -- the values of which follow a specific pattern.
• Real example of positional encoding with a toy embedding size of 4

Input Preprocessing
Here “pos” refers to the position of the “word” in the sequence, and “d” means the size of the
word/token embedding. Finally, “i” refers to each of the individual dimensions of the
embedding (i.e. 0, 1, 2, 3, 4).
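A minimal sketch of the sinusoidal positional encoding, assuming the interleaved form from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The function name positional_encoding and the toy sizes are illustrative, not taken from the slides.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return one sinusoidal encoding vector of length d_model per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((k // 2) * 2 / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Toy embedding size of 4, as on the slide.
pe = positional_encoding(num_positions=3, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each row is added element-wise to the word embedding at the same position before the first encoder.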

Encoder
• The word at each position passes through a self-attention process. Then, they
each pass through a feed-forward neural network -- the exact same network with
each vector flowing through it separately.

Self-Attention
• Attention allowed us to focus on parts of our input sequence while we predicted
our output sequence
“Self attention, sometimes called intra-attention is an attention mechanism
relating different positions of a single sequence in order to compute a
representation of the sequence.”

Self-Attention in Detail
• The first step in calculating self-attention is to create a Query vector, a Key vector, and a Value vector. These vectors are
created by multiplying the embedding by three matrices that we trained during the training process.
• Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. This is an
architecture choice.
Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that
word. We end up creating a "query", a "key", and a "value" projection of each
word in the input sentence.
What are the “query”, “key”, and “value” vectors? They are abstractions that are useful for
calculating and thinking about attention.

• The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first
word in this example, “Thinking”. We need to score each word of the input sentence against this word.
The score is calculated by taking the dot product of the query vector with the key vector of the
respective word we’re scoring. So if we’re processing the self-attention for the word in position #1,
the first score would be the dot product of q1 and k1. The second score would be the dot product
of q1 and k2.

• The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the
paper, 64), then pass the result through a SoftMax operation. The division leads to more stable gradients; there could be
other possible values here, but this is the default. SoftMax normalizes the scores so they’re all positive and add up to 1.
This SoftMax score determines how much each word will be expressed at this position. Usually the
word at this position will have the highest SoftMax score, but sometimes it’s useful to attend to
another word that is relevant to the current word.

• The fifth step is to multiply each value vector by
the SoftMax score (in preparation to sum them up).
The intuition here is to keep intact the values of the
word(s) we want to focus on, and drown-out
irrelevant words (by multiplying them by tiny
numbers like 0.001, for example).
• The sixth step is to sum up the weighted value
vectors. This produces the output of the self-
attention layer at this position (for the first word).
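Steps two through six can be sketched for a single word as follows. This is a toy example under stated assumptions: the query, key, and value vectors are made-up numbers that are taken as already projected (step one), and the key size is 4 rather than the paper's 64, so the divisor is 2 instead of 8.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical projections for a two-word sentence.
q1 = [1.0, 0.0, 1.0, 0.0]                           # query of word 1
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # k1, k2
values = [[1.0, 2.0], [3.0, 4.0]]                    # v1, v2

d_k = len(q1)
scores = [dot(q1, k) for k in keys]                  # step 2: q1.k1, q1.k2
scaled = [s / math.sqrt(d_k) for s in scores]        # step 3: divide by sqrt(d_k)
weights = softmax(scaled)                            # step 4: SoftMax
weighted = [[w * x for x in v] for w, v in zip(weights, values)]  # step 5
z1 = [sum(column) for column in zip(*weighted)]      # step 6: sum the vectors

print(weights, z1)
```

Here word 1 attends mostly to itself because q1 and k1 point in the same direction, while k2 is orthogonal to q1.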

Matrix Calculation of Self-Attention
The first step is to calculate the Query, Key, and Value
matrices. We do that by packing our embeddings into a
matrix X, and multiplying it by the weight matrices we’ve
trained (WQ, WK, WV).
Finally, since we’re dealing with matrices, we can
condense steps two through six in one formula to
calculate the outputs of the self-attention layer.
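The condensed formula is Z = softmax(Q K^T / sqrt(d_k)) V. A minimal pure-Python sketch follows, assuming small made-up matrices for X, WQ, WK and WV; the helper names matmul, transpose, softmax_rows and self_attention are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply SoftMax independently to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v):
    # Step 1: pack the embeddings into X and project to Q, K, V.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Steps 2-6 in one expression: softmax(Q K^T / sqrt(d_k)) V.
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)

# Two words with a toy embedding size of 3; all weights are made up.
x = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
w_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

z = self_attention(x, w_q, w_k, w_v)
print(z)
```

Each row of z is the self-attention output for one word, a weighted average of the rows of V.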

The Beast With Many Heads
• The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves
the performance of the attention layer in two ways:
• It expands the model’s ability to focus on different positions.
• It gives the attention layer multiple “representation subspaces”. With multi-headed attention we have not one but multiple
sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each
encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input
embeddings (or vectors from lower encoders/decoders) into a different representation subspace.
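A rough sketch of multi-headed attention, assuming two heads instead of eight and random, untrained weights. The output matrix w_o, which mixes the concatenated heads back to the model size, comes from the original paper rather than from these slides; all function names are illustrative.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    weights = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    # Each head has its own (WQ, WK, WV) set, i.e. its own subspace.
    head_outputs = [attention_head(x, w_q, w_k, w_v) for w_q, w_k, w_v in heads]
    # Concatenate the head outputs word by word, then mix them with w_o.
    concat = [sum((out[i] for out in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
d_model, num_heads, d_head = 8, 2, 4   # toy sizes (the paper uses 512, 8, 64)
x = random_matrix(3, d_model, rng)     # three words
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
w_o = random_matrix(num_heads * d_head, d_model, rng)

z = multi_head_attention(x, heads, w_o)
print(len(z), len(z[0]))
```

The output has the same shape as the input, so it can flow on to the feed-forward network.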

The Beast With Many Heads
As we encode the word "it", one attention
head is focusing most on "the animal", while
another is focusing on "tired" -- in a sense, the
model's representation of the word "it" bakes
in some of the representation of both "animal"
and "tired".

The Residuals
• One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-
layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-
normalization step.
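The "add and normalize" step can be sketched as LayerNorm(x + Sublayer(x)). This is a simplified illustration: the learned gain and bias that real layer normalization carries are left out, and the sub-layer is a stand-in lambda rather than real self-attention or an FFNN.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one vector to zero mean and (almost) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    residual_sum = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual_sum)

x = [1.0, 2.0, 3.0, 4.0]
# Hypothetical sub-layer: halves every component.
out = add_and_norm(x, lambda v: [0.5 * e for e in v])
print([round(v, 3) for v in out])
```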

Layer Normalization
If we’re to visualize the vectors and the layer-norm
operation associated with self attention, it would look like
this:

Final Linear and SoftMax Layer
1. The decoder stack outputs a vector of floats.
How do we turn that into a word? That’s the job
of the final Linear layer, which is followed by a
SoftMax layer.
2. The Linear layer is a simple fully connected
neural network that projects the vector
produced by the stack of decoders, into a
much, much larger vector called a logits vector.
3. Let’s assume that our model knows 10,000
unique English words (our model’s “output
vocabulary”) that it’s learned from its training
dataset. This would make the logits vector
10,000 cells wide – each cell corresponding to
the score of a unique word. That is how we
interpret the output of the model followed by
the Linear layer.
4. The SoftMax layer then turns those scores
into probabilities (all positive, all add up to
1.0). The cell with the highest probability is
chosen, and the word associated with it is
produced as the output for this time step.
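The four points above can be sketched with a five-word vocabulary instead of 10,000. The decoder output and the Linear layer weights are made-up numbers chosen only for illustration, and picking the highest-probability cell is the simple greedy choice described on the slide.

```python
import math

def softmax(scores):
    """Turn logits into probabilities (all positive, summing to 1)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["i", "am", "a", "student", "<eos>"]   # toy output vocabulary
decoder_output = [0.2, -1.0, 0.5]              # hypothetical decoder vector

# Linear layer: one column of weights per vocabulary word.
weights = [
    [1.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 0.0],
]
bias = [0.0] * len(vocab)

# Logits vector: one score per word in the vocabulary.
logits = [
    sum(h * weights[i][j] for i, h in enumerate(decoder_output)) + bias[j]
    for j in range(len(vocab))
]
probs = softmax(logits)

# Pick the cell with the highest probability for this time step.
best = max(range(len(vocab)), key=lambda j: probs[j])
word = vocab[best]
print(word, round(probs[best], 3))
```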

Combined All
• This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked
encoders and decoders, it would look something like this:

Transformers are everywhere!
• Transformer models are used to solve all kinds of NLP tasks.
1. Feature Extraction (Get The Vector Representation Of A Text)
2. Fill-mask (Masked Word Prediction)
3. NER (Named Entity Recognition)
4. Question-Answering
5. Sentiment-Analysis
6. Summarization
7. Text-Generation
8. Translation
9. Zero-Shot-Classification
• The companies and organizations using Transformer models

A bit of Transformer history
Here are some reference points in the (short) history of Transformer models:

A bit of Transformer history
The Transformer architecture was introduced in June 2017. The focus of the original research was
on translation tasks. This was followed by the introduction of several influential models, including:
• June 2018: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks
and obtained state-of-the-art results
• October 2018: BERT, another large pretrained model, this one designed to produce better
summaries of sentences
• February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly
released due to ethical concerns
• October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory,
and still retains 97% of BERT’s performance
• October 2019: BART and T5, two large pretrained models using the same architecture as the
original Transformer model (the first to do so)
• May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of
tasks without the need for fine-tuning (called zero-shot learning)

Types of Transformers
This list is far from comprehensive, and is just meant to highlight a few
of the different kinds of Transformer models. Broadly, they can be
grouped into three categories:
Model           | Examples                                   | Tasks
Encoder         | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering
Decoder         | CTRL, GPT, GPT-2, Transformer XL           | Text generation
Encoder-decoder | BART, T5, Marian, mBART                    | Summarization, translation, generative question answering

Transformers are language models
• All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as
language models. This means they have been trained on large amounts of raw text in a self-
supervised fashion. Self-supervised learning is a type of training in which the objective is
automatically computed from the inputs of the model. That means that humans are not needed
to label the data!
• This type of model develops a statistical understanding of the language it has been trained on,
but it’s not very useful for specific practical tasks. Because of this, the general pretrained model
then goes through a process called transfer learning. During this process, the model is fine-tuned
in a supervised way — that is, using human-annotated labels — on a given task.

BERT (Bidirectional Encoder Representations
from Transformers)
• BERT is a Natural Language Processing Model proposed by
researchers at Google Research in 2018.
• Individual NLP tasks have traditionally been solved by individual
models created for each specific task. That is, until BERT!
• BERT revolutionized the NLP space by solving for 11+ of the most
common NLP tasks (and better than previous models) making it the
jack of all NLP trades.
Fun Fact 😁: You interact with NLP (and likely BERT) almost every single day!

Example of BERT
• BERT has helped Google better surface (English) results for nearly all searches since
November 2020.
• Here’s an example of how BERT helps Google better understand specific searches
like:

BERT’s Architecture
Model      | Transformer Layers | Hidden Size | Attention Heads | Parameters | Processing | Length of Training
BERT-base  | 12                 | 768         | 12              | 110M       | 4 TPUs     | 4 days
BERT-large | 24                 | 1024        | 16              | 340M       | 16 TPUs    | 4 days

How does BERT Work?
1. Large amounts of training data:
• A massive dataset of 3.3 Billion words has contributed to BERT’s continued success.
• BERT was specifically trained on Wikipedia (~2.5B words) and Google’s Books-Corpus (~800M words). These
large informational datasets contributed to BERT’s deep knowledge not only of the English language but also
of our world! 🚀

How does BERT Work?
2. Masked Language Model:
• MLM enables/enforces bidirectional learning from text by masking (hiding) a word in a sentence and forcing
BERT to bidirectionally use the words on either side of the covered word to predict the masked word. This
had never been done before!
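A tiny sketch of how one masked training example could be built. The helper mask_word and the whitespace tokenization are illustrative assumptions; BERT's actual procedure picks the positions to mask at random.

```python
def mask_word(tokens, position, mask_symbol="[MASK]"):
    """Hide one token and return the masked sentence plus the word to predict."""
    masked = list(tokens)          # copy so the input stays untouched
    target = masked[position]
    masked[position] = mask_symbol
    return masked, target

tokens = "the animal did not cross the street".split()
masked, target = mask_word(tokens, position=1)

# The model sees words on BOTH sides of [MASK] and must predict the target.
print(" ".join(masked), "->", target)
```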

How does BERT Work?
3. Next Sentence Prediction:
• NSP (Next Sentence Prediction) is used to help BERT learn about relationships between sentences by
predicting if a given sentence follows the previous sentence or not.

Training Inputs
1. We give inputs to BERT using the above structure. The input consists of a pair of sentences, called
sequences, and two special tokens: [CLS] and [SEP].
2. BERT first uses WordPiece tokenization to convert the sequence into tokens, then adds the [CLS]
token at the start and a [SEP] token at the end of each sentence (that is, between the two sentences
and after the second one).
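The layout of a sentence pair can be sketched as below. Whitespace splitting stands in for WordPiece tokenization here, and build_pair_input is a hypothetical helper, not part of any library.

```python
def build_pair_input(sentence_a, sentence_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP]."""
    # Assumption: simple lowercase whitespace split instead of real WordPiece.
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    return ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

tokens = build_pair_input("My dog barks", "He is loud")
print(tokens)
```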

Training Inputs
Token Embeddings: Token embeddings are obtained by indexing a matrix of size 30000 x 768 (H). Here, 30000 is
the vocabulary size after WordPiece tokenization. The weights of this matrix are learned during
training.

Training Inputs
Segment Embeddings: For tasks such as question answering, we should specify which segment each
token is from. Every token gets a segment ID, 0 if it is from sentence 1 or 1 if it is from sentence 2,
and that ID selects one of two learned vectors of length H.
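A small sketch of looking up and combining the token and segment vectors for a sentence pair. The hidden size of 4, the eight-word vocabulary, and the random values in token_table and segment_table are placeholders; position embeddings, which BERT also adds, are left out for brevity.

```python
import random

H = 4  # toy hidden size (BERT-base uses 768)
vocab = {"[CLS]": 0, "[SEP]": 1, "my": 2, "dog": 3,
         "barks": 4, "he": 5, "is": 6, "loud": 7}

rng = random.Random(0)
# One row per vocabulary entry (the real matrix is 30000 x 768).
token_table = [[rng.uniform(-1.0, 1.0) for _ in range(H)] for _ in vocab]
# One row per segment: index 0 for sentence 1, index 1 for sentence 2.
segment_table = [[rng.uniform(-1.0, 1.0) for _ in range(H)] for _ in range(2)]

def embed(tokens, segment_ids):
    """Sum the token vector and the segment vector for every input token."""
    vectors = []
    for token, seg in zip(tokens, segment_ids):
        tok_vec = token_table[vocab[token]]   # index by token ID
        seg_vec = segment_table[seg]          # index by segment ID
        vectors.append([t + s for t, s in zip(tok_vec, seg_vec)])
    return vectors

tokens = ["[CLS]", "my", "dog", "barks", "[SEP]", "he", "is", "loud", "[SEP]"]
segment_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1]
embedded = embed(tokens, segment_ids)
print(len(embedded), len(embedded[0]))
```

The two [SEP] tokens share a token vector but end up with different inputs because they belong to different segments.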

Training Output
For question answering, we define two vectors S and E (which will be learned
during fine-tuning), both having shape (1x768). We then
take a dot product of these vectors with the second
sentence’s output vectors from BERT, giving us some
scores. We then apply SoftMax over these scores to get
probabilities. The training objective is the sum of the
log-likelihoods of the correct start and end positions.
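The scoring described above can be sketched with a hidden size of 3 instead of 768. The output vectors, the values of S and E, and the true span (1, 2) are all made up for illustration; in practice the negative of the objective is minimized as the loss.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical BERT output vectors for the four tokens of the second sentence.
token_outputs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.5],
]
S = [0.0, 2.0, 0.0]   # start vector (learned during fine-tuning)
E = [0.0, 0.0, 2.0]   # end vector (learned during fine-tuning)

# Dot products give one score per token, SoftMax turns them into probabilities.
start_probs = softmax([dot(S, t) for t in token_outputs])
end_probs = softmax([dot(E, t) for t in token_outputs])

# Assumed gold answer span: tokens 1..2.
true_start, true_end = 1, 2
objective = math.log(start_probs[true_start]) + math.log(end_probs[true_end])
loss = -objective

pred_start = start_probs.index(max(start_probs))
pred_end = end_probs.index(max(end_probs))
print(pred_start, pred_end, round(loss, 3))
```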

Pre-Training
“What is language? What is context?”

Fine-Tuning
“How to use language for specific task?”

GLUE Benchmark
• GLUE (General Language Understanding Evaluation) benchmark is a group of resources for
training, measuring, and analyzing language models comparatively to one another. These
resources consist of nine “difficult” tasks designed to test an NLP model’s understanding.

GPT (Generative Pre-trained Transformer)
• The OpenAI GPT model was proposed in Improving Language Understanding by Generative
Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It’s a causal
(unidirectional) transformer pre-trained using language modelling on a large corpus with long
range dependencies, the Toronto Book Corpus.

T5 (Text-To-Text Transfer Transformer)
• T5, or Text-to-Text Transfer Transformer, is a Transformer based architecture that uses a
text-to-text approach. Every task – including translation, question answering, and
classification – is cast as feeding the model text as input and training it to generate some
target text. T5 is pre-trained on text extracted from the Common Crawl web corpus.

References
• https://jalammar.github.io/illustrated-transformer/
• https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/
• https://towardsdatascience.com/transformers-89034557de14
• https://www.youtube.com/watch?v=TQQlZhbC5ps&t=60s
• https://arxiv.org/abs/1706.03762
