2. What is Word Embedding?
• Natural language processing (NLP) models do not work with plain text, so a numerical
representation of words is required.
• Word embedding is a class of techniques in which a word is represented as a real-valued
vector.
• It is a representation of a word in a continuous vector space.
• It is a dense representation in a vector space.
• It needs far fewer dimensions than a sparse representation like one-hot encoding.
• Most word embedding methods are based on the “distributional hypothesis” of Zellig
Harris.
3. What is word embedding? continued
• The Distributional Hypothesis states that words that occur in the same contexts tend to have
similar meanings. (Harris, 1954)
• Word embeddings are designed to capture similarity between words along dimensions such as
meaning, morphology, and context.
• The captured relationships help us work on downstream NLP tasks like chat-bots, text
summarization, information retrieval, etc.
• Embeddings are generated using co-occurrence matrices, dimensionality reduction, and neural networks.
• They can be broadly categorized into two parts: frequency-based embeddings and prediction-
based embeddings.
• The earliest work to give a vector representation was the vector space model used in
information retrieval tasks.
4. Vector space model
• A document was represented in a vector space.
• The dimensionality of the vector space equals the number of unique words in the corpus.
[Figure: a hypothetical corpus with three words (Term 1, Term 2, Term 3) as dimensions; Doc 1, Doc 2
and Doc 3 are projected into this vector space according to their term frequencies]

        Term 1   Term 2   Term 3
Doc 1      0        5        5
Doc 2      2        0        1
Doc 3      3        3        0
5. Vector space model continued
• Each document gets a numerical vector representation in a vector space whose dimensions are words.
• E.g.
• Doc 1 -> [0, 5, 5]
• Doc 2 -> [2, 0, 1]
• This representation is sparse because, in a real-life scenario, the vocabulary of a corpus
runs into millions of words.
• It is based on term frequency.
• TF-IDF normalization is applied to reduce the weightage of frequent words like ‘the’, ‘are’, etc.
• One-hot encoding is a similar technique to represent a sentence/document in vector space.
• This representation gathers limited information and fails to capture the context of a word; a
sketch of building such sparse vectors follows below.
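• A minimal sketch, assuming scikit-learn (>= 1.0) is installed, of building sparse count and TF-IDF vectors for a toy corpus that mimics the three documents above; the exact TF-IDF weighting differs slightly from raw term frequency:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for Doc 1..3 above
docs = [
    "term2 term2 term2 term2 term2 term3 term3 term3 term3 term3",
    "term1 term1 term3",
    "term1 term1 term1 term2 term2 term2",
]

# Raw term-frequency vectors (sparse, dimensionality = vocabulary size)
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())   # ['term1', 'term2', 'term3']
print(X_counts.toarray())                  # [[0 5 5], [2 0 1], [3 3 0]]

# TF-IDF re-weights very frequent words (like 'the', 'are') downwards
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)
print(X_tfidf.toarray().round(2))
```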
6. Co-occurrence matrix
• It is applied to capture the neighbouring words that appear around the word under
consideration. A context window is used to calculate co-occurrence.
• E.g.:
• India won the match. I like the match.
• Co-occurrence matrix for the above two sentences with a context window of 1:
          India   won   the   match   I   like
India       1      1     0      0     0    0
won         1      1     1      0     0    0
the         0      1     1      1     0    1
match       0      0     1      1     0    0
I           0      0     0      0     1    1
like        0      0     1      0     1    1
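• A minimal sketch of building a word-word co-occurrence matrix with a context window of 1 for the two example sentences; depending on whether self-co-occurrence and sentence boundaries are counted, the numbers may differ slightly from the table above:

```python
from collections import defaultdict

sentences = [["india", "won", "the", "match"], ["i", "like", "the", "match"]]
window = 1

cooc = defaultdict(lambda: defaultdict(int))
for sent in sentences:
    for i, word in enumerate(sent):
        # Count neighbours within the context window on both sides
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[word][sent[j]] += 1

vocab = sorted({w for s in sentences for w in s})
print("\t" + "\t".join(vocab))
for w in vocab:
    print(w + "\t" + "\t".join(str(cooc[w][c]) for c in vocab))
```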
7. Co-occurrence matrix continued
• Representations like one-hot encoding, count-based methods and co-occurrence-matrix-
based methods are very sparse in nature.
• The context is either limited or absent altogether.
• There is a single representation for a word in every context.
• Relations between two words, such as semantic reasoning (analogies), are not possible with these
representations.
• The context is limited and predetermined.
• Long-term dependencies are not captured.
8. Prediction based word embeddings
• It is a method to learn a dense representation of a word from a very high-dimensional
representation.
• It is a modular representation, where a sparse vector is fed in to generate a dense
representation.
[Diagram: a one-hot encoded word, e.g. India = [0, 1, 0, ..., 0], is fed into a word embedding
model, which outputs a dense vector V(India) = [0.1, 2.3, -2.1, ..., 0.1]]
9. Language modelling
• Word embedding models are very closely related to language modelling.
• Language modelling tries to learn a probability distribution over the words in a vocabulary (V).
• The prime task of a language model is to calculate the probability of a word W_i given the previous (n-1)
words, mathematically P(W_i | W_{i-1}, ..., W_{i-n+1}).
• Probabilities over n-grams are calculated from the frequencies of the constituent n-grams.
• In a neural network we achieve the same using a softmax layer.
• We compute a score for W_i and normalize it by the sum of the exponentiated scores over all the words:
$$P(W_i \mid W_{i-1}, \dots, W_{i-n+1}) = \frac{\exp(h^{T} V'_{W_i})}{\sum_{W_j \in V} \exp(h^{T} V'_{W_j})}$$
• In this case, h is the representation from the hidden layer and V'_{W_i} is the output embedding of the word W_i.
• The inner product h^T V'_{W_i} generates the unnormalized log probability of word W_i.
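• A small numpy sketch of the softmax step above: given a hidden representation h and an output embedding matrix V' (one row per vocabulary word, both random placeholders here), the probability of each word is the normalized exponential of the inner products:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 8, 4

h = rng.normal(size=hidden_dim)                    # hidden-layer representation
V_out = rng.normal(size=(vocab_size, hidden_dim))  # output embeddings V'_w, one row per word

logits = V_out @ h                                 # h^T V'_w for every word w
probs = np.exp(logits - logits.max())              # subtract the max for numerical stability
probs /= probs.sum()

print(probs, probs.sum())                          # a valid distribution over the vocabulary
```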
10. Classical Neural language model
• It was proposed by Bengio et al., 2003
• It consists of a one-hidden-layer feed-forward neural
network that predicts the next word in a sequence.
• The model tries to maximize the probability
as computed by softmax.
• Bengio et al. introduced three concepts:
• Embedding layer: a layer that generates
word embeddings by multiplying an index
vector with a word embedding matrix.
11. Classical Neural language model continued
• Intermediate layers: one or more layers that produce an intermediate representation of
the input, e.g. a fully-connected layer that applies a non-linearity to the concatenation of the
word embeddings of the n previous words.
• Softmax layer: the final layer that produces a probability distribution over the words in V.
• The intermediate layer can be replaced with an LSTM.
• The network has a computational bottleneck due to the softmax layer, in which a
probability over the whole vocabulary needs to be computed.
• Neural word embedding models made significant progress with the Word2vec model
proposed by Mikolov et al. in 2013.
12. Word2Vec
• It was proposed by Mikolov et al. in 2013.
• It is a two-layer shallow neural network trained to learn contextual relationships.
• It places contextually similar words near each other.
• It is based on word co-occurrence within a context window.
• Two variants of the model were proposed:
• Continuous bag of words model (CBOW)
• Given the context words, predict the center word.
• The order of the context words is not considered, so this representation is similar to BOW.
• Skip-gram model
• Given the center word, predict the context words.
13. What does context mean?
• Context is the co-occurrence of words; it is defined by a sliding window around the word under
consideration.
• Example sentence: “India is now inching towards a self reliant state”
[Figure: the sentence is repeated once for each position of the sliding window; the yellow patch marks
the word under consideration and the orange box marks its context window of size 2]
14. CBOW continued
• Goal: predict the center word, given the context words.
[Diagram: one-hot vectors of the context words W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2} are each multiplied
by a projection matrix P of shape V x D (to be learned); the resulting vectors are averaged into a
context vector C; C is multiplied by an output projection matrix M of dimension D x V; a softmax
layer over C.M is compared against the one-hot vector of W_t using the cross-entropy loss]
15. CBOW continued
• One-hot encodings of the context words W_{t-2}, ..., W_{t+2} are the input to the model.
• A projection matrix of shape V x D, where V is the total number of unique words in the corpus and
D is the dimension of the dense representation, projects each one-hot encoded vector into a D-
dimensional vector.
• The averaged context vector is projected back into the V-dimensional space. A softmax layer converts
this representation into probabilities for W_t.
• The model is trained using the cross-entropy loss between the softmax output and the
one-hot encoded representation of W_t, as in the sketch below.
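• A minimal numpy sketch of one CBOW forward pass and its cross-entropy loss, with randomly initialised projection matrices P (V x D) and M (D x V); the indices and dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 10, 5                              # vocabulary size and embedding dimension
P = rng.normal(scale=0.1, size=(V, D))    # input projection matrix (to be learned)
M = rng.normal(scale=0.1, size=(D, V))    # output projection matrix (to be learned)

context_ids = [2, 3, 5, 6]                # indices of W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2}
center_id = 4                             # index of W_t

C = P[context_ids].mean(axis=0)           # average of the projected context vectors (D,)
logits = C @ M                            # project back to vocabulary space (V,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over the vocabulary

loss = -np.log(probs[center_id])          # cross-entropy against the one-hot vector of W_t
print(loss)
```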
16. Skip-gram model: High level
• Goal: predict the context words W_{t-2}, ..., W_{t+2} given the center word W_t.
[Diagram: the one-hot vector of W_t is multiplied by a projection matrix P of shape V x D (to be
learned), giving the center vector C; C is multiplied by an output projection matrix M of dimension
D x V; a softmax layer over C.M is compared against the one-hot vectors of the context words using
the cross-entropy loss]
17. Skip-gram continued
• An end-to-end flow of training (this depiction is taken from Manning's lectures on YouTube):
[Diagram: the one-hot vector of W_t (V x 1) is multiplied by the transposed embedding matrix (D x V),
which selects the center-word vector v_c (D x 1), e.g. [0.2, 0.1, 0.4, 0.8, 0.2]; the context matrix u^T
(V x D, shared across all context-word predictions) is multiplied by v_c to give the scores u_o^T v_c
(V x 1); softmax(u_o^T v_c) gives a probability distribution over the vocabulary, which is compared
against the one-hot ground-truth vector of a context word such as W_{t-1}]
18. Skip-gram continued
• It focuses on optimizing the loss for each word:
$$P(o \mid c) = \frac{\exp(u_o^{T} v_c)}{\sum_{w=1}^{V} \exp(u_w^{T} v_c)}$$
• It calculates the probability of an output context word o given the center word c.
• The loss function it tries to minimize is:
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(W_{t+j} \mid W_t)$$
• The log probability is calculated as depicted in the first equation.
• Naive training is costly because the gradient calculation is of order V (the vocabulary size).
• Two computationally efficient methods were proposed:
• Hierarchical softmax
• Negative sampling
19. Skip-gram continued
• Training with negative sampling is more prevalent.
• In the earlier example, tuples like (India, the) and (India, now) are examples of true cases.
• Any corrupted tuple, like (India, reliant) or (India, state), is called a negative sample.
• This process, with a modified objective function, turns training into a logistic regression that
classifies a tuple as a true combination or a corrupt one, as the sketch below illustrates.
• Corrupt tuples are generated by sampling such that less frequent words are picked
more often than their raw frequency would suggest.
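• A small numpy sketch of the skip-gram negative-sampling objective: the true pair (center, context) is pushed towards label 1 and a few sampled corrupt pairs towards label 0; all vectors here are random placeholders standing in for learned embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
D = 5
v_c = rng.normal(size=D)            # center-word vector, e.g. 'India'
u_pos = rng.normal(size=D)          # true context word, e.g. 'the'
u_neg = rng.normal(size=(3, D))     # sampled corrupt words, e.g. 'reliant', 'state', ...

# Negative-sampling loss: -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)
loss = -np.log(sigmoid(u_pos @ v_c)) - np.log(sigmoid(-(u_neg @ v_c))).sum()
print(loss)
```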
20. Word Embedding visualization
Top 5 similar words:

crude:    barrel 0.548, oteiba 0.464, netback 0.45, refinery 0.438, pipeline 0.421
ship:     vessel 0.623, port 0.575, tanker 0.496, navigation 0.471, crane 0.463
computer: software 0.602, micro 0.559, printer 0.542, mainframe 0.538, hemdale 0.527
• Even with a smaller corpus it can capture semantically relevant words.
t-SNE 2D projection of Word2vec (gensim implementation) embeddings of the top 10 similar words, trained for 50 epochs on the Reuters news
corpus from NLTK, with context length 15 and vector dimension 100
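• A sketch, assuming gensim >= 4.0 and NLTK's Reuters corpus are installed, of roughly the training setup described in the caption (50 epochs, window 15, 100-dimensional vectors); exact neighbours and scores will vary with seeds and preprocessing:

```python
import nltk
from nltk.corpus import reuters
from gensim.models import Word2Vec

nltk.download("reuters")
sentences = [[w.lower() for w in sent] for sent in reuters.sents()]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # dimension of the dense representation
    window=15,         # context length
    sg=1,              # skip-gram (use sg=0 for CBOW)
    negative=5,        # negative sampling
    epochs=50,
)

print(model.wv.most_similar("crude", topn=5))
print(model.wv.most_similar("ship", topn=5))
```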
22. Analogies
• Representation of an analogy in vector space using word2vec vectors:
• The vector representation of “King − man + woman” is roughly equivalent to the vector
representation of “queen”.
• Using gensim and pretrained word2vec vectors, the 5 closest words to the analogy vector
“King − man + woman” are:
queen 0.7118, monarch 0.619, princess 0.5902, crown_prince 0.55, prince 0.54
Image taken from https://jalammar.github.io/illustrated-word2vec/
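• A sketch of reproducing the analogy above with gensim and the pretrained GoogleNews word2vec vectors (the local file name is an assumption; adjust it to wherever the Google model from the link on slide 30 was downloaded):

```python
from gensim.models import KeyedVectors

# Hypothetical local path to the pretrained GoogleNews binary
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# vector("king") - vector("man") + vector("woman") ~ vector("queen")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
```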
23. GloVe: Global Vectors for Word Representation
• It is an unsupervised learning method to learn word representations.
• It is based on a co-occurrence matrix.
• The co-occurrence matrix is built on the whole corpus.
• It is able to capture global context.
• It combines the best of two model families:
• Local context window methods
• Global matrix factorization
• Earlier, matrix factorization methods like LSA were used to reduce the dimensionality.
• Two things that the GloVe model captures:
• A statistical measure, using the co-occurrence matrix
• Context, by considering the neighbouring words
24. GloVe continued
• It moves away from plain matrix factorization: by considering relationship reasoning
(semantic and syntactic), GloVe tries to learn the representations of words.
• It can be represented as:
[Diagram: word co-occurrence matrix (words x context) ≈ word feature matrix, i.e. the embedding
matrix (words x features), multiplied by a feature-context matrix (features x context)]
25. GloVe continued
• How does GloVe learn embeddings?
• It treats word-word co-occurrence probabilities as the potential of a relation between words.
• The authors presented a relation with “steam” and “ice” as target words.
• It is common for “steam” to co-occur with “gas” and for “ice” to co-occur with “solid”.
• Other co-occurring words are “water” and “fashion”: “water” shares some properties with both,
while “fashion” is irrelevant to both.
• Only the ratio of probabilities cancels out noisy words like “water” and “fashion”.
• As presented in the table of the original paper, the ratio P(k|ice) / P(k|steam) is
large for k = solid and small for k = gas.
26. GloVe continued
• What is the optimization function for GloVe?
• In a co-occurrence matrix X, the entry X_ij represents the co-occurrence count of words i and j.
• X_i is the total number of times any word appears in the context of word i.
• P_ij = P(j|i) = X_ij / X_i is the probability that word j appears in the context of word i.
• For a combination of three words i, j, k, a general representation of the model is
$$F(W_i, W_j, \tilde{W}_k) = \frac{P_{ik}}{P_{jk}}$$
• The optimization function proposed by the authors is:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( W_i^{T} \tilde{W}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}$$
27. GloVe continued
• Here V is the size of the vocabulary, W_i and b_i are the vector and bias of word i, and
W̃_j and b̃_j are the context vector and its bias. The last term is the log of the
co-occurrence count of j in the context of i.
• The function f(X) should have the following properties:
• It tends to zero when X → 0.
• It should be non-decreasing, so that rare co-occurrences are not overweighted.
• It should not overweight frequent co-occurrences.
• The choice of f(X) is
$$f(X) = \begin{cases} (X/X_{max})^{\alpha} & \text{if } X < X_{max} \\ 1 & \text{otherwise} \end{cases}$$
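• A numpy sketch of the weighting function f(X) and the weighted least-squares objective above, applied to a small random co-occurrence matrix; X_max = 100 and alpha = 3/4 follow the values reported in the GloVe paper, and all matrices here are toy placeholders:

```python
import numpy as np

def f(X, x_max=100.0, alpha=0.75):
    # (X / X_max)^alpha if X < X_max, otherwise 1
    return np.where(X < x_max, (X / x_max) ** alpha, 1.0)

rng = np.random.default_rng(3)
V, D = 6, 4
X = rng.integers(0, 50, size=(V, V)).astype(float)   # toy co-occurrence counts
W = rng.normal(scale=0.1, size=(V, D))                # word vectors W_i
W_tilde = rng.normal(scale=0.1, size=(V, D))          # context vectors W~_j
b = np.zeros(V)                                       # word biases
b_tilde = np.zeros(V)                                 # context biases

# J = sum_ij f(X_ij) * (W_i . W~_j + b_i + b~_j - log X_ij)^2, over non-zero counts
mask = X > 0
residual = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - np.log(np.where(mask, X, 1.0))
J = np.sum(f(X) * mask * residual ** 2)
print(J)
```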
28. GloVe continued
• The model has the following computational bottlenecks:
• Creating a big co-occurrence matrix of size V x V.
• The model's computational complexity depends on the number of non-zero elements.
• During training the context window needs to be sufficiently large so that the model can
distinguish left context from right context.
• Words which are more distant from each other contribute less to the count, because distant
words contribute less to the relationship between the words.
• The model generates two sets of vectors, W and W̃. An average of both is used as the
representation of words.
29. GloVe results
t-SNE 2D projection of GloVe embeddings of the top 5 similar words, trained for 50 epochs on the Reuters news corpus from
NLTK, with context length 15 and vector dimension 100
Top 5 most relevant word list:

crude:    barrel 0.752, posting 0.58, raise 0.537, light 0.505, sour 0.502
ship:     loading 0.58, kuwaiti 0.54, missile 0.537, vessel 0.522, flag 0.522
computer: wallace 0.595, software 0.592, microfilm 0.559, microchip 0.536, technology 0.52
30. Are Word2vec and GloVe enough?
• Neither embedding can deal with out-of-vocabulary words.
• Both can capture context, but only in a limited sense.
• They always produce a single embedding for the word in consideration.
• They can't distinguish:
• “I went to a bank.” and “I was standing at a river bank.”
• They will always produce a single representation for both contexts.
• Both give better performance than encodings like TF-IDF, count vectors, etc.
• Do pretrained models help the case?
• Models pretrained on a huge corpus show better performance than those trained on a small corpus.
• Pretrained models for Word2vec2 are available from Google, and for GloVe1 on Stanford's
website.
1. https://nlp.stanford.edu/projects/glove/
2. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
31. Fasttext
• It was proposed by the Facebook AI team.
• It was primarily meant to handle the out-of-vocabulary issue of GloVe and Word2vec.
• It is an extension of Word2vec.
• This model relies on character n-grams rather than whole words to generate the embeddings.
• This model relies on the morphological features of a word.
• The character n-grams of a word can be represented as below:
• For the word <where> and n = 3, the character n-grams are:
• <wh, whe, her, ere, re>
• The final representation of the word “where” is the sum of the vector representations of <wh,
whe, her, ere, re> (see the sketch below).
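• A tiny sketch of fastText-style character n-gram extraction for the example above; the original model also keeps the whole padded word (e.g. <where>) as an extra token, which is included here as well:

```python
def char_ngrams(word, n=3):
    # Add boundary symbols, then slide a window of length n over the padded word
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    # fastText additionally keeps the full padded word as a special token
    return grams + [padded]

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```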
32. Fasttext continued
• The modified scoring function is
$$s(w, c) = \sum_{g \in G_w} z_g^{T} v_c$$
• where G_w is the set of n-grams of word w, z_g is the vector representation of n-gram g, and
v_c is the context-word vector.
• The n-gram vector learning enables the model to learn representations for out-of-
vocabulary words as well.
33. Fasttext results
t-SNE 2D projection of fastText embeddings (gensim implementation) of the top 15 similar words, trained for 50 epochs on the
Reuters news corpus from NLTK, with context length 15 and vector dimension 100
Top 5 most relevant word list:

crude:    cruz 0.582, barrel 0.561, cruise 0.501, crumble 0.433, jude 0.41
ship:     shipyard 0.714, steamship 0.703, shipowner 0.688, shipper 0.668, vessel 0.667
computer: supercomputer 0.843, computerized 0.823, computerland 0.773, software 0.54, microfilm 0.52
34. Observation
• None of these representations can capture contextual meaning, i.e. a representation that depends
on how the word is used.
• They are based on a dictionary-style look-up to get the embeddings.
• They show limited performance on tasks such as question answering and summarization compared to
current state-of-the-art models like ELMo (LSTM based), BERT (transformer based), etc.
35. ELMo
• Language is complex; the meaning of a word can vary from context to context.
• E.g.
• I went to a bank to deposit some money.
• I was standing at a river bank.
• Both instances of “bank” have separate meanings.
• Earlier models assign the same representation to the word in each scenario.
• A solution would be to have multiple levels of understanding of the text.
• A new model which can capture context:
• ELMo: Embeddings from Language Models
36. ELMo continued
• It is based on a deep learning framework.
• It generates a contextualized embedding for a word.
• It models complex characteristics of words (e.g. syntactic and semantic features).
• It models linguistic variation across contexts, such as polysemy.
• The authors argue that it captures abstract linguistic characteristics in the higher
layers.
• It is based on a bi-directional LSTM model.
• The bi-directional model helps to capture dependencies on both the past words and the future words.
37. ELMo Architecture
• Block diagram of ELMo:
[Diagram: the word embedding x_i feeds a stack of L layers of forward and backward LSTMs, topped
by a softmax layer]
• The number of layers in the original implementation was two.
• The word embedding is calculated by a char-CNN.
• Final embeddings are generated as a weighted sum of the hidden layers and the embedding layer.
• Three different representations can be obtained:
• Hidden layer 1
• Hidden layer 2
• Weighted sum of the hidden layers and the embedding layer
• Each block is covered in the coming slides.
38. ELMo input Embedding
• The input embedding is generated by a combination of a character CNN and a highway
network.
[Diagram: the word “India” passes through a character embedding layer, a CNN with max pooling,
and a 2-layer highway network before entering the LSTM]
https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo/
39. Character CNN embedding and Highway network
• In the first step, a look-up is performed to get character embeddings.
• A 1-D convolution is applied on the embeddings, followed by a max pooling layer.
• The highway network acts as a gate that determines how much of the original information
passes directly to the output and how much goes via the projection.
• The final output is passed as input to the 2-layer LSTM structure.
• In the original paper there were two highway layers.
• Character-level embedding enables it to learn a representation for any word, so it can handle
out-of-vocabulary words as well.
Source: http://web.stanford.edu/class/cs224n/
40. LSTM layer and Embedding
• The architecture has bi-directional LSTMs to predict the next word from both sides; together they form a biLM.
The full embedding method is depicted in the figure.
[Diagram: the token representation x_k feeds forward and backward LSTM stacks; the hidden states
h^LM_{k,0} (embedding layer), h^LM_{k,1} and h^LM_{k,2} are concatenated per layer]
• Two separate LSTM stacks implement a language model in each direction.
• The forward LM's top layer predicts the next token using the softmax layer.
• Similarly, the backward LM predicts the previous token using its softmax scores.
• Each of the forward L layers of the LSTM generates a contextualized representation
$\overrightarrow{h}^{LM}_{k,j}$ for the word $t_k$, where $j = 1, 2, \dots, L$.
41. LSTM layer and Embedding continued
• Similarly, each of the backward L layers of the LSTM generates a contextualized
representation $\overleftarrow{h}^{LM}_{k,j}$ for the word $t_k$, where $j = 1, 2, \dots, L$.
• In total, 2L + 1 representations are generated: 2L by the hidden layers and one by the
embedding layer.
• The final representation is a weighted combination of the concatenated hidden vectors
and the embedding layer.
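• A numpy sketch of collapsing the 2L + 1 biLM representations into a single task-specific ELMo vector via softmax-normalised layer weights s_j and a scale gamma; all values here are random placeholders standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
L, dim = 2, 6                              # number of LSTM layers and (concatenated) layer width

# 2L + 1 representations for one token: embedding layer + L concatenated biLSTM layers
h = rng.normal(size=(L + 1, dim))          # h[0] ~ [x_k; x_k], h[1..L] ~ [forward; backward]

s = rng.normal(size=L + 1)                 # task-specific layer weights (learned in practice)
s = np.exp(s) / np.exp(s).sum()            # softmax-normalised
gamma = 1.0                                # task-specific scale (learned in practice)

elmo_vector = gamma * (s[:, None] * h).sum(axis=0)
print(elmo_vector.shape)                   # (6,)
```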
42. ELMo continued
• Representation of the word ‘bank’ in different contexts. Vectors are projected into 2D space
based on the vector representation of the word ‘bank’.
• It can generate a different embedding for a word depending on the context.
• The projection is based on the average of the hidden layers' and embedding layer's
representations from the pretrained model on TensorFlow Hub.
• It can be tuned to perform different tasks like coreference resolution, sentiment analysis,
and question answering.
43. ELMo continued
• From the results presented by the authors, higher layers tend to capture semantic features
and lower layers capture syntactic features.
• The second layer's embeddings outperform the first layer's embeddings on the word sense
disambiguation task, which is semantic in nature. On the other hand, in the POS tagging
task, the first layer's embeddings outperform the second layer's embeddings.
44. Bidirectional Encoder Representations from
Transformers (BERT)
• A transformer-based model to learn contextual representations.
• It is designed to pre-train deep bidirectional representations from unlabeled text by
considering both left and right context.
• It is a pre-trained model.
• It uses the concept of transfer learning, similar to what ELMo does.
• Any use of BERT is based on a two-step process:
• Train a large language model on a huge amount of data, using an unsupervised or semi-supervised
method.
• Fine-tune the large model for a specific NLP task.
• Before we delve further into BERT, the following concepts need to be understood:
• Attention mechanism
• Transformer architecture proposed by Vaswani et al.
45. Attention mechanism
• This concept was brought into NLP from computer vision tasks.
• The first use of an attention mechanism was by Bahdanau et al. in 2015. It was based on
an additive mechanism.
• It is similar to the way we process an image in our brain: we focus on some parts and infer
the other parts based on that information. In an image, not all regions carry equally useful
information.
• In sentence processing as well, we attend to the relevance between some of the words while
others get low attention.
[Example: in “I was going to a crowded market.”, some word pairs receive a high level of attention
while others receive a low level of attention]
46. Attention mechanism
• In the above example, we tend to attend to ‘market’ in the context of ‘crowded’, and to ‘going’
in the context of ‘market’, while other word pairs get less attention.
• The attention mechanism in NLP was first employed in the neural machine translation task.
• Seq-to-seq (encoder-decoder) models have an issue with longer sentences: they fail to
remember relations between distant words.
• The attention mechanism was designed to capture long-distance relationships between two
words.
• The attention mechanism can be understood as a vector of importance weights.
• The next slide presents the basics of the attention mechanism.
47. Attention mechanism continued
[Diagram: a query is scored against Key 1 ... Key 4, producing Score 1 ... Score 4; a softmax over the
scores gives weights a_1 ... a_4, which multiply the value vectors v_1 ... v_4; the attention value is the
weighted sum Σ a_i v_i]
• The intuition is like a query fired by us to extract some information from a database: the query
is matched against each key and generates a similarity score.
• The query and each key_i and value_i are vectors of some dimension d; each score_i is a scalar.
• The scoring function can be of different types:
• Simple dot product: q^T K_i
• Scaled dot product: q^T K_i / √d
• General dot product: q^T W K_i
• Additive: W_q^T q + W_k^T K_i
• The weights a_i are calculated using a softmax operation over the scores, and the final attention
value is Σ a_i v_i (see the sketch below).
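• A numpy sketch of scaled dot-product self-attention for one head: the same (random, hypothetical) token matrix plays the roles of queries, keys and values, the scores are scaled by √d and softmax-normalised, and the output is the weighted sum of the value vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(5)
seq_len, d = 4, 8
X = rng.normal(size=(seq_len, d))      # token representations

Q, K, V = X, X, X                      # self-attention: the same input plays all three roles
scores = Q @ K.T / np.sqrt(d)          # scaled dot-product scores
weights = softmax(scores, axis=-1)     # attention filter: each row sums to 1
attention_values = weights @ V         # weighted sum of the value vectors

print(weights.round(2))
print(attention_values.shape)          # (4, 8)
```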
48. Attention mechanism continued
• The final attention value is a vector, while each a_i and score_i is a scalar.
• The general framework of the attention mechanism generates a weighted sum of the value
vectors.
• Self-attention is also known as intra-attention.
• It is an attention mechanism that relates different words of a single sequence to generate a
representation of that same sequence.
• With this basic understanding of the attention mechanism, we can move to the transformer
architecture proposed by Vaswani et al.
49. Transformer model
Image source: https://arxiv.org/pdf/1706.03762.pdf
• The Transformer architecture was proposed by the Google AI
team.
• It is an encoder-decoder architecture.
• The core modules of this architecture are:
• Multi-head attention
• Positional encoding
• Attention mechanism
• Masked multi-head attention
• Residual connections
• This model solves the issues with recurrent networks:
• Failure to capture long-distance word-to-word
relations
• RNNs are sequential in nature; this architecture
can be parallelized.
• Each head of attention learns a different set of
features.
• This model does away with recurrence.
50. Input embedding and positional encoding
• Embeddings are collected from some pre-trained model using a dictionary look-up.
• Unlike an RNN, which takes input sequentially, it takes the whole sentence as input.
• Without positional information, it would be similar to a bag-of-words model.
• How is positional information calculated?
• The positional encoding is calculated using an alternating combination of sine and cosine:
$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right) \quad\text{and}\quad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
• Why do they use periodic functions with varying frequency?
• This encoding method generates an entirely different encoding for each position, and the
distance between two time steps is consistent.
• The position should be deterministic.
• The authors note that this encoding ensures that any PE_{pos+k} can be represented as a
linear function of PE_{pos}.
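• A numpy sketch of the sinusoidal positional encoding above (positions along rows, embedding dimensions along columns); the sequence length and model dimension are hypothetical:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)  # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                     # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)                                     # (50, 16)
```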
51. Input embedding and positional encoding
[Diagram: each embedding vector e_1, e_2, e_3 is added to its positional vector p_1, p_2, p_3 to give
the position-aware embeddings e_p1, e_p2, e_p3]
The embedding vector is combined with the positional vector, and the resulting embedding is fed into the attention layer. In this way, positional
information is combined with the word embeddings.
53. Multi-Head attention
[Diagram of one attention head: Query, Key and Value each pass through a Linear layer; MatMul of Q
and K, then Scale, then Softmax, then MatMul with V; the outputs of the heads are concatenated and
passed through a final Linear layer]
• A linear layer maps input to output and can also change the vector dimension. Its weights are
learned in the training process.
• In the original paper, the 512-dimensional input was projected to a 64-dimensional vector
space.
• What do we feed into query, key and value?
• At the start of training, the same copy of the word vectors is passed as
Query (Q), Key (K) and Value (V).
• The MatMul operation generates a matrix that is a precursor to the attention filter. The values
are scaled by √d.
• Finally, the softmax layer generates the probabilities across all the words.
• The result is a matrix that is called the “attention filter”.
54. Multi-Head attention
• The probability matrix is multiplied with V, which generates a representation based on the
attention scores.
• This whole process is a self-attention mechanism that generates an encoding using the scaled
dot-product scoring function.
• The above explanation corresponds to a single head. There can be many such heads; the
original model has 8 heads.
• Each head's attention filter learns different features.
• The representations generated by the several heads are concatenated and passed to the next
stacked encoder.
• The final representation from the encoder stack is fed into the decoder stack.
55. Add and Norm layer
• Residual connections are placed in the model for:
• Knowledge preservation
• Handling of the vanishing gradient problem.
• In this model, the normalization is applied per token (layer normalization):
• The normalization mean and standard deviation are calculated for each word's representation.
• Using the calculated mean and standard deviation, the layer's values are normalized.
56. Masked multi-head attention
• It is an important feature of the decoder stack.
• Why do we need masking?
• When we are generating some output, it should not pay attention to future words, because the future words
have not been predicted yet. So, those words are masked.
• Masking takes place by setting the scores of future words to −∞. This ensures that the softmax, which uses an
exponent, becomes zero for those words.
• Does this model have recurrence?
• At a cursory glance, it appears so: in the decoder, we feed the previous token as an input to the model.
• But it is trained using a concept called teacher forcing.
• This means that when the output is known, we can directly supply the output representation to the model.
• By doing so, the model can be parallelized as well.
• The base model has 8 heads, with 6 layers in each of the encoder and decoder stacks; keys and values are of
dimension 64.
57. Did the Transformer change the landscape of NLP?
• One of the best-known models, GPT, was based on the Transformer architecture.
• Current models like BERT and its variants, and GPT-2, are based on the Transformer architecture.
• It started to take over the space of RNNs because it can be parallelized and can capture
context.
• It beat the benchmarks on the NMT task.
58. Getting back to BERT
• Proposed by Devlin et al. in 2019.
• It is an encoder-based model.
• It is based on the transformer architecture that we discussed in the previous slides.
• It has only the encoder stack from the transformer architecture.
• It is an unsupervised or semi-supervised pre-trained model, fine-tuned for a specific task
like Q&A, conversational AI, etc.
• It is a sub-word model with a vocabulary of about 30,000 tokens. The BERT tokenizer tokenizes
the words, so the representation of a tokenized word may not correspond directly to what we
passed as input.
• e.g.: the word “embeddings” is tokenized into [‘em’, ‘##bed’, ‘##ding’, ‘##s’]
• This approach helps to address out-of-vocabulary words as well.
• The two most common pre-trained architectures are BERT_base and BERT_large.
• BERT_base has 12 stacked encoders and BERT_large has 24 stacked encoders.
59. BERT Continued
• The base version has 12 attention heads and the ‘large’ version has 16 attention heads.
• (For comparison, the original Transformer configuration had 6 encoder layers and 8 attention
heads, leading to a 512-dimensional representation.)
• How do we collect the representation from BERT?
• BERT uses two special tokens, [CLS] and [SEP].
• [CLS] will always be the first token of the input.
• [SEP] marks the sentence segmentation.
• We need to provide the segment ids as well in the input.
• Similar to the original transformer model, the encoded embeddings are passed to the
subsequent encoders.
• Each position outputs a vector representation for its token, of size 768 for BERT_base
and 1024 for the large model (see the tokenizer sketch below).
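• A sketch, assuming the HuggingFace transformers library is installed, of how the BERT tokenizer splits “embeddings” into sub-words and adds the [CLS] and [SEP] special tokens; the example sentence is hypothetical:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "Here is the sentence I want embeddings for."
encoding = tokenizer(text)

print(tokenizer.tokenize("embeddings"))                         # ['em', '##bed', '##ding', '##s']
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))   # ['[CLS]', 'here', ..., '[SEP]']
print(encoding["token_type_ids"])                               # segment ids (all 0 for one sentence)
```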
60. BERT Continued
• For a classification task, we only focus on the embedding of the [CLS] token.
• What would the representation be for other tasks?
• There are several variants of embedding collection.
• BERT_base, with 12 encoder layers, generates 12 + 1 sets of embeddings, one of them from the
input layer.
• Which layers should we use: all or some?
• Several experiments have been performed; the best performance is obtained by
concatenating the last 4 layers' representations.
• The next best representation is an average of the last 4 layers' representations.
• Each layer learns different features, so the pooling strategy depends on the
specific NLP task. The above two suggestions are based on performance on the NER
tagging task (a sketch of collecting the hidden states follows below).
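• A sketch, assuming transformers and PyTorch are installed, of collecting all 12 + 1 hidden states from BERT-base and concatenating the last four layers for each token, one of the pooling strategies mentioned above:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("I went to a bank to deposit some money.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states        # tuple of 13 tensors: embedding layer + 12 encoders
print(len(hidden_states))                    # 13

# Concatenate the last 4 layers for every token: (batch, seq_len, 4 * 768)
last_four = torch.cat(hidden_states[-4:], dim=-1)
print(last_four.shape)
```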
61. BERT Architecture
• As the original paper proposed, the BERT input embedding is the sum of the token embeddings,
segment embeddings and position embeddings.
Image source: https://arxiv.org/pdf/1810.04805.pdf
62. BERT pre-training
• BERT is pretrained using two methods:
• Masked LM:
• Unlike the masking that we discussed in the Transformer architecture, in this case some
random words are replaced with a special token [MASK]: approximately 15% of all the
sub-word tokens.
• The prediction can then be defined as a language model given both the left and right context:
P(t_i | t_1, t_2, ..., t_{i-1}, t_{i+1}, ..., t_n)
• The masking strategy is further divided into three parts:
• In 80% of instances it is the [MASK] token
• In 10% of instances it is a random word
• In 10% of instances there is no change
63. BERT pre-training
• Next sentence prediction:
• This is primarily a classification task: does the next sentence follow the
previous sentence or not?
• The training set has 50% instances where sentence B follows sentence A and
50% negative cases where sentence B is replaced with some random
sentence.
• This type of training is helpful in Q&A tasks, where question and answer are
represented as a pair.
• Pre-trained BERT embeddings are the new state-of-the-art word
embedding representation for most NLP tasks.
• The original model outperformed the previous SOTA benchmarks.
64. BERT attention layer visualization
• Two sentences are taken as input:
• "Who does not like chocolate"
• "Even a grown up would want to have a nice bite"
• Using the BERTviz tool1, we visualize the attention from the second sentence to the first sentence.
[Figure: attention by head 11 of layer 11, and attention by head 1 of layer 11]
65. BERT attention visualization
• It appears that different heads, even in the same layer, can capture different relations between
sentences. In head 11, all words attended to ‘chocolate’, while in head 0 the attention was spread over
most of the words.
• In the attention within the same sentence, it appears almost all the words attended to ‘bite’;
the attention from ‘want’, ‘have’ and ‘nice’ towards ‘bite’ is higher.
• We observe that different layers capture different features. The type of feature captured
by each head may not be describable in exact terms like syntactic, semantic, etc.
66. BERT pre-trained models
• Several pretrained models are available from the HuggingFace1 team.
• Is BERT_large, with >300M parameters, big?
• Can we squeeze the performance into some smaller model?
• Should we train more, with more layers and attention heads?
• Both questions have the same answer - yes.
• Two models, DistilBERT and GPT-2-XL, are the answers to the above questions.
• DistilBERT is a smaller model with similar performance.
• GPT-2-XL has 48 layers!!!
• A better training strategy can give a better result. RoBERTa_large is a robustly trained
version of BERT_large; similarly, a base version of RoBERTa is available.
1. https://huggingface.co/transformers/pretrained_models.html
67. DistilBERT
• Knowledge distillation is a technique under which a smaller model is trained to mimic the
behaviour of a larger model.
• It is sometimes called teacher-student learning, where the student is the smaller model and the
teacher is the bigger model.
• It was generalized by Hinton et al.
• The student is trained to learn the full output distribution of the teacher.
• The training of the model has a small change: the student is not trained against the
gold labels but against the probabilities of the teacher:
$$L = -\sum_i t_i \log s_i$$
68. DistilBERT
• How was the model trained?
• The model was trained with a distillation loss based on the Kullback-Leibler divergence score.
• It measures the divergence between two probability distributions:
$$KL(t \,\|\, s) = \sum_i t_i \log t_i - \sum_i t_i \log s_i$$
• The loss is a linear combination of the masked LM loss and the distillation loss.
• Model parameter changes:
• The next-sentence classification objective was dropped compared to the original version.
• The number of layers was reduced by a factor of two.
• Did it affect the performance?
• Yes, it did. Still, it is able to retain 95% of the performance of the original BERT.
69. DistilBERT
• Another trick to capture the performance of the teacher was weight initialization: the student's
layer weights were initialized with already learned weights from the teacher.
• It was trained on larger batches with the masked language model objective, similar to the original
BERT method.
• Visualization of attention in DistilBERT:
• In this case as well, the word ‘chocolate’ was attended to by
relevant words like ‘bite’, etc.
70. RoBERTa
• It was proposed by the Facebook AI team.
• It is a training strategy to learn a better representation compared to BERT.
• Two important changes compared to the original pre-training:
• Static masking vs dynamic masking: in the original paper, the [MASK] tokens were chosen statically
before training. In this work, the data was duplicated 10 times so that different masking patterns
could be observed for the same context. This did not improve the results.
• Training with a higher batch size compared to the original pre-training led to better accuracy.
• This model dropped the next sentence prediction objective used in the original BERT.
71. RoBERTa
• This model was trained on a huge amount of data, ~160GB.
• It was trained with a batch size of 8K, compared to a batch size of 256 in BERT.
• Lastly, it was trained for a longer duration.
• It was able to beat the SOTA on different tasks. On GLUE, it surpassed XLNet. It
appears that proper training can give a better result.
72. RoBERTa Visualization
• Architecturally it is similar to BERT. In attention head 4 of layer 5, the word ‘chocolate’ is
attended to by ‘like’ and ‘bite’, which is similar to a human understanding of it.
• Layer 4 captures attention similar to layer 10 of BERT that we have already seen.
• This supports the view that different features are captured at different layers.
73. sBERT
• It is a tuned BERT model that generates better representations for sentences, so that
performance on common similarity measures is improved.
• On seven semantic textual similarity tasks, even GloVe representations performed better
than the average of the BERT encodings.
• This model fine-tunes BERT for sentence similarity.
• It is trained using Siamese and triplet networks.
• In a Siamese network, two networks with the same architecture are placed with tied weights.
• The tuned models are task-specific.
• Classification task:
• In this task, the learned representations u and v from the Siamese network on BERT are
concatenated with the element-wise difference of u and v, i.e. |u − v|.
74. sBERT
• The concatenated representation is multiplied with a matrix W_t. This matrix's weights are
learned to increase the classification accuracy.
• The other two scenarios are:
• A regression task, where the last two layers are replaced with the cosine similarity between the
two vectors; the objective function is mean squared error.
• A triplet objective function, applied when sentence a has a positive relation with sentence p and a
negative relation with sentence q; the loss function then tries to place a closer to p and farther from q.
Image source: https://arxiv.org/pdf/1908.10084.pdf
75. sBERT training
• The model is trained with different hyper-parameters and strategies:
• Pooling strategy:
• They tried to pool BERT embeddings into a sentence representation using three strategies -
MAX, MEAN and [CLS].
• MEAN shows the best performance (a sketch of MEAN pooling follows below).
• For the classification task, the vectors u and v were concatenated in different ways, but they
achieved the best performance when |u − v| was concatenated with u and v.
• Observation:
• Fine-tuning is needed and is task-specific.
• Transfer learning with BERT can be used; this is fine-tuning of BERT for a specific task.
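• A sketch of the MEAN pooling strategy above: token embeddings from plain BERT are averaged (using the attention mask) into fixed-size sentence vectors, which are then compared with cosine similarity; a fine-tuned sBERT model from the sentence-transformers library would be used the same way in practice:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["I went to a bank to deposit some money.",
             "I was standing at a river bank."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state     # (batch, seq_len, 768)

# MEAN pooling: average the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_vectors = (token_embeddings * mask).sum(1) / mask.sum(1)

cos = torch.nn.functional.cosine_similarity(sentence_vectors[0], sentence_vectors[1], dim=0)
print(cos.item())
```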
76. Pre-trained models
• Most major models are huge in nature and take a lot of computational resources to
train. These models are open-sourced by researchers or tech companies.
• This enables other researchers to use transfer learning to fine-tune them for their task.
• The most recent model in the GPT series (GPT-3) has not been open-sourced.
• Another version of ELMo, based on the transformer architecture, has also been
released.
• As the models are getting heavier and heavier, a model by NVIDIA (Megatron-LM) has
8,300M parameters!!
• So, word embeddings are still evolving. But BERT and ELMo were the ‘VGG16’ moment
for NLP!!!