Word embedding
Submitted by :Shivani Choudhary (srz208250)
What is Word Embedding?
• Natural language processing (NLP) models do not work with plain text, so a numerical
representation is required.
• Word embedding is a class of techniques in which a word is represented as a real-valued
vector.
• It is a representation of a word in a continuous vector space.
• It is a dense representation in a vector space.
• It can be of much smaller dimension than sparse representations such as one-hot encoding.
• Most word embedding methods are based on the “distributional hypothesis” of Zellig
Harris.
What is word embedding? continued
• The Distributional Hypothesis is that words that occur in the same contexts tend to have
similar meanings. (Harris, 1954)
• Word embeddings are designed to capture similarities between words, such as meaning,
morphology and context.
• The captured relationships help with downstream NLP tasks like chat-bots, text
summarization, information retrieval, etc.
• Embeddings are generated via co-occurrence matrices, dimensionality reduction and neural networks.
• They can be broadly categorized into two families: frequency-based embeddings and prediction-
based embeddings.
• The earliest work to give a vector representation was the vector space model used in
information retrieval.
Vector space model
• A document is represented in a vector space.
• The dimensionality of the vector space is the number of unique words in the corpus.

[Figure: Doc 1, Doc 2 and Doc 3 projected into a 3-dimensional space whose axes are Term 1, Term 2 and Term 3]

         Term 1   Term 2   Term 3
  Doc 1     0        5        5
  Doc 2     2        0        1
  Doc 3     3        3        0

• Hypothetical corpus with three words represented as dimensions.
• Three documents projected into the vector space according to their term frequencies.
Vector space model continued
• Each document gets a numerical vector representation in a vector space whose dimensions are words.
• E.g.
• Doc 1 -> [0, 5, 5]
• Doc 2 -> [2, 0, 1]
• This representation is sparse in nature, because in real-life scenarios the dimensionality of a corpus
shoots up to millions.
• It is based on term frequency.
• TF-IDF normalization is applied to reduce the weight of frequent words like ‘the’, ‘are’, etc. (a small sketch follows below).
• One-hot encoding is a similar technique to represent a sentence/document in vector space.
• This representation gathers limited information and fails to capture the context of a word.
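A minimal Python sketch of count and TF-IDF document vectors, assuming a recent version of scikit-learn is available; the toy corpus and variable names are illustrative only, not from the slides.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "India won the match",
    "I like the match",
    "the match was close",
]

# Raw term-frequency vectors: one dimension per unique word in the corpus.
count_vec = CountVectorizer()
tf_matrix = count_vec.fit_transform(docs)        # sparse (n_docs x vocab_size) matrix
print(count_vec.get_feature_names_out())
print(tf_matrix.toarray())

# TF-IDF down-weights words that appear in many documents (e.g. "the").
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(docs)
print(tfidf_matrix.toarray().round(2))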
Co-occurrence matrix
• It is used to capture the neighbouring words that appear with the word under
consideration. A context window is used to count co-occurrences.
• E.g.:
• India won the match. I like the match.
• Co-occurrence matrix for the above two sentences with a context window of 1 (a construction sketch follows after the matrix).
         India   won   the   match   I   like
  India    1      1     0      0     0     0
  won      1      1     1      0     0     0
  the      0      1     1      1     0     1
  match    0      0     1      1     0     0
  I        0      0     0      0     1     1
  like     0      0     1      0     1     1
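A plain-Python sketch of building a word co-occurrence matrix with a context window of 1, under the assumption that a word is not counted as co-occurring with itself (the diagonal convention may therefore differ from the table above).

from collections import defaultdict

sentences = [["india", "won", "the", "match"], ["i", "like", "the", "match"]]
window = 1

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}
cooc = defaultdict(int)

for sent in sentences:
    for i, word in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:                                   # skip the word itself
                cooc[(index[word], index[sent[j]])] += 1

for w in vocab:
    row = [cooc[(index[w], index[v])] for v in vocab]
    print(f"{w:>6}", row)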
Co-occurrence matrix continued
• Representations like one-hot encoding, count-based methods and co-occurrence-matrix-based
methods are very sparse in nature.
• Context is either limited or absent altogether.
• There is a single representation for a word in every context.
• Relations between two words, e.g. semantic reasoning, are not possible with these
representations.
• Context is limited and predetermined.
• Long-term dependencies are not captured.
Prediction based word embeddings
• It is a method to learn a dense representation of a word from a very high-dimensional
representation.
• It is a modular setup: a sparse vector is fed into a model that generates a dense
representation.

  Word -> one-hot encoded representation -> word embedding model -> dense vector
  One-hot encoded representation of India = [0, 1, 0, .... 0]
  V(India) = [0.1, 2.3, -2.1, ...., 0.1]
Language modelling
• Word embedding models are very closely related to language modelling.
• Language modelling tries to learn a probability distribution over the words in a vocabulary (V).
• The prime task of a language model is to calculate the probability of a word Wᵢ given the previous (n−1)
words, mathematically P(Wᵢ | Wᵢ₋₁, ..., Wᵢ₋ₙ₊₁).
• In count-based models, n-gram probabilities are calculated from the frequencies of the constituent n-grams.
• In a neural network we achieve the same using a softmax layer.
• We exponentiate the score of Wᵢ and normalize it by the sum over all the words in the vocabulary.
• P(Wᵢ | Wᵢ₋₁, ..., Wᵢ₋ₙ₊₁) = exp(hᵀ V′_Wᵢ) / Σ_{w ∈ V} exp(hᵀ V′_w)
• In this case, h is the representation from the hidden layer and V′_Wᵢ is the output embedding of word Wᵢ.
• The inner product hᵀ V′_Wᵢ gives the (unnormalized) log probability of word Wᵢ (a numerical sketch follows below).
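A small numpy sketch of the softmax step described above: given a hidden state h and an output embedding matrix (both filled with random stand-in numbers here), the probability of each word is the normalized exponential of the inner products.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 8, 4

h = rng.normal(size=hidden_dim)                     # hidden representation of the context
V_out = rng.normal(size=(vocab_size, hidden_dim))   # output word embeddings V'

scores = V_out @ h                                  # h^T V'_w for every word w
probs = np.exp(scores - scores.max())               # subtract max for numerical stability
probs /= probs.sum()

print(probs, probs.sum())                           # a distribution over the vocabulary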
Classical Neural language model
• It was proposed by Bengio et al., 2003.
• It consists of a one-layer feed-forward neural
network that predicts the next word in a sequence.
• The model tries to maximize the probability
as computed by the softmax.
• Bengio et al. introduced three concepts:
• Embedding layer: a layer that generates
word embeddings by multiplying an index
vector with a word embedding matrix.
Classical Neural language model continued
• Intermediate layers: one or more layers that produce an intermediate representation of
the input, e.g. a fully-connected layer that applies a non-linearity to the concatenation of
the word embeddings of the n previous words.
• Softmax layer: the final layer that produces a probability distribution over the words in V.
• The intermediate layer can be replaced with an LSTM.
• The network has a computational bottleneck in the softmax layer, where a probability
over the whole vocabulary needs to be computed.
• Neural word embedding models made significant progress with the Word2vec model
proposed by Mikolov et al. in 2013.
Word2Vec
• It was proposed by Mikolov et al. in 2013.
• It is a two-layer shallow neural network trained to learn contextual relationships.
• It places contextually similar words near each other.
• It is a co-occurrence based model.
• Two variants of the model were proposed:
• Continuous bag of words model (CBOW)
• Given the context words, predict the center word.
• The order of context words is not considered, so this representation is similar to BOW.
• Skip-gram model
• Given the center word, predict the context words.
What does context mean?
• Context is co-occurrence of words. It is a sliding window around the word under
consideration.

  India is now inching towards a self reliant state
  (the window slides across the sentence, one word at a time)

Window size = 2; the yellow patch marks the word under consideration and the orange box marks its context window.
CBOW continued
• Goal: predict the center word, given the context words.

  One-hot vectors of the context words W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2}
    -> projection matrix P of shape V x D (to be learned): W_i * P
    -> C = average of the context vectors
    -> output projection matrix M of dimension D x V: C.M
    -> softmax layer
    -> cross-entropy loss against the one-hot vector of W_t
CBOW continued
• The one-hot encodings of the context words W_{t-2}, ..., W_{t+2} are the input to the model.
• A projection matrix of shape V x D, where V is the total number of unique words in the corpus and
D is the dimension of the dense representation, projects each one-hot encoded vector into a
D-dimensional vector.
• The averaged context vector is projected back to V-dimensional space, and a softmax layer converts
the representation into probabilities for W_t.
• The model is trained using the cross-entropy loss between the softmax output and the
one-hot encoded representation of W_t (a gensim training sketch follows below).
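A hedged sketch of training a CBOW Word2Vec model with gensim (the 4.x API is assumed), using the Reuters corpus from NLTK as in the result slides; hyperparameter values here are illustrative.

from gensim.models import Word2Vec
from nltk.corpus import reuters            # requires nltk.download('reuters') beforehand

sentences = [[w.lower() for w in sent] for sent in reuters.sents()]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimension D of the dense representation
    window=15,         # context length, as used in the later result slides
    min_count=5,
    sg=0,              # sg=0 selects CBOW; sg=1 would select skip-gram
    epochs=50,
)

# Nearest neighbours by cosine similarity in the learned embedding space.
print(model.wv.most_similar("crude", topn=5))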
Skip-gram model: High level
• Goal: predict the context words W_{t-2}, ..., W_{t+2} given the word W_t.

  One-hot vector of W_t
    -> projection matrix P of shape V x D (to be learned): W_t * P
    -> C = center word vector
    -> output projection matrix M of dimension D x V: C * M
    -> softmax layer
    -> cross-entropy loss against the one-hot vectors of the context words W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2}
Skip-gram continued
• An end-to-end flow of one training step (this representation is taken from the lectures of Manning on YouTube):

  one-hot vector of W_t (V x 1)
    x Embedding matrixᵀ (D x V)  ->  center word vector v_c (D x 1), e.g. [0.2, 0.1, 0.4, 0.8, 0.2]
    x Context matrix uᵀ (V x D, shared across all context-word predictions)
        ->  scores u_oᵀ v_c (V x 1)
    ->  softmax(u_oᵀ v_c)  ->  a probability distribution over the vocabulary
    ->  compared against the ground-truth one-hot vector of the context word (e.g. W_{t-1})

  (a numpy sketch of this step follows below)
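A numpy sketch of the single training step pictured above, with hypothetical numbers: one-hot center word, center vector v_c, scores u_oᵀ v_c for every vocabulary word, softmax, and cross-entropy against the one-hot context word.

import numpy as np

V, D = 8, 5
rng = np.random.default_rng(1)

W_embed = rng.normal(scale=0.1, size=(D, V))     # embedding matrix (D x V)
W_context = rng.normal(scale=0.1, size=(V, D))   # context matrix (V x D), shared

center = np.zeros(V); center[3] = 1.0            # one-hot vector of W_t
truth = np.zeros(V); truth[4] = 1.0              # one-hot vector of the true context word

v_c = W_embed @ center                           # D-dimensional center word vector
scores = W_context @ v_c                         # u_o^T v_c for every word o
probs = np.exp(scores - scores.max())
probs /= probs.sum()                             # softmax(u_o^T v_c)

loss = -np.sum(truth * np.log(probs))            # cross-entropy against the ground truth
print(probs.round(3), loss)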
Skip-gram continued
• It focuses on optimizing the loss for each word.
• P(o | c) = exp(u_oᵀ v_c) / Σ_{w=1}^{V} exp(u_wᵀ v_c)
• It calculates the probability of an output context word o given the center word c.
• The loss function it tries to minimize is:
  J(θ) = − (1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log p(W_{t+j} | W_t)
• The log probability is calculated as in the first equation.
• Naive training is costly because the gradient calculation is of order V.
• Two computationally efficient methods were proposed:
• Hierarchical softmax
• Negative sampling
Skip-gram continued
• Training with negative sampling is more prevalent.
• In the earlier example, tuples like (India, the) and (India, now) are true cases.
• Any corrupted tuple, like (India, reliant) or (India, state), is called a negative sample.
• With a modified objective function, this turns into a logistic regression that classifies a
tuple as a true combination or a corrupt one.
• The corrupt tuples are generated by sampling such that less frequent words are picked
more often than their raw frequency would suggest.
Word Embedding visualization
Top 5 similar words:

  crude:    barrel 0.548, oteiba 0.464, netback 0.45, refinery 0.438, pipeline 0.421
  ship:     vessel 0.623, port 0.575, tanker 0.496, navigation 0.471, crane 0.463
  computer: software 0.602, micro 0.559, printer 0.542, mainframe 0.538, hemdale 0.527

• Even with a smaller corpus it can capture semantically relevant words.

t-SNE 2D projection of Word2vec (gensim implementation) embeddings of the top 10 similar words, trained for 50 epochs on the Reuters news
corpus from NLTK, with context length 15 and vector dimension 100.
Word2vec results
• The top 5 similar words when:

Context length is 30:
  crude:    barrel 0.475, refinery 0.438, stockdraws 0.427, yates 0.408, utilized 0.382
  ship:     vessel 0.557, tanker 0.506, port 0.5, icebreaker 0.461, loaded 0.453
  computer: software 0.569, micro 0.517, memory 0.498, disk 0.495, printer 0.476

Embedding dimension is 300:
  crude:    refinery 0.27, stockdraws 0.254, barrel 0.244, utilized 0.242, liquefied 0.239
  ship:     vessel 0.468, crew 0.318, tanker 0.308, shipbuilder 0.302, yard 0.288
  computer: software 0.441, disk 0.345, printer 0.345, uccel 0.338, scientific 0.335
Analogies
• Representation of an analogy in vector space using word2vec vectors:
• The vector representation of “king − man + woman” is
roughly equivalent to the vector representation
of “queen”.
• Using gensim and pretrained word2vec, the 5 words
closest to the analogy vector for “king − man + woman”
are (a usage sketch follows below):
  queen 0.7118, monarch 0.619, princess 0.5902, crown_prince 0.55, prince 0.54
Image taken from https://jalammar.github.io/illustrated-word2vec/
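A hedged sketch of the “king − man + woman” analogy with gensim; the pretrained Google News word2vec file is assumed to have been downloaded locally under its usual name.

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

# most_similar adds the 'positive' vectors, subtracts the 'negative' ones,
# and returns the nearest words by cosine similarity.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
# Expected to rank 'queen' first, as in the list above.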
GloVe: Global Vectors for Word Representation
• It is an unsupervised learning method for word representations.
• It is based on a co-occurrence matrix.
• The co-occurrence matrix is built over the whole corpus.
• It is able to capture global context.
• It combines the best of two model families:
• Local context window methods
• Global matrix factorization
• Earlier, matrix factorization methods like LSA were used to reduce the dimensionality.
• Two things that the GloVe model captures:
• Statistics, through the co-occurrence matrix
• Context, by considering the neighbouring words
Glove continued
• It moves away from pure matrix factorization: by considering relational reasoning
(semantic and syntactic), GloVe tries to learn the representation for words.
• It can be represented as a factorization:

  Word co-occurrence matrix (words x context)
    = Word feature matrix, i.e. the embedding matrix (words x features)
      * Feature-context matrix (features x context)
GloVe continued
• How does GloVe learn embeddings?
• It considers word-word co-occurrence probabilities as the signal of a relation between words.
• The authors presented a relation with “steam” and “ice” as target words.
• It is common for “steam” to occur with “gas” and “ice” with “solid”.
• Other co-occurring words are “water” and “fashion”: “water” shares properties with both, while “fashion” is
irrelevant to both.
• Only the ratio of probabilities cancels out the noisy words like “water” and “fashion”.
• As presented in the authors' table, the ratio of probabilities P(k|ice) / P(k|steam) is
high for k = solid and small for k = gas.
GloVe continued
• What is the optimization function for GloVe?
• In a co-occurrence matrix X, the entry X_ij is the co-occurrence count of words i and j.
• X_i is the total number of times any word appears in the context of word i.
• P_ij = P(j|i) = X_ij / X_i is the probability that word j appears in the context of word i.
• For a combination of three words i, j, k, a general form of the model is:
  F(W_i, W_j, W̃_k) = P_ik / P_jk
• The optimization function proposed by the authors is:
  J = Σ_{i,j=1}^{V} f(X_ij) (W_iᵀ W̃_j + b_i + b̃_j − log X_ij)²
Glove Continued
• Here V is the size of the vocabulary, W_i and b_i are the vector and bias of word i,
and W̃_j and b̃_j are the context vector and its bias. The last term, log X_ij, is the log
of the co-occurrence count of i and j.
• The weighting function f(X) should have the following properties (a numerical sketch follows below):
• It tends to zero as X → 0.
• It should be non-decreasing, so that rare co-occurrences are not overweighted.
• It should not overweight frequent co-occurrences.
• The choice of f(X) is:
  f(X) = (X / X_max)^α if X < X_max, otherwise 1
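A small numpy sketch of the weighting function f(X) and of a single term of the objective J; the values X_max = 100 and α = 0.75 are the common choices from the GloVe paper, and the vectors and counts below are random stand-ins.

import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """Weighting: (x / x_max)^alpha for x < x_max, and 1 otherwise."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

rng = np.random.default_rng(0)
D = 50
w_i, w_j = rng.normal(size=D), rng.normal(size=D)    # word vector and context vector
b_i, b_j = 0.1, -0.2                                  # their biases
x_ij = 42.0                                           # co-occurrence count X_ij

# One term of J: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2
term = f(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2
print(f(np.array([1.0, 50.0, 200.0])), term)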
Glove Continued
• The model has the following computational bottlenecks:
• Creating a large co-occurrence matrix of size V × V.
• The model's computational complexity depends on the number of non-zero elements.
• During training, the context window needs to be sufficiently large so that the model can
distinguish left context from right context.
• Words that are more distant from each other contribute less to the count, because distant
words contribute less to the relationship between the words.
• The model generates two sets of vectors, W and W̃. The average of both is used as the
representation of a word.
Glove results
t-SNE 2D projection of GloVe embeddings of the top 5 similar words, trained for 50 epochs on the Reuters news corpus from
NLTK, with context length 15 and vector dimension 100.

Top 5 most relevant word list:
  crude:    barrel 0.752, posting 0.58, raise 0.537, light 0.505, sour 0.502
  ship:     loading 0.58, kuwaiti 0.54, missile 0.537, vessel 0.522, flag 0.522
  computer: wallace 0.595, software 0.592, microfilm 0.559, microchip 0.536, technology 0.52
Are Word2vec and GloVe enough?
• Neither embedding can deal with out-of-vocabulary words.
• Both capture context, but only in a limited sense.
• They always produce a single embedding for the word under consideration.
• They can’t distinguish:
• “I went to a bank.” and “I was standing at a river bank.”
• They will always produce a single representation for both contexts.
• Both give better performance than encodings like TF-IDF, count vectors, etc.
• Do pretrained models help?
• Models pretrained on huge corpora show better performance than models trained on small corpora.
• Pretrained Word2vec models2 are available from Google and GloVe models1 from Stanford’s
website.
1. https://nlp.stanford.edu/projects/glove/
2. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
Fasttext
• It was proposed by the Facebook AI team.
• It was primarily meant to handle the out-of-vocabulary issue of GloVe and Word2vec.
• It is an extension of Word2vec.
• The model relies on character n-grams rather than whole words to generate the embeddings.
• The model relies on the morphological features of a word.
• The character n-grams of a word can be represented as below (a decomposition sketch follows below):
• For the word <where> and n = 3, the character n-grams are:
• <wh, whe, her, ere, re>
• The final representation of the word “where” is the sum of the vector representations of <wh,
whe, her, ere, re>.
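A plain-Python sketch of the character n-gram decomposition described above; the word is wrapped in '<' and '>' before the n-grams are extracted. (In the full fastText model, several n values and the whole word itself are also used as units, which is omitted here.)

def char_ngrams(word, n=3):
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>']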
Fasttext continued
• The modified scoring function is:
  s(w, c) = Σ_{g ∈ G_w} z_gᵀ v_c
• Here G_w is the set of n-grams of word w, z_g is the vector representation of n-gram g, and v_c is the context vector.
• Learning n-gram vectors enables the model to build representations for out-of-vocabulary
words as well.
Fasttext results
t-SNE 2D projection of fastText embeddings (gensim implementation) of the top 15 similar words, trained for 50 epochs on
the Reuters news corpus from NLTK, with context length 15 and vector dimension 100.

Top 5 most relevant word list:
  crude:    cruz 0.582, barrel 0.561, cruise 0.501, crumble 0.433, jude 0.41
  ship:     shipyard 0.714, steamship 0.703, shipowner 0.688, shipper 0.668, vessel 0.667
  computer: supercomputer 0.843, computerized 0.823, computerland 0.773, software 0.54, microfilm 0.52
Observation
• None of these representations captures contextual meaning, i.e. a representation that
depends on how the word is used.
• They are based on a dictionary lookup to get the embeddings.
• They give limited performance on tasks such as question answering and summarization compared to
current state-of-the-art models like ELMo (LSTM based), BERT (transformer based), etc.
ELMo
• Language is complex; the meaning of a word can vary from context to context.
• E.g.
• I went to a bank to deposit some money.
• I was standing at a river bank.
• The two instances of “bank” have separate meanings.
• Earlier models assign the same representation to the word in both scenarios.
• A solution is to have multiple levels of understanding of the text.
• A new model which can capture context:
• ELMo: Embeddings from Language Models
ELMo continued
• It is based on a deep learning framework.
• It generates contextualized embeddings for a word.
• It models complex characteristics of words (e.g. syntactic and semantic features).
• It models linguistic variation across contexts, e.g. polysemy.
• The authors argue that more abstract linguistic characteristics are captured in the higher
layers.
• It is based on a bidirectional LSTM model.
• The bidirectional model helps capture dependencies on both the past words and the future words.
ELMo Architecture
• Block diagram of ELMo:

  Word embedding x_i (char-CNN)  ->  L layers of forward and backward LSTMs  ->  softmax

• The number of LSTM layers in the original
implementation is two.
• The word embedding is calculated by a character-level
CNN.
• The final embedding is a weighted sum of the
hidden layers and the embedding layer.
• Three different representations can be obtained:
• Hidden layer 1
• Hidden layer 2
• Weighted sum of the hidden layers and the
embedding layer
• The coming slides go through each block.
ELMo input Embedding
• The input embedding is generated by a combination of a character CNN and a highway
network:

  Character embedding (e.g. for “India”)  ->  CNN with max pooling  ->  2-layer highway network  ->  LSTM
https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo/
Character CNN embedding and Highway network
Embedding layer
• In the first step, a lookup is performed
to get character embeddings.
• A 1-D convolution is applied on the
embeddings, followed by a max
pooling layer.
• The highway network acts as a gate
that determines how much of the
original information passes directly to the
output and how much via the
projection.
• The generated output is passed
as input to the 2-layer LSTM
structure.
• In the original paper there were two
highway layers.
• Character-level embedding
enables the model to learn a
representation for any word, so
it can handle out-of-vocabulary
words as well.
Source: http://web.stanford.edu/class/cs224n/
LSTM layer and Embedding
• The architecture has bidirectional LSTMs that predict the next word from both directions; together they form a bidirectional language model (biLM).
The full embedding method is described in the figure:

  token embedding x_k  ->  biLM layer 1 (h_{k,1})  ->  biLM layer 2 (h_{k,2})  ->  concatenate the hidden layer representations h_{k,0}, h_{k,1}, h_{k,2}

• Two separate LSTM stacks
implement a language model
in each direction.
• The forward LM's top layer
predicts the next token
using the softmax layer.
• Similarly, the backward LM
predicts the previous token
using its softmax scores.
• Each of the L forward LSTM layers
generates a
contextualized
representation h→_{k,j} for token
t_k, where j is 1, 2, ..., L.
LSTM layer and Embedding continued
• Similarly, each of the L backward LSTM layers generates a contextualized
representation h←_{k,j} for token t_k, where j is 1, 2, ..., L.
• In total 2L + 1 representations are generated: 2L by the hidden layers and one by the
embedding layer.
• The final representation is a weighted combination of the concatenated hidden vectors
and the embedding layer (a small sketch follows below).
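A numpy sketch of the weighted combination described above: with L = 2 LSTM layers there are 2L + 1 = 3 representations per token, and the final ELMo vector is a softmax-weighted sum of them scaled by a scalar gamma. The weights s_j and gamma below are hypothetical stand-ins; in practice they are learned per task.

import numpy as np

L, dim = 2, 6
rng = np.random.default_rng(0)

# h[0] is the (char-CNN) embedding layer, h[1] and h[2] the two biLM layers.
h = rng.normal(size=(L + 1, dim))

s_raw = np.array([0.2, 1.0, 0.5])              # unnormalized layer weights (stand-ins)
s = np.exp(s_raw) / np.exp(s_raw).sum()        # softmax-normalized weights s_j
gamma = 0.8                                    # task-specific scale (stand-in)

elmo_vector = gamma * (s[:, None] * h).sum(axis=0)
print(elmo_vector)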
ELMo continued
• Representation of the word ‘bank’ in different contexts: vectors are projected into 2D space
based on the vector representation of the word ‘bank’.
• It can generate a different
embedding for a word
depending on the context.
• The projection is based on the
average of the hidden
layers' and the embedding
layer's representations
from the pretrained model
on TensorFlow Hub.
• It can be fine-tuned to perform
different tasks like coreference
resolution, sentiment
analysis and question answering.
ELMo continued
• From the results presented by the authors, the higher layer tends to capture semantic features
and the lower layer captures syntactic features.
• The second layer's embedding outperforms the first layer's embedding in the word sense
disambiguation task, which is semantic in nature. On the other hand, in the POS tagging
task, the first layer's embedding outperforms the second layer's embedding.
Bidirectional Encoder Representations from
Transformers (BERT)
• A transformer-based model to learn contextual representations.
• It is designed to pre-train deep bidirectional representations from unlabeled text by
considering both left and right context.
• It is a pre-trained model.
• It uses the concept of transfer learning, similar to what ELMo does.
• Any use of BERT is based on a two-step process:
• Train a large language model on a huge amount of data, using an unsupervised or semi-supervised
method.
• Fine-tune the large model for a specific NLP task.
• Before we delve further into BERT, the following concepts need to be understood:
• Attention mechanism
• The Transformer architecture proposed by Vaswani et al.
Attention mechanism
• This concept was brought into NLP from computer vision tasks.
• Attention was first applied in NLP by Bahdanau et al. in 2015, using an additive
mechanism.
• It is similar to the way we process an image in our brain: we focus on some of
the parts and infer the other parts from that information, because not all parts of an image
are equally informative.
• In sentence processing as well, we attend to the relations between some of the words while
others get low attention.
• Example: “I was going to a crowded market.” — some word pairs receive a high level of attention, others a low level.
Attention mechanism
• In the above example, we tend to attend to ‘market’ in the context of ‘crowded’ and to ‘going’ in
the context of ‘market’, while other word pairs receive low attention.
• The attention mechanism in NLP was first employed in the neural machine translation task.
• Seq-to-seq (encoder-decoder) models have an issue with longer sentences: they fail to
remember relations between distant words.
• The attention mechanism was designed to capture long-distance relationships between
words.
• The attention mechanism can be understood as a vector of importance weights.
• The next slide presents the basics of the attention mechanism.
Attention mechanism continued
• The intuition is like a query fired by us to extract some information from a database:
the query is matched against each key (Key 1, ..., Key 4) and generates a similarity score
(Score 1, ..., Score 4).
• Each key kᵢ and the query q are vectors of some dimension d.
• The scoring function can be of different types:
• Simple dot product: qᵀkᵢ
• Scaled dot product: qᵀkᵢ / √d
• General dot product: qᵀW kᵢ
• Additive: W_qᵀq + W_kᵀkᵢ
• The scores are converted into weights a₁, ..., a₄ using a softmax operation.
• The attention value is the weighted sum Σᵢ aᵢvᵢ of the value vectors v₁, ..., v₄ (a numpy sketch follows below).
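A numpy sketch of single-query scaled dot-product attention, matching the diagram above: scores = q·kᵢ / √d, weights from a softmax, and the attention value as the weighted sum of the value vectors. All vectors here are random stand-ins.

import numpy as np

d = 4
rng = np.random.default_rng(0)

q = rng.normal(size=d)            # query vector
K = rng.normal(size=(3, d))       # one key per item (3 keys here)
V = rng.normal(size=(3, d))       # one value per key

scores = K @ q / np.sqrt(d)       # scaled dot-product scores
a = np.exp(scores - scores.max())
a /= a.sum()                      # attention weights a_i from the softmax

attention_value = a @ V           # weighted sum of the value vectors
print(a.round(3), attention_value)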
Attention mechanism continued
• The final attention value is a vector, while each aᵢ and each score is a scalar.
• The general framework of the attention mechanism generates a weighted combination of the value
vectors.
• Self-attention is also known as intra-attention.
• It is an attention mechanism that relates different words of a single sequence to generate a
representation of that same sequence.
• With this basic understanding of the attention mechanism, we can move to the Transformer
architecture proposed by Vaswani et al.
Transformer model
Image source: https://arxiv.org/pdf/1706.03762.pdf
• The Transformer architecture was proposed by the Google AI
team.
• It is an encoder-decoder architecture.
• The core modules of this architecture are:
• Multi-head attention
• Positional encoding
• Attention mechanism
• Masked multi-head attention
• Residual connections
• This model solves the issues with recurrent networks:
• Failure to capture long-distance word-to-word
relations.
• RNNs are sequential in nature; this architecture
can be parallelized.
• Each attention head learns a different set of
features.
• This model does away with recurrence.
Input embedding and positional encoding
• Embeddings are collected from a pre-trained model using a dictionary lookup.
• Unlike an RNN, which takes the input sequentially, it takes the whole sentence as input.
• Without positional information, it would be similar to a bag-of-words model.
• How is positional information calculated?
• The positional embedding is calculated using an alternating combination of sine and cosine:
  PE(pos, 2i) = sin(pos / 10000^(2i/d))   and   PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
• Why do they use periodic functions with varying frequency?
• This encoding method generates a different encoding for each position, and the
distance between two time steps is consistent.
• The position encoding is deterministic.
• The authors' view is that this encoding ensures that any PE_{pos+k} can be represented as a
linear combination of PE_{pos} (a numpy sketch follows below).
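A numpy sketch of the sinusoidal positional encoding defined above, computing PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)) for a few positions; the sizes are illustrative and d_model is assumed even.

import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # positions 0 .. max_len-1
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=6, d_model=8)
print(pe.round(3))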
Input embedding and positional encoding
  e₁ + p₁ ,   e₂ + p₂ ,   e₃ + p₃ ,  ...

Each embedding vector eᵢ is combined (added) with its positional vector pᵢ, and the resulting vector is fed into the attention layer. In this way, positional
information is combined with the word embeddings.
Multi-head attention.
  Query, Key, Value
    -> Linear projections (one per input)
    -> MatMul (Q with K)
    -> Scale
    -> Softmax
    -> MatMul (with V)
    -> Concatenate the heads
    -> Linear
Multi-Head attention
(same block as on the previous slide: Query, Key, Value -> Linear -> MatMul -> Scale -> Softmax -> MatMul -> Concatenate -> Linear)

• A linear layer maps input to output and/or
changes the vector dimension. Its weights are
learned during the training process.
• In the case of the original paper, the 512-dimensional
input was projected into a 64-dimensional vector
space.
• What do we feed into query, key and value?
• At the start of training, the same copies of
the word vectors are passed as
Query (Q), Key (K) and Value (V).
• The MatMul operation generates a matrix
that is a precursor to the attention filter. The values
are scaled by √d.
• Finally, the softmax layer generates the
probabilities across all the words.
• The result is a matrix that is called the
“attention filter”.
Multi-Head attention
• The probability matrix is multiplied with V, which generates a representation weighted by the
attention scores.
• This whole process is a self-attention mechanism that generates an encoding using the scaled dot-
product scoring function.
• The above explanation corresponds to a single head; there can be many such heads.
The original model has 8 heads.
• Each head's attention filter learns different features.
• The representations generated by the several heads are concatenated and passed to the next
stacked encoder.
• The final representation from the encoder stack is fed into the decoder stack.
Add and Norm layer
• Residual connections are placed in the model for:
• Knowledge preservation
• Handling of the vanishing gradient problem
• In this model, the normalization is applied per token:
• The mean and standard deviation are calculated for each word's representation.
• Using the calculated mean and standard deviation, the layer's values are normalized.
Masked multi-head attention
• It is an important feature of the decoder stack.
• Why do we need masking?
• When generating output, the decoder should not pay attention to future words, because the future words
have not been predicted yet. So those words are masked.
• Masking is done by setting the scores of future words to −∞, which ensures that the softmax, which is an exponential,
becomes zero for those words.
• Does this model have recurrence?
• At a cursory glance, it appears so: in the decoder, we feed the previous tokens as input to the model.
• But it is trained using a concept called teacher forcing.
• This means that when the output is known, we can directly supply the output representation to the model.
• By doing so, the model can be parallelized as well.
• The base model has 8 heads and 6 layers in the encoder and decoder stacks; keys and values are of
dimension 64.
Did the Transformer change the landscape in NLP?
• One of the best-known models, GPT, is based on the Transformer architecture.
• Current models like BERT and its variants and GPT-2 are based on the Transformer architecture.
• It started to take over the space of RNNs because it can be parallelized and can capture
context.
• It beat the benchmarks in the NMT task.
Getting back to BERT
• Proposed by Devlin et al. in 2019.
• It is an encoder-based model.
• It is based on the Transformer architecture that we discussed in the previous slides.
• It has only the encoder stack from the Transformer architecture.
• It is an unsupervised or semi-supervised pre-trained model, fine-tuned for a specific task
like Q&A, conversational AI, etc.
• It is a sub-word model with a vocabulary of about 30,000 tokens. The BERT tokenizer tokenizes the
words, so the representation of a tokenized word may not correspond directly to what we
passed as input.
• E.g.: the word “embeddings” is tokenized into [‘em’, ‘##bed’, ‘##ding’, ‘##s’] (a tokenizer sketch follows below).
• This approach helps to address out-of-vocabulary words as well.
• The two most common pre-trained architectures are: BERT_base and BERT_large.
• BERT_base has 12 stacked encoders and BERT_large has 24 stacked encoders.
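A hedged sketch using the HuggingFace transformers library (assumed installed) to reproduce the sub-word tokenization example above.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("embeddings"))
# ['em', '##bed', '##ding', '##s']  (as in the slide)

# Calling the tokenizer directly also adds the special [CLS] and [SEP] tokens:
ids = tokenizer("Word embeddings are useful")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))   # starts with '[CLS]' and ends with '[SEP]'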
BERT Continued
• The base version has 12 attention heads and the ‘large’ version has 16 attention heads.
• The original Transformer configuration described in its paper had 6 encoder layers and 8 attention
heads, leading to a 512-dimensional representation.
• How do we collect the representation from BERT?
• BERT uses two special tokens, [CLS] and [SEP].
• [CLS] is always the first token of the input.
• [SEP] marks the segmentation between sentences.
• We need to provide the segment ids as well in the input.
• Similar to the original Transformer model, the encoded embeddings are passed to the
subsequent encoders.
• Each position outputs a vector representation per token of size 768 for BERT_base
and 1024 for the large model.
BERT Continued
• For a classification task, we only use the embedding of the [CLS] token.
• What would the representation be for other tasks?
• There are several variants of embedding collection.
• BERT_base, with 12 encoder layers, generates 12 + 1 sets of embeddings, including one from the
input layer.
• Which layers should we use: all or some?
• Several experiments have been performed; the best performance is obtained by
concatenating the last 4 layers' representations (a sketch follows below).
• The next best representation is an average of the last 4 layers' representations.
• Each layer learns different features, so the pooling strategy depends on the
specific NLP task. The above two suggestions are based on performance on the NER
tagging task.
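A hedged sketch (HuggingFace transformers and PyTorch assumed) of collecting all layer outputs from BERT-base and building a token vector by concatenating the last four encoder layers, as suggested above.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("I went to a bank to deposit some money.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states        # tuple of 12 + 1 tensors (embedding layer + 12 encoders)
print(len(hidden_states))                    # 13 for BERT-base

# Concatenate the last 4 layers for every token: shape (batch, seq_len, 4 * 768).
token_vectors = torch.cat(hidden_states[-4:], dim=-1)
print(token_vectors.shape)

# The [CLS] vector (position 0 of the last layer) is the one used for classification.
cls_vector = hidden_states[-1][:, 0, :]
print(cls_vector.shape)                      # (1, 768)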
BERT Architecture
• As proposed in the original paper, the BERT input embedding is the sum of the token embeddings,
segment embeddings and position embeddings.
Image source: https://arxiv.org/pdf/1810.04805.pdf
BERT pre-training
• BERT is pretrained using two objectives:
• Masked LM:
• Unlike the masking discussed for the Transformer architecture, here some
randomly chosen words are replaced with a special [MASK] token: approximately 15% of all the
sub-word tokens.
• The prediction can then be defined as a language model conditioned on both the left and right context:
P(t_i | t_1, t_2, ..., t_{i−1}, t_{i+1}, ..., t_n)
• The masking strategy is further divided into three parts:
• In 80% of instances the token is replaced with [MASK]
• 10% of the time it is replaced with a random word
• 10% of the time it is left unchanged
BERT pre-training
• Next sentence prediction:
• This is primarily a classification task: does the next sentence follow the
previous sentence or not?
• The training set has 50% instances where sentence B follows sentence A, and
50% negative cases where sentence B is replaced with some random
sentence.
• This type of training is helpful in Q&A tasks, where question and answer are
represented as a pair.
• The pre-trained BERT embeddings became the new state-of-the-art word
embedding representation for most NLP tasks.
• The original model outperformed the previous SOTA benchmarks.
BERT attention layer visualization
• Two sentences are taken as input:
• "Who does not like chocolate"
• "Even a grown up would want to have a nice bite"
• Using the BERTviz tool1, we visualize the attention from the second sentence to the first sentence.
Attention by head 11 of layer 11 | Attention by head 1 of layer 11
BERT attention visualization
• It appears that different heads, even in the same layer, can capture different relations between
the sentences. In head 11 all words attended to ‘chocolate’, while in head 0 the attention was spread over
most of the words.
• In the attention within the same sentence, it
appears that almost all the words attended to ‘bite’;
the attention from ‘want’, ‘have’ and ‘nice’ towards ‘bite’ is
higher.
• We observe that different layers capture
different features. The type of feature captured
by each head may not be describable in exact
terms like syntactic, semantic, etc.
BERT pre-trained models
• Several pretrained models are available from the HuggingFace1 team.
• Is BERT_Large, with >300M parameters, big?
• Can we squeeze the performance into a smaller model?
• Should we train larger models with more layers and attention heads?
• Both questions have the same answer: yes.
• Two models, DistilBERT and GPT-2-XL, are the answers to the above questions.
• DistilBERT is a smaller model with similar performance.
• GPT-2-XL has 48 layers!!!
• A better training strategy can give a better result: RoBERTa_large is a robustly trained
version of BERT_Large; similarly, there is a base version of RoBERTa.
1. https://huggingface.co/transformers/pretrained_models.html
DistilBERT
• Knowledge distillation is a technique in which a smaller model is trained to mimic the
behaviour of a larger model.
• It is sometimes called teacher-student learning, where the student is the smaller model and the
teacher is the bigger model.
• It was generalized by Hinton et al.
• The student is trained to learn the full output distribution of the teacher.
• The training has a small change: the student is not trained against the
gold labels but against the probabilities of the teacher:
  L = −Σᵢ tᵢ · log(sᵢ)
DistilBERT
• How was the model trained?
• The model was trained using a distillation loss based on the Kullback-Leibler divergence.
• It measures the divergence between two probability distributions (a PyTorch sketch follows below):
  KL(t ‖ s) = Σᵢ tᵢ · log(tᵢ) − Σᵢ tᵢ · log(sᵢ)
• The loss is a linear combination of the masked LM loss and the distillation loss.
• Model parameter changes:
• The next-sentence classification objective was dropped compared to the original version.
• The number of layers was reduced by a factor of two.
• Did it affect the performance?
• Yes, it did. Still, it is able to retain 95% of the performance of the original BERT.
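A PyTorch sketch of the two ingredients described above: the soft cross-entropy against the teacher's probabilities and the KL-divergence between the teacher and student distributions. The logits are random stand-ins, and the combination weights of DistilBERT's full loss are omitted.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
teacher_logits = torch.randn(2, 10)             # teacher outputs for 2 examples, 10 classes
student_logits = torch.randn(2, 10)             # student outputs

t = F.softmax(teacher_logits, dim=-1)           # teacher probabilities t_i
log_s = F.log_softmax(student_logits, dim=-1)   # student log-probabilities log s_i

soft_ce = -(t * log_s).sum(dim=-1).mean()       # L = -sum_i t_i * log s_i
kl = F.kl_div(log_s, t, reduction="batchmean")  # KL(t || s) = sum_i t_i (log t_i - log s_i)

# The total DistilBERT loss is a weighted combination of the masked-LM loss
# and the distillation loss; the weights are hyperparameters, not shown here.
print(soft_ce.item(), kl.item())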
DistilBERT
• Another trick to capture the performance of the teacher was weight initialization: the
student's layer weights were initialized with the already-learned weights of the teacher.
• It was trained on larger batches with the masked language model objective, similar to the original
BERT method.
• Visualization of attention in DistilBERT:
• In this case as well, the word ‘chocolate’ is attended to by
relevant words like ‘bite’, etc.
RoBERTa
• It was proposed by the Facebook AI team.
• It is a training strategy to learn a better representation compared to BERT.
• Two important points compared to the original pre-training:
• Static vs dynamic masking: in the original paper, the [MASK] tokens were chosen statically
before training. In this work, the data was duplicated 10 times so that different masking patterns
are observed for the same context. This did not improve the result.
• Training with a higher batch size compared to the original pre-training led to better accuracy.
• This model dropped the next sentence prediction objective used in the original BERT.
RoBERTa
• This model was trained on a huge amount of data, ~160 GB of text.
• It was trained with a batch size of 8K, compared to a batch size of 256 in BERT.
• Lastly, it was trained for a longer duration.
• It was able to beat the SOTA on different tasks; on GLUE, it surpassed XLNet. It
appears that proper training can give a better result.
RoBERTa Visualization
• Architecturally it is similar to BERT. In attention head 4 of layer 5, the word
‘chocolate’ is attended to by ‘like’ and ‘bite’, which is similar to a human understanding of the sentence.
• Layer 4 captures attention
similar to layer 10 of BERT
that we have
already seen.
• This supports the view that
different features are
captured at different
layers.
sBERT
• It is a fine-tuned BERT model that generates better representations for sentences, so that
performance on common similarity measures improves.
• In seven semantic textual similarity tasks, even GloVe representations performed better
than the average of BERT encodings.
• This model fine-tunes BERT for sentence similarity.
• It is trained using Siamese and triplet networks.
• In a Siamese network, two networks with the same architecture are placed side by side with tied weights.
• The tuned models are task specific.
• Classification task:
• In this task, the learned representations u and v from the Siamese network on BERT are
concatenated with the element-wise difference of u and v, i.e. |u − v|.
sBERT
• The concatenated representation is multiplied with a matrix W_t, whose weights are
learned to increase the classification accuracy.
• The other two scenarios are:
• A regression task, where the last two layers are
replaced with the cosine similarity between the
two vectors; the objective function is
mean squared error.
• A triplet objective function, applied when
sentence a has a positive relation with
sentence p and a negative relation with sentence q;
the loss function then tries to place a
closer to p and farther from q.
Image source: https://arxiv.org/pdf/1908.10084.pdf
sBERT training
• The model is trained with different hyperparameters and strategies:
• Pooling strategy:
• They tried to pool the BERT embeddings into a sentence representation using three strategies:
MAX, MEAN and [CLS].
• MEAN showed the best performance.
• For the classification task the vectors u and v were concatenated in different ways, but they
achieved the best performance when |u − v| was concatenated with u and v.
• Observation:
• Fine-tuning is required and is task specific.
• Transfer learning of BERT can be used; this is fine-tuning of BERT for a specific task (a usage sketch follows below).
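A hedged usage sketch with the sentence-transformers library (assumed installed); the checkpoint name below is one of its published MEAN-pooled sBERT models and is an assumption, not something specified in the slides.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")   # MEAN-pooled sBERT checkpoint

sentences = [
    "I went to a bank to deposit some money.",
    "I was standing at a river bank.",
]
u, v = model.encode(sentences)                  # one fixed-size vector per sentence

# Cosine similarity between the two sentence embeddings.
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cosine)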
Pre-trained models
• Most of the major models are very large and take a lot of computational resources to
train. These models are open-sourced by researchers or tech companies.
• This enables other researchers to use transfer learning and fine-tune them for their task.
• The most recent model in the GPT series (GPT-3) has not been open-sourced.
• Another version of ELMo, based on the Transformer architecture, has also been
released.
• As the models get heavier and heavier, a model by NVIDIA (Megatron-LM) has
8300M parameters!!
• So, word embeddings are still evolving. But BERT and ELMo were the ‘VGG16’ moment
for NLP!!!
Thank You.
More Related Content

Similar to wordembedding.pptx

NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelHemantha Kulathilake
 
Word embeddings
Word embeddingsWord embeddings
Word embeddingsShruti kar
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Textkevig
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Textkevig
 
Text Representation & Fixed-Size Ordinally-Forgetting Encoding Approach
Text Representation & Fixed-Size Ordinally-Forgetting Encoding ApproachText Representation & Fixed-Size Ordinally-Forgetting Encoding Approach
Text Representation & Fixed-Size Ordinally-Forgetting Encoding ApproachAhmed Hani Ibrahim
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Satyam Saxena
 
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)Zachary S. Brown
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Jinpyo Lee
 
Word Space Models and Random Indexing
Word Space Models and Random IndexingWord Space Models and Random Indexing
Word Space Models and Random IndexingDileepa Jayakody
 
Word Space Models & Random indexing
Word Space Models & Random indexingWord Space Models & Random indexing
Word Space Models & Random indexingDileepa Jayakody
 
State-of-the-Art Text Classification using Deep Contextual Word Representations
State-of-the-Art Text Classification using Deep Contextual Word RepresentationsState-of-the-Art Text Classification using Deep Contextual Word Representations
State-of-the-Art Text Classification using Deep Contextual Word RepresentationsAusaf Ahmed
 
Pycon ke word vectors
Pycon ke   word vectorsPycon ke   word vectors
Pycon ke word vectorsOsebe Sammi
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesFelipe Moraes
 

Similar to wordembedding.pptx (20)

NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
Word embeddings
Word embeddingsWord embeddings
Word embeddings
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
Text Representation & Fixed-Size Ordinally-Forgetting Encoding Approach
Text Representation & Fixed-Size Ordinally-Forgetting Encoding ApproachText Representation & Fixed-Size Ordinally-Forgetting Encoding Approach
Text Representation & Fixed-Size Ordinally-Forgetting Encoding Approach
 
Word embeddings
Word embeddingsWord embeddings
Word embeddings
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
IR.pptx
IR.pptxIR.pptx
IR.pptx
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
 
Word Space Models and Random Indexing
Word Space Models and Random IndexingWord Space Models and Random Indexing
Word Space Models and Random Indexing
 
Word Space Models & Random indexing
Word Space Models & Random indexingWord Space Models & Random indexing
Word Space Models & Random indexing
 
State-of-the-Art Text Classification using Deep Contextual Word Representations
State-of-the-Art Text Classification using Deep Contextual Word RepresentationsState-of-the-Art Text Classification using Deep Contextual Word Representations
State-of-the-Art Text Classification using Deep Contextual Word Representations
 
Pycon ke word vectors
Pycon ke   word vectorsPycon ke   word vectors
Pycon ke word vectors
 
sentiment analysis
sentiment analysis sentiment analysis
sentiment analysis
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 

Recently uploaded

call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxabhijeetpadhi001
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 

Recently uploaded (20)

call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptx
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 

wordembedding.pptx

9. Language modelling
• Word embedding models are closely related to language modelling.
• A language model tries to learn a probability distribution over the words in a vocabulary V.
• The prime task of a language model is to calculate the probability of a word W_i given the previous (n-1) words: P(W_i | W_{i-1}, ..., W_{i-n+1}).
• In count-based models these probabilities over n-grams are estimated from the frequencies of the constituent n-grams.
• A neural network achieves the same with a softmax layer: the score of W_i is normalized by the sum of the scores over all words in the vocabulary,
  P(W_i | W_{i-1}, ..., W_{i-n+1}) = exp(h^T V'_{W_i}) / Σ_{W ∈ V} exp(h^T V'_W)
• Here h is the representation from the hidden layer and V'_{W_i} is the output embedding of word W_i.
• The inner product h^T V'_{W_i} gives the unnormalized log probability of word W_i. A small numerical sketch follows below.
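The following is a minimal sketch (not from the slides) of the softmax output layer described above; the hidden vector h and the output embedding matrix are random stand-ins.

```python
import numpy as np

# Toy sizes: a hidden state of dimension 8 and a vocabulary of 5 words.
rng = np.random.default_rng(0)
hidden_dim, vocab_size = 8, 5
h = rng.normal(size=hidden_dim)                     # hidden representation of the context
V_out = rng.normal(size=(vocab_size, hidden_dim))   # output ("prime") embeddings, one row per word

logits = V_out @ h                                  # h^T V'_W for every word W in the vocabulary
probs = np.exp(logits - logits.max())               # subtract the max for numerical stability
probs /= probs.sum()                                # P(W_i | context) for every candidate W_i

print(probs, probs.sum())                           # a distribution over the vocabulary, sums to 1
```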
10. Classical neural language model
• Proposed by Bengio et al., 2003.
• It is a one-hidden-layer feed-forward neural network that predicts the next word in a sequence.
• The model is trained to maximize the probability computed by the softmax.
• Bengio et al. introduced three concepts:
• Embedding layer: a layer that generates word embeddings by multiplying a one-hot index vector with a word embedding matrix.
11. Classical neural language model continued
• Intermediate layers: one or more layers that produce an intermediate representation of the input, e.g. a fully-connected layer that applies a non-linearity to the concatenation of the word embeddings of the n previous words.
• Softmax layer: the final layer that produces a probability distribution over the words in V.
• The intermediate layer can be replaced with an LSTM.
• The network has a computational bottleneck in the softmax layer, where a probability over the whole vocabulary needs to be computed.
• Neural word embedding models made significant progress with the Word2vec model proposed by Mikolov et al. in 2013.
12. Word2Vec
• It was proposed by Mikolov et al. in 2013.
• It is a shallow, two-layer neural network trained to learn contextual relationships.
• It places contextually similar words near each other.
• It is a co-occurrence based model.
• Two variants of the model were proposed:
• Continuous bag of words model (CBOW)
• Given the context words, predict the center word.
• The order of the context words is not considered, so the representation is similar to BOW.
• Skip-gram model
• Given the center word, predict the context words.
13. What does context mean?
• Context means co-occurrence of words: a sliding window around the word under consideration.
• [Illustration: the sentence "India is now inching towards a self reliant state" is shown repeatedly with a sliding window of size 2; the highlighted word is the word under consideration and the surrounding box is its context window.]
14. CBOW continued
• Goal: predict the center word W_t, given the context words.
• [Architecture diagram: the one-hot vectors of the context words W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2} are each multiplied by a projection matrix P of shape V x D (to be learned); the projected context vectors are averaged into C; C is multiplied by an output projection matrix M of shape D x V; a softmax over C·M is compared against the one-hot vector of W_t using a cross-entropy loss.]
15. CBOW continued
• The one-hot encodings of the context words W_{t-2}, ..., W_{t+2} are the input to the model.
• A projection matrix of shape V x D, where V is the total number of unique words in the corpus and D is the dimension of the dense representation, projects each one-hot encoded vector into a D-dimensional vector.
• The averaged context vector is projected back to the V-dimensional space, and a softmax layer converts it into probabilities for W_t.
• The model is trained using a cross-entropy loss between the softmax output and the one-hot encoded representation of W_t. A small sketch of this forward pass follows below.
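A minimal CBOW forward-pass sketch under assumed toy sizes (V = 6 words, D = 4 dimensions); the matrices are random stand-ins for the parameters that would be learned.

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 6, 4
P = rng.normal(scale=0.1, size=(V, D))    # input projection matrix (V x D), to be learned
M = rng.normal(scale=0.1, size=(D, V))    # output projection matrix (D x V), to be learned

context_ids = [0, 1, 3, 4]                # indices of W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2}
center_id = 2                             # index of the center word W_t

C = P[context_ids].mean(axis=0)           # average of the projected context vectors
logits = C @ M                            # project back to the V-dimensional space
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over the vocabulary

loss = -np.log(probs[center_id])          # cross-entropy against the one-hot target W_t
print(probs.round(3), loss)
```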
16. Skip-gram model: high level
• Goal: predict the context words W_{t-2}, ..., W_{t+2} given the center word W_t.
• [Architecture diagram: the one-hot vector of W_t is multiplied by a projection matrix P of shape V x D (to be learned), giving C = W_t * P; C is multiplied by an output projection matrix M of shape D x V; a softmax over C·M is compared, with a cross-entropy loss, against the one-hot vector of each context word W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2}.]
17. Skip-gram continued
• An end-to-end flow of training (this worked example is taken from Manning's lectures on YouTube):
• [Worked example: the one-hot vector of the center word W_t (V x 1) is multiplied by the transposed embedding matrix (D x V) to give the center-word vector v_c (D x 1); v_c is multiplied by the context matrix (V x D), which is shared across all context-word predictions, to give the scores u_o^T v_c (V x 1); a softmax turns the scores into probabilities, which are compared with the one-hot ground-truth vector of a context word such as W_{t-1}.]
18. Skip-gram continued
• It focuses on optimizing a loss over each (center word, context word) pair:
  P(o | c) = exp(u_o^T v_c) / Σ_{w=1}^{V} exp(u_w^T v_c)
• This is the probability of an output context word o given the center word c.
• The loss function it tries to minimize is:
  J(θ) = -(1/T) Σ_{t=1}^{T} Σ_{-m ≤ j ≤ m, j ≠ 0} log P(W_{t+j} | W_t)
• The log probability inside the sum is computed exactly as in the first equation.
• Naive training is costly because each gradient calculation is of order |V|.
• Two computationally efficient methods were proposed:
• Hierarchical softmax
• Negative sampling
19. Skip-gram continued
• Training with negative sampling is the more prevalent approach.
• In the earlier example, tuples like (India, the) and (India, now) are the true (positive) cases.
• Any corrupted tuple, like (India, reliant) or (India, state), is called a negative sample.
• With a modified objective function, this turns into a logistic regression that classifies a tuple as a true combination or a corrupted one.
• Corrupted tuples are generated by sampling the vocabulary with a smoothed distribution, so that less frequent words are picked more often than their raw frequency would suggest. A small sketch of the resulting objective follows below.
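A minimal sketch of the negative-sampling objective for a single positive pair plus k sampled negatives; all vectors here are random stand-ins.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
D, k = 4, 3
v_center = rng.normal(size=D)             # vector of the center word, e.g. "India"
u_true = rng.normal(size=D)               # vector of a true context word, e.g. "won"
u_negatives = rng.normal(size=(k, D))     # vectors of k sampled corrupt context words

# Logistic-regression style loss: push the true pair's score up,
# push the sampled corrupt pairs' scores down.
loss = -np.log(sigmoid(u_true @ v_center))
loss += -np.sum(np.log(sigmoid(-(u_negatives @ v_center))))
print(loss)
```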
20. Word embedding visualization
• Top 5 similar words (cosine similarity):
• crude: barrel 0.548, oteiba 0.464, netback 0.45, refinery 0.438, pipeline 0.421
• ship: vessel 0.623, port 0.575, tanker 0.496, navigation 0.471, crane 0.463
• computer: software 0.602, micro 0.559, printer 0.542, mainframe 0.538, hemdale 0.527
• Even with a smaller corpus it can capture semantically relevant words.
• [Figure: t-SNE 2-D projection of Word2vec (gensim implementation) embeddings of the top 10 similar words, trained for 50 epochs on the Reuters news corpus from NLTK, with context length 15 and vector dimension 100.]
21. Word2vec results
• Top 5 similar words when the context length is 30:
• crude: barrel 0.475, refinery 0.438, stockdraws 0.427, yates 0.408, utilized 0.382
• ship: vessel 0.557, tanker 0.506, port 0.5, icebreaker 0.461, loaded 0.453
• computer: software 0.569, micro 0.517, memory 0.498, disk 0.495, printer 0.476
• Top 5 similar words when the embedding dimension is 300:
• crude: refinery 0.27, stockdraws 0.254, barrel 0.244, utilized 0.242, liquefied 0.239
• ship: vessel 0.468, crew 0.318, tanker 0.308, shipbuilder 0.302, yard 0.288
• computer: software 0.441, disk 0.345, printer 0.345, uccel 0.338, scientific 0.335
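A sketch of the kind of training setup described on these slides (gensim 4.x API assumed): Word2vec trained on the NLTK Reuters corpus with a window of 15, 100 dimensions and 50 epochs. The skip-gram flag and min_count are assumptions, not taken from the slides.

```python
import nltk
from nltk.corpus import reuters
from gensim.models import Word2Vec

nltk.download("reuters")                               # one-time corpus download

sentences = [[w.lower() for w in sent] for sent in reuters.sents()]
model = Word2Vec(sentences, vector_size=100, window=15,
                 min_count=5, sg=1, epochs=50)         # sg=1 -> skip-gram (assumed)

# Nearest neighbours by cosine similarity, as in the tables above.
for word in ["crude", "ship", "computer"]:
    print(word, model.wv.most_similar(word, topn=5))
```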
22. Analogies
• Analogies can be represented in vector space using word2vec vectors.
• The vector for "king - man + woman" is roughly equivalent to the vector for "queen".
• Using gensim and pretrained word2vec vectors, the 5 words closest to the analogy vector "king - man + woman" are:
• queen 0.7118, monarch 0.619, princess 0.5902, crown_prince 0.55, prince 0.54
• Image taken from https://jalammar.github.io/illustrated-word2vec/
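The analogy query can be reproduced with gensim roughly as follows; the file name below is the commonly distributed Google News word2vec file, so adjust the path to your own copy.

```python
from gensim.models import KeyedVectors

# Load pretrained 300-dimensional Google News vectors (path is an assumption).
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman: positive terms are added, negative terms subtracted.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
# Words such as "queen" and "monarch" are expected near the top of the list.
```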
23. GloVe: Global Vectors for Word Representation
• It is an unsupervised learning method for word representations.
• It is based on a co-occurrence matrix.
• The co-occurrence matrix is built on the whole corpus.
• It is able to capture global context.
• It combines the best of two model families:
• Local context window methods
• Global matrix factorization
• Earlier, matrix factorization methods like LSA were used to reduce the dimensionality.
• The GloVe model captures two things:
• Statistical information, via the co-occurrence matrix
• Context, by considering the neighbouring words
24. GloVe continued
• It moves away from plain matrix factorization: by considering relationship reasoning (semantic and syntactic), GloVe tries to learn the word representations directly.
• It can be represented as a factorization of the co-occurrence matrix:
• [Diagram: word co-occurrence matrix (words × context) ≈ word feature matrix, i.e. the embedding matrix (words × features), multiplied by a feature-context matrix (features × context).]
25. GloVe continued
• How does GloVe learn embeddings?
• It treats word-word co-occurrence probabilities as the signal for a relation between words.
• The authors present an example with "ice" and "steam" as target words.
• "Steam" commonly co-occurs with "gas", and "ice" with "solid".
• Other co-occurring words are "water" and "fashion": "water" shares properties with both targets, while "fashion" is irrelevant to both.
• Only the ratio of probabilities cancels out the noisy words like "water" and "fashion" (their ratio is close to 1).
• As presented in the paper's table, the ratio P(k | ice) / P(k | steam) is large for k = solid and small for k = gas.
26. GloVe continued
• What is the optimization function for GloVe?
• In a co-occurrence matrix X, the entry X_ij is the number of times word j occurs in the context of word i.
• X_i = Σ_k X_ik is the total number of times any word appears in the context of word i.
• P_ij = P(j | i) = X_ij / X_i is the probability that word j appears in the context of word i.
• For a combination of three words i, j, k, a general form of the model is:
  F(W_i, W_j, W̃_k) = P_ik / P_jk
• The optimization function proposed by the authors is:
  J = Σ_{i,j=1}^{V} f(X_ij) (W_i^T W̃_j + b_i + b̃_j − log X_ij)^2
27. GloVe continued
• Here V is the size of the vocabulary, W_i and b_i are the vector and bias of word i, and W̃_j and b̃_j are the context vector and bias of word j. The last term, log X_ij, is the log of the co-occurrence count.
• The weighting function f(X) should have the following properties:
• It tends to zero as X → 0.
• It should be non-decreasing, so that rare co-occurrences are not over-weighted.
• It should not over-weight frequent co-occurrences.
• The chosen f(X) is:
  f(X) = (X / X_max)^α  if X < X_max, and 1 otherwise.
A small sketch of this weighting follows below.
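A minimal sketch of the weighting function and of a single term of the GloVe objective; X_max = 100 and α = 3/4 are the values reported in the GloVe paper, and the vectors are random stand-ins.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X): grows as (X / X_max)^alpha and saturates at 1 for frequent pairs."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

print(glove_weight([0.0, 1.0, 10.0, 100.0, 1000.0]))   # 0 for zero counts, capped at 1

# One term of the objective for a hypothetical word/context pair (i, j).
rng = np.random.default_rng(3)
w_i, w_j = rng.normal(size=50), rng.normal(size=50)    # word vector and context vector
b_i, b_j = 0.0, 0.0                                    # biases
x_ij = 42.0                                            # co-occurrence count
term = glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2
print(term)
```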
28. GloVe continued
• The model has the following computational bottlenecks:
• Building a large co-occurrence matrix of size V × V.
• The model's computational complexity depends on the number of non-zero elements of that matrix.
• During training the context window needs to be sufficiently large, and the model distinguishes left context from right context.
• Words that are more distant from each other contribute less to the count, because distant words contribute less to the relationship between the words.
• The model generates two sets of vectors, W and W̃; their average is used as the final word representation.
29. GloVe results
• [Figure: t-SNE 2-D projection of GloVe embeddings of the top 5 similar words, trained for 50 epochs on the Reuters news corpus from NLTK, with context length 15 and vector dimension 100.]
• Top 5 most relevant words:
• crude: barrel 0.752, posting 0.58, raise 0.537, light 0.505, sour 0.502
• ship: loading 0.58, kuwaiti 0.54, missile 0.537, vessel 0.522, flag 0.522
• computer: wallace 0.595, software 0.592, microfilm 0.559, microchip 0.536, technology 0.52
30. Are Word2vec and GloVe enough?
• Neither embedding can deal with out-of-vocabulary words.
• Both can capture context, but only in a limited sense.
• They always produce a single embedding for the word in consideration.
• They can't distinguish:
• "I went to a bank." and "I was standing at a river bank."
• They will always produce the same representation for both contexts.
• Both give better performance than encodings like tf-idf, count vectors, etc.
• Do pretrained models help?
• Models pretrained on a huge corpus show better performance than those trained on a small corpus.
• A pretrained Word2vec model is available from Google2 and pretrained GloVe vectors are available on Stanford's website1.
1. https://nlp.stanford.edu/projects/glove/
2. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
31. fastText
• It was proposed by the Facebook AI team.
• It was primarily meant to handle the out-of-vocabulary issue of GloVe and Word2vec.
• It is an extension of Word2vec.
• The model relies on character n-grams rather than whole words to generate the embeddings.
• It therefore exploits the morphological features of a word.
• The character n-grams of a word can be represented as below:
• For the word <where> and n = 3, the character n-grams are:
• <wh, whe, her, ere, re>
• The final representation of the word "where" is the sum of the vector representations of <wh, whe, her, ere, re>.
32. fastText continued
• The modified scoring function is:
  S(w, c) = Σ_{g ∈ G_w} z_g^T v_c
• where G_w is the set of n-grams of word w, z_g is the vector representation of n-gram g, and v_c is the context vector.
• Learning n-gram vectors enables the model to build representations for out-of-vocabulary words as well. A small sketch of the subword handling follows below.
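A minimal sketch of fastText-style subword handling: extract the character n-grams of a word (with the boundary markers "<" and ">") and sum their vectors. The n-gram embedding table here is a random stand-in.

```python
import numpy as np

def char_ngrams(word, n=3):
    token = f"<{word}>"
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(char_ngrams("where"))          # ['<wh', 'whe', 'her', 'ere', 're>']

# Hypothetical n-gram embedding table; an unseen word can still get a vector
# because its n-grams are likely to have been seen during training.
rng = np.random.default_rng(4)
ngram_vectors = {g: rng.normal(size=10)
                 for g in char_ngrams("where") + char_ngrams("wherever")}

def word_vector(word, n=3):
    grams = [g for g in char_ngrams(word, n) if g in ngram_vectors]
    return np.sum([ngram_vectors[g] for g in grams], axis=0)

print(word_vector("where")[:3])      # sum of the word's n-gram vectors
```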
33. fastText results
• [Figure: t-SNE 2-D projection of fastText embeddings (gensim implementation) of the top 15 similar words, trained for 50 epochs on the Reuters news corpus from NLTK, with context length 15 and vector dimension 100.]
• Top 5 most relevant words:
• crude: cruz 0.582, barrel 0.561, cruise 0.501, crumble 0.433, jude 0.41
• ship: shipyard 0.714, steamship 0.703, shipowner 0.688, shipper 0.668, vessel 0.667
• computer: supercomputer 0.843, computerized 0.823, computerland 0.773, software 0.54, microfilm 0.52
34. Observation
• None of these representations is contextual, i.e. a representation that depends on how the word is actually used.
• They are based on a dictionary look-up to get the embeddings.
• They show limited performance on tasks such as question answering and summarization compared to current state-of-the-art models like ELMo (LSTM based) and BERT (transformer based).
35. ELMo
• Language is complex; the meaning of a word can vary from context to context.
• E.g.
• I went to a bank to deposit some money.
• I was standing at a river bank.
• The two instances of "bank" have separate meanings.
• Earlier models assign the same representation to the word in both scenarios.
• A solution would be to have multiple levels of understanding of the text.
• A new model which can capture context:
• ELMo: Embeddings from Language Models
36. ELMo continued
• It is based on a deep learning framework.
• It generates a contextualized embedding for each word.
• It models complex characteristics of words (e.g. syntactic and semantic features).
• It models variation across linguistic contexts, e.g. polysemy.
• The authors argue that it captures more abstract linguistic characteristics in the higher layers.
• It is based on a bi-directional LSTM model.
• The bi-directional model helps to capture dependencies on both past and future words.
37. ELMo architecture
• [Block diagram of ELMo: the word embedding x_i feeds into L stacked bidirectional LSTM layers, topped by a softmax layer.]
• The number of LSTM layers in the original implementation was two.
• The input word embedding is calculated by a character-level CNN.
• Final embeddings are generated as a weighted sum of the hidden layers and the embedding layer.
• Three different representations can be obtained:
• Hidden layer 1
• Hidden layer 2
• A weighted sum of the hidden layers and the embedding layer
• Each block is described in the coming slides.
38. ELMo input embedding
• The input embedding is generated by a combination of a character CNN and a highway network.
• [Diagram: for the word "India", character embeddings → CNN with max pooling → 2-layer highway network → input to the LSTM.]
• https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo/
39. Character CNN embedding and highway network
• In the first step a look-up is performed to get the character embeddings.
• A 1-D convolution is applied on the embeddings, followed by a max pooling layer.
• The highway network acts as a gate that determines how much of the original information passes straight to the output and how much goes via the projection.
• The generated output is passed as input to the 2-layer LSTM structure.
• In the original paper there were two highway layers.
• Character-level embeddings enable the model to build a representation for any word, so it can handle out-of-vocabulary words as well.
• Source: http://web.stanford.edu/class/cs224n/
40. LSTM layer and embedding
• The architecture has bi-directional LSTMs that predict the next word from both directions, forming a biLM. [Diagram: the input representation x_k and the layer outputs h^LM_{k,0}, h^LM_{k,1}, h^LM_{k,2} are combined into the final embedding.]
• Two separate LSTM stacks implement a language model in each direction.
• The top layer of the forward LM predicts the next token using a softmax layer.
• Similarly, the backward LM predicts the previous token using its softmax scores.
• Each of the L forward LSTM layers generates a contextualized forward representation h^LM_{k,j} for token t_k, where j = 1, 2, ..., L.
41. LSTM layer and embedding continued
• Similarly, each of the L backward LSTM layers generates a contextualized backward representation h^LM_{k,j} for token t_k, where j = 1, 2, ..., L.
• In total 2L + 1 representations are generated: 2L by the hidden layers and one by the embedding layer.
• The final representation is a weighted combination of the concatenated hidden vectors and the embedding layer. A small sketch of this weighted combination follows below.
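A minimal sketch of the ELMo-style weighted combination of layer representations, with toy numbers: softmax-normalized weights over the embedding layer and the two biLSTM layers, plus a task-specific scale. The weight values here are made up; in practice they are learned for the downstream task.

```python
import numpy as np

rng = np.random.default_rng(5)
L, dim = 2, 6
# h_{k,0} (embedding layer) and h_{k,1}, h_{k,2} (biLSTM layers) for one token.
layers = [rng.normal(size=dim) for _ in range(L + 1)]

s_raw = np.array([0.2, 1.0, 0.5])          # task-specific layer weights (learned in practice)
s = np.exp(s_raw) / np.exp(s_raw).sum()    # softmax-normalized weights
gamma = 1.0                                # task-specific scale (also learned)

elmo_vector = gamma * sum(w * h for w, h in zip(s, layers))
print(elmo_vector)
```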
42. ELMo continued
• Representations of the word "bank" in different contexts, with the vectors projected into 2-D space.
• ELMo can generate a different embedding for a word depending on the context.
• The projection is based on the average of the hidden layers' and the embedding layer's representations from the pretrained model on TensorFlow Hub.
• It can be tuned to perform different tasks like coreference resolution, sentiment analysis and question answering.
43. ELMo continued
• From the results presented by the authors, higher layers tend to capture semantic features and lower layers capture syntactic features.
• The second layer's embeddings outperform the first layer's embeddings on the word sense disambiguation task, which is semantic in nature. On the other hand, on the POS tagging task, the first layer's embeddings outperform the second layer's embeddings.
44. Bidirectional Encoder Representations from Transformers (BERT)
• A transformer-based model that learns contextual representations.
• It is designed to pre-train deep bidirectional representations from unlabeled text by considering both left and right context.
• It is a pre-trained model.
• It uses the concept of transfer learning, similar to ELMo.
• Any use of BERT is a two-step process:
• Train a large language model on a huge amount of data, using an unsupervised or semi-supervised method.
• Fine-tune the large model for a specific NLP task.
• Before we delve further into BERT, the following concepts need to be understood:
• Attention mechanism
• The transformer architecture proposed by Vaswani et al.
45. Attention mechanism
• This concept was brought to NLP from computer vision.
• Attention was first applied in NLP by Bahdanau et al. in 2015, based on an additive mechanism.
• It is similar to the way our brain processes an image: we focus on some parts and infer the rest from them, because not every part of an image carries the same amount of information.
• In sentence processing as well, we attend to relations between some of the words while others get low attention.
• [Illustration: in the sentence "I was going to a crowded market", some word pairs receive a high level of attention and others a low level.]
46. Attention mechanism continued
• In the above example, we tend to attend to "market" in the context of "crowded" and to "going" in the context of "market", while other word pairs get less attention.
• In NLP, the attention mechanism was first employed in the neural machine translation task.
• Seq-to-seq (encoder-decoder) models have an issue with longer sentences: they fail to remember relations between distant words.
• The attention mechanism was designed to capture long-distance relationships between words.
• It can be understood as a vector of importance weights.
• The next slide lays out the basics of the attention mechanism.
47. Attention mechanism continued
• [Diagram: a query is matched against key_1 ... key_4 to produce score_1 ... score_4; the scores are normalized with a softmax into weights a_1 ... a_4, which multiply the values v_1 ... v_4; the attention value is Σ_i a_i v_i.]
• The intuition is like a query fired at a database to extract some information: the query is matched against each key and generates a similarity score.
• The query and each key_i are vectors of some dimension d.
• The scoring function can be of different types:
• Simple dot product: q^T k_i
• Scaled dot product: q^T k_i / √d
• General (bilinear) dot product: q^T W k_i
• Additive attention: W_k^T k_i + W_q^T q
• The weights a_i are calculated using a softmax over the scores, and the attention value is the weighted sum Σ_i a_i v_i.
48. Attention mechanism continued
• The final attention value is a vector, while each a_i and score_i is a scalar.
• The general framework of attention generates a weighted combination of the value vectors.
• Self-attention is also known as intra-attention: it relates different words of a single sequence to generate a representation of that same sequence.
• With this basic understanding of attention, we can move to the transformer architecture proposed by Vaswani et al. A small sketch of scaled dot-product attention follows below.
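A minimal sketch of (single-head) scaled dot-product self-attention on random toy inputs: scores = Q·K^T / √d, weights = softmax(scores), output = weights·V.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(6)
seq_len, d = 5, 8
Q = rng.normal(size=(seq_len, d))    # one query per word
K = rng.normal(size=(seq_len, d))    # one key per word
V = rng.normal(size=(seq_len, d))    # one value per word

scores = Q @ K.T / np.sqrt(d)        # scaled dot-product scores
weights = softmax(scores, axis=-1)   # "attention filter": each row sums to 1
output = weights @ V                 # weighted combination of the value vectors

print(weights.round(2))
print(output.shape)                  # (5, 8): one attended representation per word
```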
49. Transformer model
• Image source: https://arxiv.org/pdf/1706.03762.pdf
• The transformer architecture was proposed by the Google AI team.
• It is an encoder-decoder architecture.
• The core modules of this architecture are:
• Multi-head attention
• Positional encoding
• Attention mechanism
• Masked multi-head attention
• Residual connections
• This model solves the issues with recurrent networks:
• Failure to capture long-distance word-to-word relations
• RNNs are sequential in nature, whereas this architecture can be parallelized.
• Each attention head learns a different set of features.
• The model does away with recurrence altogether.
50. Input embedding and positional encoding
• Embeddings are collected from a pre-trained model using a dictionary look-up.
• Unlike an RNN, which consumes the input sequentially, the transformer takes the whole sentence as input at once.
• Without positional information it would be similar to a bag-of-words model.
• How is the positional information calculated? The positional embedding uses an alternating combination of sine and cosine:
  PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
• Why use periodic functions with varying frequency?
• This encoding method generates a different encoding for each position, and the relationship between two time steps stays consistent.
• The position encoding is deterministic.
• The authors' view is that this encoding ensures any PE_{pos+k} can be represented as a linear function of PE_{pos}.
A small sketch of this encoding follows below.
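A minimal sketch of the sinusoidal positional encoding described above: even dimensions use sine, odd dimensions use cosine, with frequencies 1 / 10000^(2i/d).

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                   # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]                # dimension-pair index
    angles = pos / np.power(10000.0, 2 * i / d_model)   # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                        # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16): one positional vector per position
# The positional vector is simply added to the word embedding at that position.
```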
51. Input embedding and positional encoding continued
• [Diagram: each embedding vector e_i is added to its positional vector p_i to give the combined embedding ep_i = e_i + p_i.]
• The embedding vector is combined with the positional vector, and the resulting embedding is fed into the attention layer. In this way positional information is combined with the word embeddings.
52. Multi-head attention
• [Diagram: the Query, Key and Value inputs each pass through a Linear layer; MatMul (Q·K^T) → Scale → Softmax → MatMul with V; the outputs of all heads are concatenated and passed through a final Linear layer.]
53. Multi-head attention continued
• A linear layer maps the input to the output and can also change the vector dimension; its weights are learned during training.
• In the original paper, the 512-dimensional input was projected into a 64-dimensional vector space per head.
• What do we feed into query, key and value? At the start, the same copies of the word vectors are passed as Query (Q), Key (K) and Value (V).
• The MatMul operation generates a matrix that is the precursor to the attention filter. The values are scaled by 1/√d.
• Finally, the softmax layer generates probabilities across all the words.
• The result is a matrix called the "attention filter".
54. Multi-head attention continued
• The probability matrix is multiplied with V, which generates a representation based on the attention scores.
• This whole process is a self-attention mechanism that generates an encoding with a scaled dot-product scoring function.
• The above explanation corresponds to a single head; there can be many such heads. The original model has 8 heads.
• Each head's attention filter learns different features.
• The representations generated by the several heads are concatenated and passed to the next stacked encoder.
• The final representation from the encoder stack is fed into the decoder stack.
55. Add and Norm layer
• Residual connections are placed in the model for:
• Knowledge preservation
• Handling the vanishing gradient problem
• In this model, the normalization part is slightly tricky.
• The mean and standard deviation are calculated for each word's representation (layer normalization).
• Using the calculated mean and standard deviation, the layer's values are normalized.
56. Masked multi-head attention
• It is an important feature of the decoder stack.
• Why do we need masking?
• When generating output, the model should not pay attention to future words, because those words have not been predicted yet. So they are masked.
• Masking takes place by setting the scores of future words to -∞, which ensures that the softmax, being an exponential, becomes zero for those words.
• Does this model have recurrence?
• At a cursory glance it appears so: in the decoder, we feed the previous token as input to the model.
• But it is trained using a concept called teacher forcing: when the output is known, we can directly supply the output representation to the model.
• By doing so the model can be parallelized as well.
• The base model has 8 heads, a 6-layer encoder-decoder stack, and keys and values of dimension 64.
57. Did the transformer change the landscape of NLP?
• One of the best-known models, GPT, is based on the transformer architecture.
• Current models like BERT and its variants, and GPT-2, are based on the transformer architecture.
• It started to take over the space of RNNs because it can be parallelized and can capture context.
• It beat the benchmarks in the NMT task.
58. Getting back to BERT
• Proposed by Devlin et al. in 2019.
• It is an encoder-based model.
• It is based on the transformer architecture discussed in the previous slides, using only the encoder stack.
• It is an unsupervised or semi-supervised pre-trained model, fine-tuned for a specific task like Q&A, conversational AI, etc.
• It is a sub-word model with a vocabulary of about 30,000 tokens. The BERT tokenizer tokenizes the words, so the representation of a tokenized word may not correspond one-to-one to what we passed as input.
• E.g. the word "embeddings" is tokenized into ['em', '##bed', '##ding', '##s']. (A tokenizer sketch follows below.)
• This approach helps to address out-of-vocabulary words as well.
• The two most common pre-trained architectures are BERT_base and BERT_large.
• BERT_base has 12 stacked encoders and BERT_large has 24 stacked encoders.
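The sub-word split can be reproduced with the HuggingFace transformers tokenizer roughly as follows; the model name "bert-base-uncased" is assumed here.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("embeddings"))
# ['em', '##bed', '##ding', '##s']

# Calling the tokenizer on a sentence pair adds [CLS] and [SEP] automatically
# and returns the segment ids (token_type_ids) mentioned on the next slide.
encoded = tokenizer("I went to a bank.", "I was standing at a river bank.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens
```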
59. BERT continued
• The base version has 12 attention heads and the large version has 16 attention heads.
• By comparison, the original transformer configuration had 6 encoder layers and 8 attention heads, leading to a 512-dimensional representation.
• How do we collect representations from BERT?
• BERT uses two special tokens, [CLS] and [SEP].
• [CLS] is always the first token of the input.
• [SEP] marks the sentence segmentation.
• We also need to provide the segment ids in the input.
• As in the original transformer model, the encoded embeddings are passed through the subsequent encoders.
• Each position outputs a vector representation of size 768 for BERT_base and 1024 for the large model.
60. BERT continued
• For a classification task, we only focus on the embedding of the [CLS] token.
• What would the representation be for other tasks?
• There are several variants of embedding collection.
• BERT_base with 12 encoder layers generates 12 + 1 embeddings, one of them from the input layer.
• Which layers should we use: all or some?
• Several experiments have been performed. The best performance was obtained by concatenating the last 4 layers' representations.
• The next best option is an average of the last 4 layers' representations.
• Each layer learns different features, so the pooling strategy depends on the specific NLP task. The above two suggestions are based on performance on a NER tagging task. A small sketch of this pooling follows below.
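A sketch of collecting all hidden states from a pre-trained BERT and pooling the last four layers (HuggingFace transformers and PyTorch assumed; the model name is again "bert-base-uncased").

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("I was standing at a river bank.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states       # 13 tensors: embedding layer + 12 encoder layers
print(len(hidden_states))                   # 13 for BERT-base

# Concatenate the last 4 layers per token: shape (batch, seq_len, 4 * 768).
concat_last4 = torch.cat(hidden_states[-4:], dim=-1)
# Alternative pooling: average the last 4 layers instead of concatenating them.
mean_last4 = torch.stack(hidden_states[-4:], dim=0).mean(dim=0)
print(concat_last4.shape, mean_last4.shape)
```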
61. BERT architecture
• As proposed in the original paper, the BERT input embedding is the sum of the token embeddings, segment embeddings and position embeddings.
• Image source: https://arxiv.org/pdf/1810.04805.pdf
62. BERT pre-training
• BERT is pre-trained using two objectives.
• Masked LM:
• Unlike the masking discussed for the transformer architecture, here some random words are replaced with a special [MASK] token: approximately 15% of all sub-word tokens.
• The prediction can then be defined as a language model conditioned on both the left and right context: P(t_i | t_1, t_2, ..., t_{i-1}, t_{i+1}, ..., t_n).
• The masking strategy is further divided into three parts (see the sketch below):
• 80% of the selected instances become the [MASK] token
• 10% of the time the token is replaced with a random word
• 10% of the time the token is left unchanged
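A minimal sketch of the 80/10/10 masking strategy on a toy token list; "[MASK]" and the random-word pool are stand-ins rather than a real BERT vocabulary.

```python
import random

random.seed(0)
tokens = ["i", "went", "to", "a", "bank", "to", "deposit", "some", "money"]
vocab = ["river", "market", "chocolate", "ship", "crude"]   # hypothetical vocabulary

masked, labels = [], []
for tok in tokens:
    if random.random() < 0.15:                   # select ~15% of tokens for prediction
        labels.append(tok)                       # the model must recover the original token
        r = random.random()
        if r < 0.8:
            masked.append("[MASK]")              # 80%: replace with the [MASK] token
        elif r < 0.9:
            masked.append(random.choice(vocab))  # 10%: replace with a random word
        else:
            masked.append(tok)                   # 10%: keep the token unchanged
    else:
        labels.append(None)                      # not a prediction target
        masked.append(tok)

print(masked)
print(labels)
```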
63. BERT pre-training continued
• Next sentence prediction:
• This is primarily a classification task: does sentence B follow sentence A or not?
• The training set has 50% positive instances where sentence B does follow sentence A, and 50% negative instances where sentence B is replaced with some random sentence.
• This type of training is helpful for Q&A tasks, where question and answer are represented as a pair.
• Pre-trained BERT embeddings became the new state-of-the-art word representation for most NLP tasks.
• The original model outperformed the previous SOTA benchmarks.
64. BERT attention layer visualization
• Two sentences are taken as input:
• "Who does not like chocolate"
• "Even a grown up would want to have a nice bite"
• Using the BertViz tool1, the attention from the second sentence to the first sentence is visualized.
• [Figures: attention by head 11 of layer 11, and attention by head 1 of layer 11.]
65. BERT attention visualization
• It appears that different heads, even in the same layer, can capture different relations between the sentences. In head 11 all words attended to "chocolate", while in head 0 the attention was spread over most of the words.
• In the attention within the same sentence, almost all the words attended to "bite"; the attention from "want", "have" and "nice" towards "bite" is higher.
• We observe that different layers capture different features. The type of feature captured by each head may not be describable in exact terms like syntactic or semantic.
66. BERT pre-trained models
• Several pretrained models are available from the HuggingFace1 team.
• Is BERT_large, with >300M parameters, big?
• Can we squeeze the performance into a smaller model?
• Should we train bigger models with more layers and attention heads?
• Both questions have the same answer: yes.
• Two models, DistilBERT and GPT-2-XL, answer these questions.
• DistilBERT is a smaller model with similar performance.
• GPT-2-XL has 48 layers!!!
• A better training strategy can give a better result: RoBERTa_large is a robustly trained version of BERT_large; similarly, there is a base version of RoBERTa.
1. https://huggingface.co/transformers/pretrained_models.html
67. DistilBERT
• Knowledge distillation is a technique in which a smaller model is trained to mimic the behaviour of a larger model.
• It is sometimes called teacher-student learning, where the student is the smaller model and the teacher is the bigger model.
• It was generalized by Hinton et al.
• The student is trained to learn the full output distribution of the teacher.
• Training changes slightly: the student is not trained against the gold labels but against the probabilities of the teacher:
  L_ce = −Σ_i t_i · log(s_i)
  where t_i are the teacher's probabilities and s_i the student's.
68. DistilBERT continued
• How was the model trained?
• The model was trained with a distillation loss based on the Kullback-Leibler divergence, which measures the divergence between two probability distributions:
  KL(t || s) = Σ_i t_i · log(t_i) − Σ_i t_i · log(s_i)
• The overall loss is a linear combination of the masked-LM loss and the distillation loss (see the sketch below).
• Model parameter changes:
• The next-sentence classification objective was dropped compared to the original version.
• The number of layers was reduced by a factor of two.
• Did this affect the performance?
• Yes, it did. Still, it is able to retain about 95% of the performance of the original BERT.
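A minimal numerical sketch of such a distillation-style loss on one masked position, with toy logits: the KL term pulls the student distribution towards the teacher's, and is linearly combined with the usual masked-LM loss. The mixing weight alpha is an assumption, not a value from the slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = np.array([2.0, 1.0, 0.2, -0.5, 0.0])
student_logits = np.array([1.5, 0.8, 0.1, -0.2, 0.3])
t, s = softmax(teacher_logits), softmax(student_logits)

kl = np.sum(t * np.log(t)) - np.sum(t * np.log(s))   # KL(t || s)
gold_index = 0                                       # hypothetical gold token id
mlm_loss = -np.log(s[gold_index])                    # standard masked-LM loss

alpha = 0.5                                          # mixing weight (assumed)
total_loss = alpha * kl + (1 - alpha) * mlm_loss
print(kl, mlm_loss, total_loss)
```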
69. DistilBERT continued
• Another trick to capture the performance of the teacher was weight initialization: the student's layers were initialized with weights already learned by the teacher.
• It was trained on larger batches with the masked language model objective, similar to the original BERT method.
• Visualization of attention in DistilBERT:
• In this case as well, the word "chocolate" was attended to by relevant words like "bite".
70. RoBERTa
• It was proposed by the Facebook AI team.
• It is a training strategy for learning a better representation compared to BERT.
• Two important points compared to the original pre-training:
• Static vs. dynamic masking: in the original paper, the [MASK] tokens were fixed statically before training. In this work, the data was duplicated 10 times so that different masking patterns could be observed for the same context. This did not improve the results.
• Training with a higher batch size than the original pre-training led to better accuracy.
• This model also dropped the next sentence prediction objective used in the original BERT.
71. RoBERTa continued
• The model was trained on a huge amount of data, ~160 GB.
• It was trained with a batch size of 8K, compared to a batch size of 256 in BERT.
• Finally, it was trained for a longer duration.
• It was able to beat the SOTA on different tasks. On GLUE it surpassed XLNet. It appears that proper training can give better results.
72. RoBERTa visualization
• Architecturally it is similar to BERT. Attention head 4 of layer 5 attends to the word "chocolate" from "like" and "bite", which matches a human understanding of the sentence.
• Layer 4 captured attention similar to layer 10 of BERT, which we have already seen.
• This supports the view that different features are captured at different layers.
73. sBERT
• It is a fine-tuned BERT model that generates better representations for sentences, so that performance on common similarity measures improves.
• On seven semantic textual similarity tasks, even averaged GloVe representations performed better than averaged BERT encodings.
• This model fine-tunes BERT for sentence similarity.
• It is trained using Siamese and triplet networks.
• In a Siamese network, two networks with the same architecture are placed side by side with tied weights.
• The tuned models are task specific.
• Classification task:
• The learned representations u and v from the Siamese BERT network are concatenated with the element-wise difference of u and v, i.e. |u − v|.
74. sBERT continued
• The concatenated representation is multiplied with a matrix W_t, whose weights are learned to increase the classification accuracy. (A small sketch follows below.)
• The other two scenarios are:
• A regression task, where the last layers are replaced with the cosine similarity between the two vectors; the objective function is mean squared error.
• A triplet objective, applied when sentence a has a positive relation with sentence p and a negative relation with sentence q: the loss tries to place a closer to p and farther from q.
• Image source: https://arxiv.org/pdf/1908.10084.pdf
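A minimal sketch of the classification and regression objectives described above, with toy pooled sentence vectors standing in for the BERT outputs; the sizes and matrix are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(7)
dim, n_labels = 6, 3                      # toy sizes; SBERT uses 768-dim pooled BERT vectors
u = rng.normal(size=dim)                  # pooled embedding of sentence A
v = rng.normal(size=dim)                  # pooled embedding of sentence B

# Classification objective: softmax(W_t · (u, v, |u - v|)).
features = np.concatenate([u, v, np.abs(u - v)])          # shape (3 * dim,)
W_t = rng.normal(scale=0.1, size=(n_labels, 3 * dim))     # trainable matrix W_t
print(softmax(W_t @ features))                            # class probabilities

# Regression objective: cosine similarity between u and v, trained with MSE.
cosine = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cosine)
```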
75. sBERT training
• The model is trained with different hyperparameters and strategies:
• Pooling strategy:
• They tried to pool the BERT embeddings into a sentence representation using three strategies: MAX, MEAN and [CLS].
• MEAN shows the best performance.
• For the classification task the vectors u and v were concatenated in different ways, but the best performance was achieved when |u − v| was concatenated with u and v.
• Observations:
• Fine-tuning is required and is task specific.
• Transfer learning of BERT can be used; this is fine-tuning of BERT for a specific task.
76. Pre-trained models
• Most of the major models are very large and take a lot of computational resources to train. These models are open-sourced by researchers or tech companies.
• This enables other researchers to use transfer learning and fine-tune them for their task.
• The most recent model in the GPT series (GPT-3) has not been open-sourced.
• Another version of ELMo, based on the transformer architecture, has also been launched.
• Models keep getting heavier and heavier: a model by NVIDIA (Megatron-LM) has 8300M parameters!!
• So, word embeddings are still evolving. But BERT and ELMo were the 'VGG16' moment for NLP!!!