Machine Learning Explained
Deep learning, python, data wrangling and other machine learning related topics explained for
practitioners
Paper Dissected: “GloVe: Global
Vectors for Word Representation”
Explained
Pre-trained word embeddings are a staple in deep learning for NLP. The pioneer
of word embeddings in mainstream deep learning is the renowned word2vec.
GloVe is another commonly used method of obtaining pre-trained
embeddings. Unfortunately, there are very few practitioners that seem to actually
understand GloVe; many just consider it “another word2vec”. In reality, GloVe is
a much more principled approach to word embeddings that provides deep
insights into word embeddings in general.
The GloVe paper is a must-read for any self-respecting NLP engineer, but it can seem
intimidating at first glance. This post aims to dissect and explain the paper for
engineers and to highlight the differences and similarities between GloVe and
word2vec.
 
TL;DR
GloVe aims to achieve two goals:
(1) Create word vectors that capture meaning in vector space
(2) Take advantage of global count statistics instead of only local information
Unlike word2vec – which learns by streaming over sentences – GloVe learns from
a co-occurrence matrix and trains word vectors so that their differences
predict co-occurrence ratios
GloVe weights the loss based on word frequency
Somewhat surprisingly, word2vec and GloVe turn out to be extremely similar,
despite being derived from entirely different starting points
Motivation and Brief Review of Word Embeddings
The Problem with Word2vec
GloVe was published after word2vec, so the natural question to ask is: why is
word2vec not enough?
In case you are not familiar, here’s a quick explanation of word2vec. Word2vec
trains word embeddings by optimizing a loss function with gradient descent, just
like any other deep learning model. In word2vec, the loss function is computed by
measuring how well a certain word can predict its surrounding words. For
instance, take the sentence
“The cat sat on the mat”.
Suppose the context window is size 2, and the word we are focusing on is “sat”.
The context words are “the”, “cat”, “on” and “the”. Word2vec tries to either
predict the word in focus from the context words (this is called the CBOW model)
or the context words using the word in focus (this is called the Skip-gram model).
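To make this concrete, here is a minimal sketch (my own illustration, not the actual word2vec code) of how (focus, context) training pairs would be extracted with a window size of 2; skip-gram predicts the second element from the first, and CBOW predicts the focus word from its context.

```python
# Minimal sketch (not the actual word2vec code): extract (focus, context) training
# pairs with a context window of 2. Skip-gram predicts context from focus;
# CBOW predicts the focus word from its context.
from typing import List, Tuple

def training_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    pairs = []
    for i, focus in enumerate(tokens):
        # context = up to `window` words on each side of the focus word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

print(training_pairs("the cat sat on the mat".split()))
# For the focus word "sat", the context words are "the", "cat", "on", "the".
```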
I won’t go into the details here since there are plenty of explanations online. The
important thing to note is that word2vec only takes local contexts into account.
It does not take advantage of global count statistics. For example, “the” and
“cat” might be used together often, but word2vec does not know if this is because
“the” is a common word or if this is because the words “the” and “cat” have a
strong linkage (well actually, it indirectly does, but this is a topic I’ll cover later
on in this post). This is the motivation for using global count statistics.
The problem with global count statistic based methods
Before word2vec, using matrices that contained global count information was one
of the major ways of converting words to vectors (yes, the idea of word
vectors/embeddings existed far before word2vec). For instance, Latent Semantic
Analysis (LSA) computed word embeddings by decomposing term-document
matrices using Singular Value Decomposition. Though these methods take
advantage of global information, the obtained vectors do not show the same
behavior as those obtained by word2vec. For instance, in word2vec, word
analogies can be expressed in terms of simple (vector) arithmetic such as in the
case of “king – man + woman = queen”.
This behavior is desirable because it shows that the vectors capture dimensions
of meaning. Some dimensions of the vectors might capture whether the word is
male or female, present tense or past tense, plural or singular, etc. This means
that downstream models can easily extract the meaning from these vectors,
which is what we really want to achieve when training word vectors.
 
GloVe aims to take the best of both worlds: take global information into account
while learning dimensions of meaning. From this initial goal, GloVe builds up a
principled method of training word vectors.
Building GloVe, Step by Step
Data Preparation
In word2vec, the vectors were learned so that they could achieve the task of
predicting surrounding words in a sentence.  During training, word2vec streams
the sentences and computes the loss for each batch of words.
In GloVe, the authors take a more principled approach. The first step is to build a
co-occurrence matrix. GloVe also takes local context into account by computing
the co-occurrence matrix using a fixed window size (words are deemed to co-
occur when they appear together within a fixed window). For instance, the
sentence
“The cat sat on the mat”
with a window size of 2 would be converted to the co-occurrence matrix below:
        the  cat  sat  on   mat
the      2    1    2    1    1
cat      1    1    1    1    0
sat      2    1    1    1    0
on       1    1    1    1    1
mat      1    0    0    1    1
Notice how the matrix is symmetric: this is because when the word “cat” appears
in the context of “sat”, the opposite (the word “sat” appearing in the context of
“cat”) also happens.
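As a concrete illustration, here is a minimal sketch (mine, not the released GloVe tooling, which builds the matrix far more efficiently and weights co-occurrences by their distance) of counting symmetric co-occurrences within a fixed window:

```python
# Minimal sketch of building the symmetric co-occurrence matrix X from a token
# stream, counting pairs that appear within a fixed window of 2.
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    X = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):  # look left only, then mirror
            X[(word, tokens[j])] += 1.0
            X[(tokens[j], word)] += 1.0         # symmetry: "cat" near "sat" implies "sat" near "cat"
    return X

X = cooccurrence_counts("the cat sat on the mat".split())
print(X[("sat", "the")])  # 2.0 -- "the" appears twice within two words of "sat"
```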
What Should we Predict?
Now, the question is how to connect the vectors with the statistics computed
above. The underlying principle behind GloVe can be stated as follows: the co-
occurrence ratios between two words in a context are strongly connected to
meaning.
This sounds difficult but the idea is really simple. Take the words “ice” and
“steam”, for instance. Ice and steam differ in their state but are the same in that
they are both forms of water. Therefore, we would expect words related to water
(like “water” and “wet”) to appear equally in the context of “ice” and “steam”.
In contrast, words like “cold” and “solid” would probably appear near “ice” but
would not appear near “steam”.
The following table from the paper shows actual statistics that illustrate this intuition well:

[Table 1 of the GloVe paper: co-occurrence probabilities for the target words “ice” and “steam” with the probe words “solid”, “gas”, “water”, and “fashion”. The ratio P(k|ice)/P(k|steam) is much larger than 1 for “solid”, much smaller than 1 for “gas”, and close to 1 for “water” and “fashion”.]

The probabilities shown here are basically just counts of how often the word $k$
appears when the words “ice” and “steam” are in the context, where $k$ refers to
the words “solid”, “gas”, “water”, and “fashion”. As you can see, words that are
related to the nature of “ice” and “steam” (“solid” and “gas” respectively) occur
far more often with their corresponding word than with the non-corresponding word.
In contrast, words like “water” and “fashion”, which are not particularly related
to either, have a probability ratio near 1. Note that these probability ratios can be
computed easily from the co-occurrence matrix.
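To make the connection concrete, here is a small sketch (the counts are toy numbers invented purely to show the computation) of reading P(k|i) = X_ik / X_i and the probability ratio off a co-occurrence matrix stored as a dict of dicts:

```python
# Sketch: compute P(k|i) = X_ik / X_i and the ratio P(k|ice) / P(k|steam)
# from a co-occurrence matrix. The counts below are toy numbers, not real data.
def context_prob(X, i, k):
    X_i = sum(X[i].values())          # total words observed in the context of i
    return X[i].get(k, 0) / X_i

X = {
    "ice":   {"solid": 80, "gas": 5,  "water": 120, "fashion": 2},
    "steam": {"solid": 6,  "gas": 90, "water": 110, "fashion": 2},
}
for k in ["solid", "gas", "water", "fashion"]:
    ratio = context_prob(X, "ice", k) / context_prob(X, "steam", k)
    print(f"P({k}|ice) / P({k}|steam) = {ratio:.2f}")
# Large for "solid", small for "gas", and close to 1 for "water" and "fashion".
```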
 
From here on, we’ll need a bit of mathematical notation to make the explanation
easier. We’ll use $X$ to refer to the co-occurrence matrix and $X_{ij}$ to refer to the
$(i, j)$-th element of $X$, which is equal to the number of times word $j$ appears in the
context of word $i$. We’ll also define $X_i = \sum_k X_{ik}$ to refer to the total number of
words that have appeared in the context of word $i$. Other notation will be defined as we
go along.
Deriving the GloVe Equation
(This subsection is not directly necessary to understand how GloVe and word2vec
differ, so feel free to skip it if you are not interested.)
Now, it seems clear that we should be aiming to predict the co-occurrence ratios
using the word vectors. For now, we’ll express the relation between the ratios and the
vectors with the following equation:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Here, $P_{ij}$ refers to the probability of word $j$ appearing in the context of word $i$, and
can be computed from the co-occurrence matrix as

$$P_{ij} = \frac{X_{ij}}{X_i}.$$

$F$ is some unknown function that takes the embeddings of the words as
input. Notice that there are two kinds of embeddings: input and output
(the focus-word vectors $w$ and the context-word vectors $\tilde{w}$). This is a relatively minor
detail but is important to keep in mind.
Now, the question becomes what $F$ should be. If you recall, one of the goals of
GloVe was to create vectors with meaningful dimensions that expressed meaning
using simple arithmetic (addition and subtraction). We should choose $F$ so that
the vectors it produces satisfy this property.
Since we want simple arithmetic between the vectors to have meaning, it’s only
natural that we make the input to the function also be the result of arithmetic
between vectors. The simplest way to do this is by making the input to $F$ the
difference between the vectors we’re comparing:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Now, we want to create a linear relation between $w_i - w_j$ and $\tilde{w}_k$. This can be
accomplished by using the dot product:

$$F\big((w_i - w_j)^T \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$
Now, we’ll use two tricks to determine $F$ and simplify this equation:
1. By taking the log of the probability ratio, we can convert the ratio into a
subtraction between log probabilities
2. By adding a bias term for each word, we can capture the fact that some words
just occur more often than others.
These two tricks give us the following equation:

$$w_i^T \tilde{w}_k - w_j^T \tilde{w}_k = \log P_{ik} - \log P_{jk}$$

We can convert this equation into an equation over a single entry in the co-
occurrence matrix:

$$w_i^T \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$$

By absorbing the final term on the right-hand side into the bias term $b_i$, and adding
an output bias $\tilde{b}_k$ for symmetry, we get

$$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$

And here we are. This is the core equation behind GloVe.
 
Weighting occurrences
There is one problem with the equation above: it weights all co-occurrences
equally. Unfortunately, not all co-occurrences have the same quality of
information. Co-occurrences that are infrequent will tend to be noisy and
unreliable, so we want to weight frequent co-occurrences more heavily. On the
other hand, we don’t want co-occurrences like “it is” dominating the loss, so we
don’t want to weight too heavily based on frequency.
Through experimentation, the authors of the paper found the following
weighting function to perform relatively well:

$$f(X_{ij}) = \begin{cases} \left(\dfrac{X_{ij}}{x_{\max}}\right)^{\alpha} & \text{if } X_{ij} < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

(The paper uses $x_{\max} = 100$ and $\alpha = 3/4$.) The function is easier to picture when plotted:
it increases gradually with $X_{ij}$ but never becomes larger than 1. Using this function, the loss becomes

$$J = \sum_{i,j} f(X_{ij}) \left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

where the sum runs over the non-zero entries of $X$.
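Putting the pieces together, here is a minimal NumPy sketch of this weighted least-squares objective (variable names are mine; the paper reports $x_{\max} = 100$ and $\alpha = 3/4$ and optimizes the objective with AdaGrad):

```python
# Minimal NumPy sketch of the GloVe objective
#   J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    # f(x) grows as (x / x_max)^alpha and is capped at 1 for very frequent pairs
    return np.minimum(1.0, (x / x_max) ** alpha)

def glove_loss(X, W, W_tilde, b, b_tilde):
    i, j = np.nonzero(X)                              # only nonzero co-occurrences contribute
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return np.sum(weight(X[i, j]) * err ** 2)

rng = np.random.default_rng(0)
V, d = 5, 8                                           # toy vocabulary size and embedding dim
X = rng.integers(0, 50, size=(V, V)).astype(float)    # stand-in co-occurrence counts
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_loss(X, W, W_tilde, b, b_tilde))          # minimized w.r.t. W, W_tilde, b, b_tilde
```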
 
Comparison with Word2Vec
Though GloVe and word2vec use completely different methods for optimization,
they are actually surprisingly mathematically similar. This is because, although
word2vec does not explicitly decompose a co-occurrence matrix, it implicitly
optimizes over one by streaming over the sentences.
Word2vec optimizes the log likelihood of seeing words in the same context
windows together, where the probability of seeing word $j$ in the context of word $i$ is
estimated with a softmax:

$$Q_{ij} = \frac{\exp(w_i^T \tilde{w}_j)}{\sum_k \exp(w_i^T \tilde{w}_k)}$$

The loss function can be expressed as

$$J = -\sum_{i \in \text{corpus}} \; \sum_{j \in \text{context}(i)} \log Q_{ij}$$

The number of times $\log Q_{ij}$ is added to the loss is equal to the number of
times the pair $(i, j)$ appears in the corpus. Therefore, this loss is equivalent to

$$J = -\sum_{i} \sum_{j} X_{ij} \log Q_{ij}$$

Now, if we recall, the actual probability of word $j$ appearing in the context of word $i$ is

$$P_{ij} = \frac{X_{ij}}{X_i}$$

Meaning we can group the terms that share the same focus word $i$ and get the following equation:

$$J = -\sum_i X_i \sum_j P_{ij} \log Q_{ij} = \sum_i X_i \, H(P_i, Q_i)$$
Notice how the term $H(P_i, Q_i) = -\sum_j P_{ij} \log Q_{ij}$ takes the form of the cross-entropy loss. In
other words, the loss of word2vec is actually just a frequency-weighted sum of
the cross-entropy between the predicted word distribution and the actual word
distribution in the context of the word $i$. The weighting factor $X_i$ comes from the
fact that we stream across all the data equally, meaning a word that appears $X_i$
times will contribute to the loss $X_i$ times. There is no inherent reason to stream
across all the data equally though. In fact, the word2vec paper actually argues
that filtering the data and reducing updates from frequent context words
improves performance. This means that the weighting factor can be an arbitrary
function of frequency $f(X_i)$, leading to the loss function

$$J = \sum_i f(X_i) \, H(P_i, Q_i)$$
Notice how this only differs from the GloVe equation we derived above in the
loss measure used between the actual observed frequencies and the predicted
frequencies. Whereas GloVe uses a weighted squared error between the log of the
predicted (unnormalized) probabilities and the log of the actual observed (unnormalized)
probabilities, word2vec uses the cross-entropy loss over the normalized
probabilities.
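For comparison, here is the word2vec loss written in the grouped form above as a NumPy sketch (my own illustration mirroring the GloVe sketch earlier; the toy matrix is random):

```python
# Sketch of word2vec's loss in its "grouped" form: J = sum_i X_i * H(P_i, Q_i),
# where Q_i is the softmax-predicted context distribution for focus word i.
import numpy as np

def word2vec_grouped_loss(X, W, W_tilde):
    scores = W @ W_tilde.T                                           # w_i . w~_j for all pairs
    Q = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # predicted P(j | i)
    P = X / X.sum(axis=1, keepdims=True)                             # observed  P(j | i) = X_ij / X_i
    H = -np.sum(P * np.log(Q), axis=1)                               # cross-entropy per focus word
    return np.sum(X.sum(axis=1) * H)                                 # each row weighted by X_i

rng = np.random.default_rng(1)
V, d = 5, 8
X = rng.integers(1, 50, size=(V, V)).astype(float)                   # toy co-occurrence counts
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
print(word2vec_grouped_loss(X, W, W_tilde))
```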
The GloVe paper argues that the squared error on log counts is better than cross-entropy
because cross-entropy tends to put too much weight on the long tails. In reality,
this is probably a subject that needs more analysis. The essential thing to know is
that word2vec and GloVe are virtually the same mathematically, despite
having entirely different derivations!
 
Computational Complexity
If word2vec and GloVe are mathematically similar, the natural thing to ask is if
one is faster than the other.
GloVe requires computing the co-occurrence matrix, but since both methods
need to build the vocabulary before optimization, and the co-occurrence matrix
can be built during this process, this is not the defining factor in the
computational complexity.
Instead, the computational complexity of GloVe is proportional to the number of
non-zero elements of $X$. In contrast, the computational complexity of word2vec
is proportional to the size of the corpus, $|C|$.
Though I won’t go into the details, the frequency $X_{ij}$ of a word pair can be
approximated based on its frequency rank $r_{ij}$ (the ranking of the word pair when
all word pairs are sorted by frequency):

$$X_{ij} = \frac{k}{(r_{ij})^{\alpha}}$$

(This $\alpha$ is unrelated to the exponent in the weighting function above.) Using this
power-law model, we can approximate the number of non-zero elements of $X$ as

$$|X| = O(|C|) \ \text{if } \alpha < 1, \qquad |X| = O\!\left(|C|^{1/\alpha}\right) \ \text{otherwise.}$$
Essentially, when some word pairs occur very frequently, it’s faster to optimize over the
aggregated statistics than to iterate over the entire corpus repeatedly.
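As a toy numeric check of this claim (under the power-law model above, with an arbitrary constant $k$ and $\alpha = 1.25$ chosen purely as an example value greater than 1), we can compare the corpus size $|C| = \sum_r X_r$ with the number of non-zero pairs $|X|$:

```python
# Toy check of the complexity argument under the model X_r = k / r^alpha:
# pairs with X_r >= 1 are those with rank r <= k^(1/alpha), so |X| = k^(1/alpha),
# while |C| is the sum of all pair counts. With alpha > 1, |X| scales as |C|^(1/alpha).
import numpy as np

def corpus_and_nnz(k, alpha):
    nnz = int(k ** (1 / alpha))            # number of pairs occurring at least once
    r = np.arange(1, nnz + 1)
    corpus_size = np.sum(k / r ** alpha)   # total number of co-occurrence observations
    return corpus_size, nnz

for k in [1e6, 1e7, 1e8]:
    C, nnz = corpus_and_nnz(k, alpha=1.25)
    print(f"|C| ~ {C:.2e}   |X| = {nnz:.2e}   |C|^(1/1.25) = {C ** 0.8:.2e}")
```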
 
Results
Though the GloVe paper presents many experimental results, I’ll focus here on the results
it presents in comparison with word2vec.
The authors show that GloVe consistently produces better embeddings faster
than word2vec. This makes sense, given how GloVe is much more principled in its
approach to word embeddings.
Unfortunately, empirical results published after the GloVe paper seem to favor
the conclusion that GloVe and word2vec perform roughly the same for many
downstream tasks. Furthermore, most people use pre-trained word embeddings
so the training time is not much of an advantage. Therefore, GloVe should be one
choice for pre-trained word embeddings, but should not be the only choice.
 
Conclusion and Further Readings
GloVe is not just “another word2vec”; it is a more systematic approach to word
embeddings that sheds light on the behavior of word2vec and embeddings in
general. I hope this post has given the reader some more insight into this
outstanding and deep paper.
Here are some further readings for those interested:
The original GloVe paper
The word2vec paper
 
Other papers dissected:
Attention is All You Need
Deep Image Prior
Quasi-Recurrent Neural Networks
keitakurita · April 29, 2018 · Deep Learning, NLP, Paper
Paper Dissected: "BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding" Explained
January 7, 2019
In "Deep Learning"
Paper Dissected: "Deep Contextualized Word Representations" Explained
June 15, 2018
In "Deep Learning"
/ /
/

More Related Content

What's hot

Magicni Vjetar Knjiga 04.pdf
Magicni Vjetar Knjiga 04.pdfMagicni Vjetar Knjiga 04.pdf
Magicni Vjetar Knjiga 04.pdfzoran radovic
 
Degree Certificate
Degree CertificateDegree Certificate
Degree CertificateVel Durai
 
الاجتماعيات للصف الرابع الابتدائي
 الاجتماعيات للصف الرابع الابتدائي الاجتماعيات للصف الرابع الابتدائي
الاجتماعيات للصف الرابع الابتدائيAyad Haris Beden
 
The Cambridge Grammar of the English Language
The Cambridge Grammar of the English LanguageThe Cambridge Grammar of the English Language
The Cambridge Grammar of the English LanguageJairo Caetano
 
Learn to speak english in 100 days urdu pdf book
Learn to speak english in 100 days urdu pdf bookLearn to speak english in 100 days urdu pdf book
Learn to speak english in 100 days urdu pdf bookWaqas Ali
 
จักรวาลในเปลือกนัท
จักรวาลในเปลือกนัทจักรวาลในเปลือกนัท
จักรวาลในเปลือกนัทnsumato
 
Magicni Vjetar Knjiga 05.pdf
Magicni Vjetar Knjiga 05.pdfMagicni Vjetar Knjiga 05.pdf
Magicni Vjetar Knjiga 05.pdfzoran radovic
 
شرح طيبة النشر لموسى بن جار الله طبعة قديمة.pdf
شرح طيبة النشر لموسى بن جار الله طبعة قديمة.pdfشرح طيبة النشر لموسى بن جار الله طبعة قديمة.pdf
شرح طيبة النشر لموسى بن جار الله طبعة قديمة.pdfسمير بسيوني
 
Die schwersten 100 Tage als Führungskraft souverän meistern.
Die schwersten 100 Tage als Führungskraft souverän meistern.Die schwersten 100 Tage als Führungskraft souverän meistern.
Die schwersten 100 Tage als Führungskraft souverän meistern.Michael Holub
 
#公園廃止 は1軒のクレームから: 青木島遊園地廃止を区長会12人のうち1/3欠席で決定 令和3年12月14日会議子ども政策課記録
#公園廃止 は1軒のクレームから: 青木島遊園地廃止を区長会12人のうち1/3欠席で決定 令和3年12月14日会議子ども政策課記録#公園廃止 は1軒のクレームから: 青木島遊園地廃止を区長会12人のうち1/3欠席で決定 令和3年12月14日会議子ども政策課記録
#公園廃止 は1軒のクレームから: 青木島遊園地廃止を区長会12人のうち1/3欠席で決定 令和3年12月14日会議子ども政策課記録長野市議会議員小泉一真
 
قواعد اللغة العربية للرابع الابتدائي
قواعد اللغة العربية للرابع الابتدائيقواعد اللغة العربية للرابع الابتدائي
قواعد اللغة العربية للرابع الابتدائيAyad Haris Beden
 
Salinas OJT Certificate of Completion
Salinas OJT Certificate of CompletionSalinas OJT Certificate of Completion
Salinas OJT Certificate of CompletionJan Rei Datinguinoo
 
Catalogue 12 arvea nature tunisie
Catalogue 12 arvea nature tunisieCatalogue 12 arvea nature tunisie
Catalogue 12 arvea nature tunisieSameh50
 
Magicni Vjetar Knjiga 03.pdf
Magicni Vjetar Knjiga 03.pdfMagicni Vjetar Knjiga 03.pdf
Magicni Vjetar Knjiga 03.pdfzoran radovic
 
القرار الوزارى رقم 525 بشان تنظيم عمل مجموعات التقوية بالمدارس
القرار الوزارى رقم 525 بشان تنظيم عمل مجموعات التقوية بالمدارسالقرار الوزارى رقم 525 بشان تنظيم عمل مجموعات التقوية بالمدارس
القرار الوزارى رقم 525 بشان تنظيم عمل مجموعات التقوية بالمدارسElsayed Aboulila
 
Intermediate Marksheet
Intermediate MarksheetIntermediate Marksheet
Intermediate MarksheetAbdul Samad
 

What's hot (20)

Magicni Vjetar Knjiga 04.pdf
Magicni Vjetar Knjiga 04.pdfMagicni Vjetar Knjiga 04.pdf
Magicni Vjetar Knjiga 04.pdf
 
Degree Certificate
Degree CertificateDegree Certificate
Degree Certificate
 
40 Rohani Ilaj Ma Tibbi Ilaaj
40 Rohani Ilaj Ma Tibbi Ilaaj40 Rohani Ilaj Ma Tibbi Ilaaj
40 Rohani Ilaj Ma Tibbi Ilaaj
 
الاجتماعيات للصف الرابع الابتدائي
 الاجتماعيات للصف الرابع الابتدائي الاجتماعيات للصف الرابع الابتدائي
الاجتماعيات للصف الرابع الابتدائي
 
The Cambridge Grammar of the English Language
The Cambridge Grammar of the English LanguageThe Cambridge Grammar of the English Language
The Cambridge Grammar of the English Language
 
Learn to speak english in 100 days urdu pdf book
Learn to speak english in 100 days urdu pdf bookLearn to speak english in 100 days urdu pdf book
Learn to speak english in 100 days urdu pdf book
 
Quran with Tajwid Surah 4 ﴾القرآن سورۃ النساء﴿ An-Nisa' 🙪 PDF
Quran with Tajwid Surah 4 ﴾القرآن سورۃ النساء﴿ An-Nisa' 🙪 PDFQuran with Tajwid Surah 4 ﴾القرآن سورۃ النساء﴿ An-Nisa' 🙪 PDF
Quran with Tajwid Surah 4 ﴾القرآن سورۃ النساء﴿ An-Nisa' 🙪 PDF
 
จักรวาลในเปลือกนัท
จักรวาลในเปลือกนัทจักรวาลในเปลือกนัท
จักรวาลในเปลือกนัท
 
Magicni Vjetar Knjiga 05.pdf
Magicni Vjetar Knjiga 05.pdfMagicni Vjetar Knjiga 05.pdf
Magicni Vjetar Knjiga 05.pdf
 
شرح طيبة النشر لموسى بن جار الله طبعة قديمة.pdf
شرح طيبة النشر لموسى بن جار الله طبعة قديمة.pdfشرح طيبة النشر لموسى بن جار الله طبعة قديمة.pdf
شرح طيبة النشر لموسى بن جار الله طبعة قديمة.pdf
 
Die schwersten 100 Tage als Führungskraft souverän meistern.
Die schwersten 100 Tage als Führungskraft souverän meistern.Die schwersten 100 Tage als Führungskraft souverän meistern.
Die schwersten 100 Tage als Führungskraft souverän meistern.
 
#公園廃止 は1軒のクレームから: 青木島遊園地廃止を区長会12人のうち1/3欠席で決定 令和3年12月14日会議子ども政策課記録
#公園廃止 は1軒のクレームから: 青木島遊園地廃止を区長会12人のうち1/3欠席で決定 令和3年12月14日会議子ども政策課記録#公園廃止 は1軒のクレームから: 青木島遊園地廃止を区長会12人のうち1/3欠席で決定 令和3年12月14日会議子ども政策課記録
#公園廃止 は1軒のクレームから: 青木島遊園地廃止を区長会12人のうち1/3欠席で決定 令和3年12月14日会議子ども政策課記録
 
قواعد اللغة العربية للرابع الابتدائي
قواعد اللغة العربية للرابع الابتدائيقواعد اللغة العربية للرابع الابتدائي
قواعد اللغة العربية للرابع الابتدائي
 
kcse cert
kcse certkcse cert
kcse cert
 
Surah An Naml
Surah An NamlSurah An Naml
Surah An Naml
 
Salinas OJT Certificate of Completion
Salinas OJT Certificate of CompletionSalinas OJT Certificate of Completion
Salinas OJT Certificate of Completion
 
Catalogue 12 arvea nature tunisie
Catalogue 12 arvea nature tunisieCatalogue 12 arvea nature tunisie
Catalogue 12 arvea nature tunisie
 
Magicni Vjetar Knjiga 03.pdf
Magicni Vjetar Knjiga 03.pdfMagicni Vjetar Knjiga 03.pdf
Magicni Vjetar Knjiga 03.pdf
 
القرار الوزارى رقم 525 بشان تنظيم عمل مجموعات التقوية بالمدارس
القرار الوزارى رقم 525 بشان تنظيم عمل مجموعات التقوية بالمدارسالقرار الوزارى رقم 525 بشان تنظيم عمل مجموعات التقوية بالمدارس
القرار الوزارى رقم 525 بشان تنظيم عمل مجموعات التقوية بالمدارس
 
Intermediate Marksheet
Intermediate MarksheetIntermediate Marksheet
Intermediate Marksheet
 

Similar to Paper dissected glove_ global vectors for word representation_ explained _ machine learning explained

[Emnlp] what is glo ve part ii - towards data science
[Emnlp] what is glo ve  part ii - towards data science[Emnlp] what is glo ve  part ii - towards data science
[Emnlp] what is glo ve part ii - towards data scienceNikhil Jaiswal
 
[Emnlp] what is glo ve part iii - towards data science
[Emnlp] what is glo ve  part iii - towards data science[Emnlp] what is glo ve  part iii - towards data science
[Emnlp] what is glo ve part iii - towards data scienceNikhil Jaiswal
 
Vectorization In NLP.pptx
Vectorization In NLP.pptxVectorization In NLP.pptx
Vectorization In NLP.pptxChode Amarnath
 
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherMLReview
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsVincenzo Lomonaco
 
Word_Embeddings.pptx
Word_Embeddings.pptxWord_Embeddings.pptx
Word_Embeddings.pptxGowrySailaja
 
DataChat_FinalPaper
DataChat_FinalPaperDataChat_FinalPaper
DataChat_FinalPaperUrjit Patel
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - IntroductionChristian Perone
 
[Emnlp] what is glo ve part i - towards data science
[Emnlp] what is glo ve  part i - towards data science[Emnlp] what is glo ve  part i - towards data science
[Emnlp] what is glo ve part i - towards data scienceNikhil Jaiswal
 
Word embeddings
Word embeddingsWord embeddings
Word embeddingsShruti kar
 
WordNet Based Online Reverse Dictionary with Improved Accuracy and Parts-of-S...
WordNet Based Online Reverse Dictionary with Improved Accuracy and Parts-of-S...WordNet Based Online Reverse Dictionary with Improved Accuracy and Parts-of-S...
WordNet Based Online Reverse Dictionary with Improved Accuracy and Parts-of-S...IRJET Journal
 
Recipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tastyRecipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tastyPyData
 
Lda2vec text by the bay 2016 with notes
Lda2vec text by the bay 2016 with notesLda2vec text by the bay 2016 with notes
Lda2vec text by the bay 2016 with notes👋 Christopher Moody
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnRwanEnan
 
Automated Software Requirements Labeling
Automated Software Requirements LabelingAutomated Software Requirements Labeling
Automated Software Requirements LabelingData Works MD
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptxNameetDaga1
 
Data Science Your Vacation
Data Science Your VacationData Science Your Vacation
Data Science Your VacationTJ Stalcup
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopiwan_rg
 
Constructing dataset based_on_concept_hierarchy_for_evaluating_word_vectors_l...
Constructing dataset based_on_concept_hierarchy_for_evaluating_word_vectors_l...Constructing dataset based_on_concept_hierarchy_for_evaluating_word_vectors_l...
Constructing dataset based_on_concept_hierarchy_for_evaluating_word_vectors_l...禎晃 山崎
 

Similar to Paper dissected glove_ global vectors for word representation_ explained _ machine learning explained (20)

[Emnlp] what is glo ve part ii - towards data science
[Emnlp] what is glo ve  part ii - towards data science[Emnlp] what is glo ve  part ii - towards data science
[Emnlp] what is glo ve part ii - towards data science
 
[Emnlp] what is glo ve part iii - towards data science
[Emnlp] what is glo ve  part iii - towards data science[Emnlp] what is glo ve  part iii - towards data science
[Emnlp] what is glo ve part iii - towards data science
 
Vectorization In NLP.pptx
Vectorization In NLP.pptxVectorization In NLP.pptx
Vectorization In NLP.pptx
 
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and Whither
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experiments
 
Word_Embeddings.pptx
Word_Embeddings.pptxWord_Embeddings.pptx
Word_Embeddings.pptx
 
DataChat_FinalPaper
DataChat_FinalPaperDataChat_FinalPaper
DataChat_FinalPaper
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
[Emnlp] what is glo ve part i - towards data science
[Emnlp] what is glo ve  part i - towards data science[Emnlp] what is glo ve  part i - towards data science
[Emnlp] what is glo ve part i - towards data science
 
Word embeddings
Word embeddingsWord embeddings
Word embeddings
 
Gaps in the algorithm
Gaps in the algorithmGaps in the algorithm
Gaps in the algorithm
 
WordNet Based Online Reverse Dictionary with Improved Accuracy and Parts-of-S...
WordNet Based Online Reverse Dictionary with Improved Accuracy and Parts-of-S...WordNet Based Online Reverse Dictionary with Improved Accuracy and Parts-of-S...
WordNet Based Online Reverse Dictionary with Improved Accuracy and Parts-of-S...
 
Recipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tastyRecipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tasty
 
Lda2vec text by the bay 2016 with notes
Lda2vec text by the bay 2016 with notesLda2vec text by the bay 2016 with notes
Lda2vec text by the bay 2016 with notes
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
 
Automated Software Requirements Labeling
Automated Software Requirements LabelingAutomated Software Requirements Labeling
Automated Software Requirements Labeling
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
Data Science Your Vacation
Data Science Your VacationData Science Your Vacation
Data Science Your Vacation
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
 
Constructing dataset based_on_concept_hierarchy_for_evaluating_word_vectors_l...
Constructing dataset based_on_concept_hierarchy_for_evaluating_word_vectors_l...Constructing dataset based_on_concept_hierarchy_for_evaluating_word_vectors_l...
Constructing dataset based_on_concept_hierarchy_for_evaluating_word_vectors_l...
 

Recently uploaded

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 

Recently uploaded (20)

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 

Paper dissected glove_ global vectors for word representation_ explained _ machine learning explained

  • 1. Machine Learning Explained Deep learning, python, data wrangling and other machine learning related topics explained for practitioners Paper Dissected: “Glove: Global Vectors for Word Representation” Explained Pre-trained word embeddings are a staple in deep learning for NLP. The pioneer of word embeddings in mainstream deep learning is the renowned word2vec. GloVe is another commonly used method of obtaining pre-trained embeddings. Unfortunately, there are very few practitioners that seem to actually understand GloVe; many just consider it “another word2vec”. In reality, GloVe is a much more principled approach to word embeddings that provides deep insights into word embeddings in general. GloVe is a must-read paper for any self-respecting NLP engineer that can seem intimidating at rst glance. This post aims to dissect and explain the paper for engineers and highlight the di erences and similarities between GloVe and word2vec .   TL;DR GloVe aims to achieve two goals: (1) Create word vectors that capture meaning in vector space (2) Takes advantage of global count statistics instead of only local information Unlike word2vec – which learns by streaming sentences – GloVe learns based on a co-occurrence matrix and trains word vectors so their di erences predict co-occurrence ratios
  • 2. GloVe weights the loss based on word frequency Somewhat surprisingly, word2vec and GloVe turn out to be extremely similar, despite starting o from entirely di erent starting points Motivation and Brief Review of Word Embeddings The Problem with Word2vec GloVe was published after word2vec so the natural question to ask is: why is word2vec not enough? In case you are not familiar, here’s a quick explanation of word2vec. Word2vec trains word embeddings by optimizing a loss function with gradient descent, just like any other deep learning model. In word2vec, the loss function is computed by measuring how well a certain word can predict its surroundings words. For instance, take the sentence “The cat sat on the mat”. Suppose the context window is size 2, and the word we are focusing on is “sat”. The context words are “the”, “cat”, “on” and “the”. Word2vec tries to either predict the word in focus from the context words (this is called the CBOW model) or the context words using the word in focus (this is called the Skip-gram model).
  • 3. I won’t go into the details here since there are plenty of explanations online. The important thing to note is that word2vec only takes local contexts into account. It does not take advantage of global count statistics. For example, “the” and “cat” might be used together often, but word2vec does not know if this is because “the” is a common word or if this is because the words “the” and “cat” have a strong linkage (well actually, it indirectly does, but this is a topic I’ll cover later on in this post). This is the motivation for using global count statistics. The problem with global count statistic based methods Before word2vec, using matrices that contained global count information was one of the major ways of converting words to vectors (yes, the idea of word vectors/embeddings existed far before word2vec). For instance, Latent Semantic Analysis (LSA) computed word embeddings by decomposing term-document matrices using Singular Value Decomposition. Though these methods take advantage of global information, the obtained vectors do not show the same behavior as those obtained by word2vec. For instance, in word2vec, word analogies can be expressed in terms of simple (vector) arithmetic such as in the case of “king – man + woman = queen”. This behavior is desirable because it shows that the vectors capture dimensions of meaning. Some dimensions of the vectors might capture whether the word is male or female, present tense or past tense, plural or singular, etc.. This means that downstream models can easily extract the meaning from these vectors, which is what we really want to achieve when training word vectors.  
  • 4. GloVe aims to take the best of both worlds: take global information into account while learning dimensions of meaning. From this initial goal, GloVe builds up a principled method of training word vectors. Building GloVe, Step by Step Data Preparation In word2vec, the vectors were learned so that they could achieve the task of predicting surrounding words in a sentence.  During training, word2vec streams the sentences and computes the loss for each batch of words. In GloVe, the authors take a more principled approach. The rst step is to build a co-occurrence matrix. GloVe also takes local context into account by computing the co-occurrence matrix using a xed window size (words are deemed to co- occur when they appear together within a xed window). For instance, the sentence “The cat sat on the mat” with a window size of  2 would be converted to the co-occurrence matrix the cat sat on ma t the 2 1 2 1 1 cat 1 1 1 1 0 sat 2 1 1 1 0 on 1 1 1 1 1 mat 1 0 0 1 1 Notice how the matrix is symmetric: this is because when the word “cat” appears in the context of “sat”, the opposite (the word “sat” appearing in the context of”cat”) also happens.
  • 5. What Should we Predict? Now, the question is how to connect the vectors with the statistics computed above. The underlying principle behind GloVe can be stated as follows: the co- occurrence ratios between two words in a context are strongly connected to meaning. This sounds di cult but the idea is really simple. Take the words “ice” and “steam”, for instance. Ice and steam di er in their state but are the same in that they are both forms of water. Therefore, we would expect words related to water (like “water” and “wet”) to appear equally in the context of “ice” and “steam”. In contrast, words like “cold” and “solid” would probably appear near “ice” but would not appear near “steam”. The following table shows the actual statistics that show this intuition well: The probabilities shown here are basically just counts of how often the word appears when the words “ice” and “steam” are in the context, where refers to the words “solid”, “gas”, “water”, and “fashion”. As you can see, words that are related to the nature of “ice” and “steam” (“solid” and “gas” respectively) occur far more often with their corresponding words that the non-corresponding word. In contrast, words like “water” and “fashion” which are not particularly related to either have a probability ratio near 1. Note that the probability ratios can be computed easily using the co-occurrence matrix.   From here on, we’ll need a bit of mathematical notation to make the explanation easier. We’ll use to refer to the co-occurrence matrix and to refer to the th element in which is equal to the number of times word appears in the context of word . We’ll also de ne to refer to the total number of
  • 6. words that have appeared in the context of . Other notation will be de ned as we go along. Deriving the GloVe Equation (This subsection is not directly necessary to understand how GloVe and word2vec di er, etc. so feel free to skip it if you are not interested.) Now, it seems clear that we should be aiming to predict the co-occurrence ratios using the word vectors. For now, we’ll express the relation between the ratios with the following equation: Here,  refers to the probability of the word appearing in the context of , and can be computed as . is some unknown function that takes the embeddings for the words as input. Notice that there are two kinds of embeddings: input and output (expressed as and ) for the context and focus words. This is a relatively minor detail but is important to keep in mind. Now, the question becomes what should be.  If you recall, one of the goals of GloVe was to create vectors with meaningful dimensions that expressed meaning using simple arithmetic (addition and subtraction). We should choose so that the vectors it leads to meet this property. Since we want simple arithmetic between the vectors to have meaning, it’s only natural that we make the input to the function also be the result of arithmetic between vectors. The simplest way to do this is by making the input to the di erence between the vectors we’re comparing:
  • 7. Now, we want create a linear relation between and . This can be accomplished by using the dot product: Now, we’ll use two tricks to determine and simplify this equation: 1. By taking the log of the probability ratios, we can convert the ratio into a subtraction between probabilities 2. By adding a bias term for each word, we can capture the fact that some words just occur more often than others. These two tricks give us the following equation: We can convert this equation into an equation over a single entry in the co- occurrence matrix. By absorbing the nal term on the right-hand side into the bias term, and adding an output bias for symmetry, we get And here we are. This is the core equation behind GloVe.   Weighting occurrences There is one problem with the equation above: it weights all co-occurrences equally. Unfortunately, not all co-occurrences have the same quality of information. Co-occurrences that are infrequent will tend to be noisy and unreliable, so we want to weight frequent co-occurrences more heavily. On the
  • 8. other hand, we don’t want co-occurrences like “it is” dominating the loss, so we don’t want to weight too heavily based on frequency. Through experimentation, the authors of the paper found the following weighting function to perform relatively well: The function becomes easier to imagine when it is plotted: Essentially, the function gradually increases with but does not become any larger than 1. Using this function, the loss becomes   Comparison with Word2Vec Though GloVe and word2vec use completely di erent methods for optimization, they are actually surprisingly mathematically similar. This is because, although word2vec does not explicitly decompose a co-occurrence matrix, it implicitly optimizes over one by streaming over the sentences. Word2vec optimizes the log likelihood of seeing words in the same context windows together, where the probability of seeing word in the context of is estimated as
  • 9. The loss function can be expressed as The number of times is added to the loss is equivalent to the number of times the pair appears in the corpus. Therefore, this loss is equivalent to Now, if we recall, the actual probability of word appearing in the context of is Meaning we can group over the context word and get the following equation: Notice how the term takes the form of the cross-entropy loss. In other words, the loss of word2vec is actually just a frequency weighted sum over the cross-entropy between the predicted word distribution and the actual word distribution in the context of the word . The weighting factor comes from the fact that we stream across all the data equally, meaning a word that appears times will contribute to the loss times. There is no inherent reason to stream across all the data equally though. In fact, the word2vec paper actually argues that ltering the data and reducing updates from frequent context words improves performance. This means that the weighting factor can be an arbitrary function of frequency, leading to the loss function Notice how this only di ers from the GloVe equation we derived above in terms of the loss measure between the actual observed frequencies and the predicted frequencies. Whereas GloVe uses the log mean squared error between the predicted (unnormalized) probabilities and the actual observed (unnormalized)
  • 10. probabilities, word2vec uses the cross-entropy loss over the normalized probabilities. The GloVe paper argues that log mean squared error is better than cross-entropy because cross entropy tends to put too much weight on the long tails. In reality, this is probably a subject that needs more analysis. The essential thing to know is that word2vec and GloVe are actually virtually mathematically the same, despite having entirely di erent derivations!   Computational Complexity If word2vec and GloVe are mathematically similar, the natural thing to ask is if one is faster than the other. GloVe requires computing the co-occurrence matrix, but since both methods need to build the vocabulary before optimization and the co-occurrence matrix can be built during this process, this is not the de ning factor in the computational complexity. Instead, the computational complexity of GloVe is proportionate to the number of non-zero elements of . In contrast, the computational complexity of word2vec is proportionate to the size of the corpus . Though I won’t go into the details, the frequency of a word pair can be approximated based on its frequency rank (the ranking of the word pair when all words pairs are sorted by frequency): Using this, we can approximate the number of non-zero elements in as if else .
  • 11. Essentially, if some words occur very frequently it’s faster to optimize over the statistics rather than to iterate over all the entire corpus repeatedly.   Results Though the GloVe paper presents many experimental results, I’ll focus the results it presents in terms of comparison with word2vec here. The authors show that GloVe consistently produces better embeddings faster than word2vec. This makes sense, given how GloVe is much more principled in its approach to word embeddings. Unfortunately, empirical results published after the GloVe paper seem to favor the conclusion that GloVe and word2vec perform roughly the same for many downstream tasks. Furthermore, most people use pre-trained word embeddings so the training time is not much of an advantage. Therefore, GloVe should be one choice for pre-trained word embeddings, but should not be the only choice.   Conclusion and Further Readings
  • 12. GloVe is not just “another word2vec”; it is a more systematic approach to word embeddings that sheds light on the behavior of word2vec and embeddings in general. I hope this post has given the reader some more insight into this outstanding and deep paper. Here are some further readings for those interested: The original Glove paper The word2vec paper   Other papers dissected: Attention is All You Need Deep Image Prior Quasi-Recurrent Neural Networks Share this:   Like this: Like Be the first to like this. An Overview of Sentence Embedding Methods December 28, 2017 In "Deep Learning" Related
  • 13. Machine Learning Explained Proudly powered by WordPress keitakurita April 29, 2018 Deep Learning, NLP, Paper Paper Dissected: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" Explained January 7, 2019 In "Deep Learning" Paper Dissected: "Deep Contextualized Word Representations" Explained June 15, 2018 In "Deep Learning" / / /