[EMNLP] What is GloVe? Part II
An introduction to unsupervised learning of word embeddings from
co-occurrence matrices.
Brendan Whitaker
May 25, 2018 · 5 min read
In this article, we’ll discuss one of the newer methods of creating vector space models
of word semantics, more commonly known as word embeddings. The original paper by
J. Pennington, R. Socher, and C. Manning is available here:
http://www.aclweb.org/anthology/D14-1162. This method combines elements from
the two main word embedding models that existed when GloVe, short for “Global
Vectors [for word representation],” was proposed: global matrix factorization and local
context window methods. In Part I, we compared these two approaches. Now
we’ll explain the GloVe embedding generation algorithm and how it
improves on these previous methods.
. . .
Co-occurrence probabilities.
Recall from Part I of this series that term-term frequency matrices encode how
often terms appear in the context of one another by enumerating each unique token in
the corpus along both axes of a large 2-dimensional matrix. Factorizing this matrix
gives us a low-rank approximation of the data contained in the original matrix.
However, as we’ll explain in a moment, the authors of GloVe found empirically that
instead of learning the raw co-occurrence probabilities, it may make more sense to
learn ratios of these co-occurrence probabilities, which seem to better discriminate
subtleties in term-term relevance.
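To make the term-term matrix concrete, here is a minimal sketch of how such counts can be accumulated from a tokenized corpus. The function name, window size, and toy corpus are illustrative choices of mine, not the GloVe reference implementation (which, among other things, weights each co-occurrence by the inverse of the distance between the two words).

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=3):
    """Count how often each word appears within `window` tokens of another word.

    Illustrative sketch only: a flat symmetric window with unweighted counts,
    unlike the GloVe tooling, which weights each pair by 1/distance.
    """
    counts = defaultdict(float)
    for center, word in enumerate(tokens):
        lo, hi = max(0, center - window), min(len(tokens), center + window + 1)
        for ctx in range(lo, hi):
            if ctx != center:
                counts[(word, tokens[ctx])] += 1.0
    return counts

# Tiny toy corpus; in practice the matrix is built from billions of tokens.
corpus = "ice is a solid and steam is a gas and water can be ice or steam".split()
X = cooccurrence_counts(corpus, window=3)
print(X[("ice", "solid")], X[("steam", "gas")])
```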
To illustrate this, we borrow an example from their paper: suppose we wish to study
the relationship between two words, i = ice and j = steam. We’ll do this by examining
the co-occurrence probabilities of these words with various “probe” words. We define
the co-occurrence probability of an arbitrary word i with an arbitrary word j to be the
probability that word j appears in the context of word i. This is represented by the
equation and definitions below.
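The equation referred to here is an image in the original post; reconstructed in the notation used in the article and in the GloVe paper, it reads:

```latex
P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}, \qquad X_i = \sum_k X_{ik}
```

where X_{ij} is the number of times word j occurs in the context of word i.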
Note that X_i is the total number of times any word appears in the context of word i,
i.e., the sum over all words k of the number of times word k occurs in the context of
word i. So if we choose a probe word k = solid, which is closely related to i = ice but
not to j = steam, we expect the ratio P_{ik}/P_{jk} of co-occurrence probabilities to be
large, since solid should, in theory, appear in the context of ice more often than in the
context of steam: ice is a solid and steam is not. Conversely, for a choice of k = gas, we
would expect the same ratio to be small, since steam is more closely related to gas than
ice is. Then we also have words like water, which are closely related to both ice and
steam, but not more to one than the other, and words like fashion, which are not
closely related to either of the words in question. For both water and fashion, we
expect our ratio to be close to 1, since there shouldn’t be any bias toward either ice or
steam.
Now it is important to note that since we are trying to determine information about the
relationship between the words ice and steam, water doesn’t give us a lot of useful
information. For discriminative purposes, it doesn’t give us a good idea of how “far
apart” steam is from ice, and the information that steam, ice, and water are all related
is already captured in the discriminative information between ice and water, and
between steam and water. Words that don’t help us distinguish between i and j are
referred to as noise points, and it is the use of ratios of co-occurrence probabilities that
helps filter out these noise points. This is well illustrated by the real corpus data for
these example words in Table 1 of the GloVe paper.
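To see how the ratios behave, here is a small sketch that computes P_{ik}/P_{jk} from a co-occurrence matrix. The counts below are invented purely for illustration; they are not the corpus statistics reported in Table 1 of the paper.

```python
import numpy as np

# Toy co-occurrence counts: rows are the target words i = ice and j = steam,
# columns are the probe words k. Numbers are made up for illustration only.
probes = ["solid", "gas", "water", "fashion"]
X = np.array([
    [80.0,  5.0, 300.0, 2.0],   # counts X_{ik} for i = ice
    [ 6.0, 90.0, 280.0, 2.0],   # counts X_{jk} for j = steam
])

# P_{ik} = X_{ik} / X_i, where X_i = sum_k X_{ik}.
P = X / X.sum(axis=1, keepdims=True)

# Ratios P_{ik} / P_{jk}: large for k = solid, small for k = gas,
# and close to 1 for k = water and k = fashion.
ratios = P[0] / P[1]
for k, r in zip(probes, ratios):
    print(f"P(ice,{k}) / P(steam,{k}) = {r:.2f}")
```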
So let’s take a step back for a moment and recall the overall structure of the problem.
We want to take global statistics from the corpus and learn a function that, given only
two words, tells us something about the relationship between them. The authors have
found that ratios of co-occurrence probabilities are a good source of this information,
so we would like our function to map two words to compare, together with a context
word, to a co-occurrence probability ratio. Let the function our model is learning be
given by F. A naive formulation of the desired model is given by the authors as:
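The formula shown here is an image in the original post; in the notation of the GloVe paper it is:

```latex
F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}
```

where \tilde{w}_k is a separate context word vector.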
Note that the w’s are real-valued word vectors. Now, since we want to encode
information about the ratio associated with a pair of words, the authors suggest using
the vector difference as the input to our function. Then we have the following:
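Again reconstructing the image from the original post, the revised form is:

```latex
F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}
```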
Now we’re getting closer to something that could work, but the final function for the
GloVe model will be considerably more complex, in order to accurately reflect certain
desirable symmetries, since the distinction between the words i and j should be
invariant under exchange of the inputs. The authors also design a weighting scheme for
co-occurrences to reflect the relationship between frequency and semantic relevance.
But since I’m trying to keep these summary articles to around 800 words, we’ll cover
all that in Part III!
[EMNLP] What is GloVe? Part III
An introduction to unsupervised learning of word
embeddings from co-occurrence matrices.
towardsdatascience.com
Please check out the source paper!
GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305
jpennin@stanford.edu, richard@socher.org, manning@stanford.edu
Abstract
Recent methods for learning vector space representations of words have succeeded in
capturing fine-grained semantic and syntactic regularities using vector arithmetic, but
the origin of these regularities has remained opaque. We analyze and make explicit the
model properties needed for such regularities to emerge in word vectors. The result is a
new global log-bilinear regression model that combines the advantages of the two
major model families in the literature: global matrix factorization and local context
window methods. Our model efficiently leverages statistical information by training
only on the nonzero elements in a word-word co-occurrence matrix, rather than on the
entire sparse matrix or on individual context windows in a large corpus. The model
produces a vector space with meaningful substructure, as evidenced by its performance
of 75% on a recent word analogy task. It also outperforms related models on similarity
tasks and named entity recognition.
1 Introduction
Semantic vector space models of language represent each word with a real-valued
vector. These vectors can be used as features in a variety of applications, such as
information retrieval (Manning et al., 2008), document classification (Sebastiani,
2002), question answering (Tellex et al., 2003), named entity recognition (Turian et al.,
2010), and parsing (Socher et al., 2013).

Most word vector methods rely on the distance or angle between pairs of word vectors
as the primary method for evaluating the intrinsic quality of such a set of word
representations. Recently, Mikolov et al. (2013c) introduced a new evaluation scheme
based on word analogies that probes the finer structure of the word vector space by
examining not the scalar distance between word vectors, but rather their various
dimensions of difference. For example, the analogy “king is to queen as man is to
woman” should be encoded in the vector space by the vector equation king − queen =
man − woman. This evaluation scheme favors models that produce dimensions of
meaning, thereby capturing the multi-clustering idea of distributed representations
(Bengio, 2009).

The two main model families for learning word vectors are: 1) global matrix
factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990)
and 2) local context window methods, such as the skip-gram model of Mikolov et al.
(2013c). Currently, both families suffer significant drawbacks. While methods like LSA
efficiently leverage statistical information, they do relatively poorly on the word
analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram
may do better on the analogy task, but they poorly utilize the statistics of the corpus
since they train on separate local context windows instead of on global co-occurrence
counts.

In this work, we analyze the model properties necessary to produce linear directions of
meaning and argue that global log-bilinear regression models are appropriate for doing
so. We propose a specific weighted least squares model that trains on global word-word
co-occurrence counts and thus makes efficient use of statistics. The model produces a
word vector space with meaningful substructure, as evidenced by its state-of-the-art
performance of 75% accuracy on the word analogy dataset. We also demonstrate that
our methods outperform other current methods on several word similarity tasks, and
also on a common named entity recognition (NER) benchmark. We provide the source
code for the model as well as trained word vectors at
http://nlp.stanford.edu/projects/glove/.