[EMNLP] What is GloVe? Part III
An introduction to unsupervised learning of word embeddings from
co-occurrence matrices.
Brendan Whitaker
May 27, 2018 · 5 min read
The final GloVe model. We haven’t defined a lot of the variables seen here, but worry not, we’ll get there.
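For reference, here is that final model as it appears in the paper, where w_i and \tilde{w}_j are word and context word vectors, b_i and \tilde{b}_j are bias terms, X_{ij} is a co-occurrence count, V is the vocabulary size, and f is a weighting function we’ll get to later (the paper writes context word vectors as \tilde{w}; in this series they are the v vectors):

$$ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$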
If you’re just joining us, please feel free to read Parts I and II first, as we’re picking up
right where they left off:
[EMNLP] What is GloVe? Part I
An introduction to unsupervised learning of word
embeddings from co-occurrence matrices.
towardsdatascience.com
[EMNLP] What is GloVe? Part II
An introduction to unsupervised learning of word
embeddings from co-occurrence matrices.
towardsdatascience.com
In this article, we’ll discuss one of the newer methods of creating vector space models
of word semantics, more commonly known as word embeddings. The original paper by
J. Pennington, R. Socher, and C. Manning is available here:
http://www.aclweb.org/anthology/D14-1162. This method combines elements from
the two main word embedding models which existed when GloVe, short for “Global
Vectors [for word representation],” was proposed: global matrix factorization and local
context window methods. In Part I, we compared these two different approaches. In
Part II, we began walking through the authors’ development of the GloVe model. Now
we’ll summarize the rest of the derivation.
. . .
Recall that we’re attempting to design a function which maps word vectors to ratios of
co-occurrence probabilities. We have two word vectors which we’d like to discriminate
between, and a context word vector which is used to this effect. Our naive model
simply maps (using magic or whatever) the vectors right to these probabilities.
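In the notation of this series, with w_i and w_j the two word vectors being compared, v_k the context word vector, and P_{ik} the probability of seeing word k in the context of word i, that naive model is just

$$ F(w_i, w_j, v_k) = \frac{P_{ik}}{P_{jk}} $$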
Unfortunately, there is a plethora of different functions that satisfy these constraints, so we must find one that best reflects the relationship we’re trying to model, namely similarity of meaning.
So then the authors decided to use the vector difference of the two words i and j we’re
comparing as an input instead of both of these words individually, since our output is a
ratio between their co-occurrence probabilities with the context word. So now we have
two arguments: the context word vector and the vector difference of the two words
we’re comparing. Since the authors wish to take scalar values to scalar values (note that the
ratio of probabilities is a scalar), the dot product of these two arguments is taken, and
so the next iteration of our model looks like this:
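In the notation above, this is

$$ F\big((w_i - w_j)^\top v_k\big) = \frac{P_{ik}}{P_{jk}} $$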
The next issue we will resolve is that of the labeling of certain words as “context
words”. The problem with this is that the distinction between ordinary word vectors
and context word vectors is in reality arbitrary: there is no distinction. We should be
able to interchange them without causing problems. The way we work around this is by
requiring that F be a homomorphism from the additive group of real numbers to the
multiplicative group of positive real numbers.
Recall from elementary group theory that a homomorphism is a well-defined mapping
which preserves the group operation. So we need the following condition to be
satisfied:
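For F mapping the additive reals to the multiplicative positive reals, that condition is

$$ F(a + b) = F(a)\,F(b) \quad \text{for all } a, b \in \mathbb{R} $$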
Note the addition within the domain of the function and the multiplication in the target
space. Now recall that we said our function’s domain is now scalar, specifically all real
numbers. That means that any input must be the dot product of two word vectors, as
opposed to a single word vector, since a single word vector on its own is not scalar. So
we can think of a and b in the above condition as dot products of two arbitrary pairs of
word vectors, w_a, v_a and w_b, v_b.
Letting V be the vector space where all our word vectors live, we can then rewrite the
condition:
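Under that substitution, the homomorphism condition reads

$$ F\big(w_a^\top v_a + w_b^\top v_b\big) = F\big(w_a^\top v_a\big)\,F\big(w_b^\top v_b\big) \quad \text{for all } w_a, v_a, w_b, v_b \in V $$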
But now remember that we want to define everything in terms of vector differences. So
instead of adding in the domain, we’ll add the additive inverse, i.e. subtract. And since
we want this to be a homomorphism, this will correspond to multiplying by the
multiplicative inverse in the target space (remember the target space is the group of
positive real numbers under multiplication). And this is just division. So we have
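In symbols:

$$ F\big(w_a^\top v_a - w_b^\top v_b\big) = \frac{F\big(w_a^\top v_a\big)}{F\big(w_b^\top v_b\big)} $$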
With a bit of relabeling to reflect the context word vectors v_a and v_b being equal,
and making use of distributivity in Euclidean space, we arrive at the condition the
authors give:
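That condition, with v_a = v_b = v_k and the word vectors relabeled w_i and w_j, is

$$ F\big((w_i - w_j)^\top v_k\big) = \frac{F\big(w_i^\top v_k\big)}{F\big(w_j^\top v_k\big)} $$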
Now, setting this equation equal to the scalar input model we derived above, we have
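(again writing the context word vector as v_k)

$$ \frac{F\big(w_i^\top v_k\big)}{F\big(w_j^\top v_k\big)} = \frac{P_{ik}}{P_{jk}} $$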
and we make the following natural definition for the quantities we’re dividing on the
left:
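(one for each word vector w_i and context word vector v_k)

$$ F\big(w_i^\top v_k\big) = P_{ik} $$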
Recall we defined in Part II X_{ik} to be the number of times word k appears in the
context of word i, and X_i to be the number of times any word appears in the context of
word i.
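To make these counts concrete, here is a minimal Python sketch (hypothetical helper names; a plain symmetric window with no distance weighting, whereas the paper weights a co-occurrence at distance d by 1/d) of how X_{ik}, X_i, and P_{ik} = X_{ik}/X_i could be tallied from a toy corpus:

from collections import defaultdict

def cooccurrence_counts(corpus, window=2):
    """Tally X[i][k]: how many times word k appears within `window`
    positions of word i, summed over every sentence in `corpus`."""
    X = defaultdict(lambda: defaultdict(float))
    for sentence in corpus:
        for pos, word in enumerate(sentence):
            lo = max(0, pos - window)
            hi = min(len(sentence), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    X[word][sentence[ctx_pos]] += 1.0
    return X

def cooccurrence_probability(X, i, k):
    """P_ik = X_ik / X_i: the probability that word k shows up
    in the context of word i."""
    X_i = sum(X[i].values())              # X_i: total context words seen for word i
    return X[i][k] / X_i if X_i else 0.0

# Toy usage
corpus = [["ice", "is", "cold", "and", "solid"],
          ["steam", "is", "hot", "and", "gaseous"]]
X = cooccurrence_counts(corpus, window=2)
print(cooccurrence_probability(X, "ice", "cold"))   # 0.5 for this toy corpus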
Now, what remains is to find a function F which behaves like the arbitrary one we’ve
described above. A nice place to start would be something that gives us a natural
homomorphism between the additive and multiplicative real numbers, i.e. a function
that turns addition into multiplication, or vice versa, as long as we have an inverse
where we need it. So what might work?
We’ll answer that question in Part IV 😊 Thanks so much for reading!
[EMNLP] What is GloVe? Part IV
An introduction to unsupervised learning of word
embeddings from co-occurrence matrices.
towardsdatascience.com
Please check out the source paper!
GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305
jpennin@stanford.edu, richard@socher.org, manning@stanford.edu
Abstract
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
1 Introduction
Semantic vector space models of language represent each word with a real-valued vector. These vectors can be used as features in a variety of applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, 2002), question answering (Tellex et al., 2003), named entity recognition (Turian et al., 2010), and parsing (Socher et al., 2013).

Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of such a set of word representations. Recently, Mikolov et al. (2013c) introduced a new evaluation scheme based on word analogies that probes the finer structure of the word vector space by examining not the scalar distance between word vectors, but rather their various dimensions of difference. For example, the analogy “king is to queen as man is to woman” should be encoded in the vector space by the vector equation king − queen = man − woman. This evaluation scheme favors models that produce dimensions of meaning, thereby capturing the multi-clustering idea of distributed representations (Bengio, 2009).
The two main model families for learning word vectors are: 1) global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990), and 2) local context window methods, such as the skip-gram model of Mikolov et al. (2013c). Currently, both families suffer significant drawbacks. While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.
In this work, we analyze the model properties necessary to produce linear directions of meaning and argue that global log-bilinear regression models are appropriate for doing so. We propose a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy dataset. We also demonstrate that our methods outperform other current methods on several word similarity tasks, and also on a common named entity recognition (NER) benchmark.

We provide the source code for the model as well as trained word vectors at http://nlp.stanford.edu/projects/glove/.