word2vec Summary
Bennett Bullock
October 1, 2016
word2vec is a neural model that learns word vectors by predicting a word from its context (or a context from its word). It has two implementations, the Continuous Bag of Words (CBOW) model and the Skip-Gram model. This summary is a rewrite of “How exactly does word2vec work?” by David Meyer.
1 CBOW Model
The CBOW model attempts to maximize the probability of an output word $w_O$ given an input word $w_I$. For example, if “dog loves food” is the sentence, the $(w_O, w_I)$ pairs are (“loves”, “dog”) and (“food”, “loves”). CBOW maximizes the probability $p(w_O|w_I)$.

$N$ is the number of words in the vocabulary, and $x_I$ is an $N$-dimensional one-hot input vector representing $w_I$. $y_O$ is an $N$-dimensional one-hot output vector representing $w_O$. $D < N$ is the dimensionality of the hidden layer, and also the dimensionality of the final word2vec vectors.
word2vec computes a 2-layer neural network, where $W$ is the $D \times N$ input layer, $V$ is the $N \times D$ output layer, and $\hat{y}_O$ is the predicted vector. The network's prediction is

$$\hat{y}_O = V W x_I = V h_I \qquad (1)$$

where $h_I$ is a $D$-dimensional vector. Since $x_I$ is one-hot, with the $i$-th entry equal to 1, $h_I$ is the $i$-th column of $W$. This $D$-dimensional vector is the final vector used in word2vec applications. If $k$ is the index of the output word, $g_O$ is the $k$-th row of $V$.
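As an illustration of equation (1), the forward pass can be written in a few lines of NumPy. This is a minimal sketch, not a reference implementation; the vocabulary size, dimensionality, word index, and random initialization below are toy values assumed here for concreteness.

    import numpy as np

    N, D = 10, 4                       # toy vocabulary size and hidden dimensionality
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, N))        # input layer: column i holds the word vector h_i
    V = rng.normal(size=(N, D))        # output layer: row k holds the context vector g_k

    i = 3                              # index of the input word w_I (hypothetical)
    x_I = np.zeros(N)
    x_I[i] = 1.0                       # one-hot input vector

    h_I = W @ x_I                      # equals W[:, i], the i-th column of W
    y_hat = V @ h_I                    # \hat{y}_O = V W x_I = V h_I, equation (1)
    assert np.allclose(h_I, W[:, i])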
How do we compute $W$ and $V$? First, we frame the probability using Softmax:

$$p(w_O|w_I) = \frac{\exp(h_I^T g_O)}{\sum_{\hat{O}} \exp(h_I^T g_{\hat{O}})} \qquad (2)$$
where $\hat{O}$ ranges over the other words in the vocabulary. It is important to note that the sum in the denominator is almost never computed explicitly; in practice it is approximated by randomly sampling words. An intuitive explanation of this equation is that $g_O$ is a vector which characterizes the contexts in which $w_I$ appears. Maximizing $h_I^T g_O$ with respect to $h_I$ ensures that words appearing in similar contexts will have similar $h_I$ vectors. Minimizing the sum in the denominator ensures that contexts where $w_I$ does not appear will not have vectors $g_{\hat{O}}$ similar to $h_I$.
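A sketch of equation (2) with the full denominator follows; it is only practical for toy vocabularies, and real implementations replace the sum with sampled words as noted above. All names and values here are illustrative assumptions, not part of the original text.

    import numpy as np

    N, D = 10, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, N))        # input layer
    V = rng.normal(size=(N, D))        # output layer

    i, k = 3, 7                        # hypothetical indices of w_I and w_O
    h_I = W[:, i]                      # hidden vector for the input word
    scores = V @ h_I                   # scores[l] = h_I^T g_l for every word l
    p = np.exp(scores[k]) / np.sum(np.exp(scores))   # p(w_O | w_I), equation (2)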
We maximize the logarithm of this probability. For convenience, we define $s_{IO} = h_I^T g_O$. For an input word $w_I$ and output word $w_O$, the objective we maximize is

$$E_{IO} = s_{IO} - \log \sum_{\hat{O}} \exp(h_I^T g_{\hat{O}}) \qquad (3)$$
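Equation (3) is just the logarithm of equation (2); as a quick numeric check, under the same toy setup and assumed indices as above:

    import numpy as np

    N, D = 10, 4
    rng = np.random.default_rng(0)
    W, V = rng.normal(size=(D, N)), rng.normal(size=(N, D))

    i, k = 3, 7                        # hypothetical input and output word indices
    scores = V @ W[:, i]               # s_{I,l} for every candidate output word l
    E = scores[k] - np.log(np.sum(np.exp(scores)))   # E_IO, equation (3)
    p = np.exp(scores[k]) / np.sum(np.exp(scores))
    assert np.isclose(E, np.log(p))    # E_IO equals log p(w_O | w_I)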
To maximize this function with backpropagation and stochastic gradient ascent, we begin by taking the derivative of the objective with respect to the output layer, $\frac{\partial E_{IO}}{\partial V_{kj}}$, where $V_{kj}$ is the element of $V$ corresponding to the $k$-th output word and the $j$-th hidden dimension. With $i$ the index of the input word and $k$ the index of the output word, we restate $E_{IO}$ as $E_{ik}$ and $s_{IO}$ as $s_{ik}$. Using the chain rule, we can state this in terms of two easily computable derivatives:

$$\frac{\partial E_{ik}}{\partial V_{kj}} = \frac{\partial E_{ik}}{\partial s_{ik}} \frac{\partial s_{ik}}{\partial V_{kj}} \qquad (4)$$
Computing the first derivative on the right-hand side:

$$\frac{\partial E_{ik}}{\partial s_{ik}} = \frac{\partial s_{ik}}{\partial s_{ik}} - \frac{\partial \log \sum_{\hat{O}} \exp(h_I^T g_{\hat{O}})}{\partial s_{ik}} = t_k - \frac{\exp(s_{ik})}{\sum_l \exp(s_{il})} \qquad (5)$$

Here $t_k$ is an indicator for whether the $k$-th element of $y_O$ is one. The second term on the right-hand side is the softmax-normalized $k$-th element of the predicted output $\hat{y}_O = V W x_I$, so this derivative is the error of the output, $e_k = y_{Ok} - \hat{y}_{Ok}$.
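In code, the error of equation (5) is simply the one-hot target minus the softmax of the scores. A toy sketch with assumed indices:

    import numpy as np

    N, D = 10, 4
    rng = np.random.default_rng(0)
    W, V = rng.normal(size=(D, N)), rng.normal(size=(N, D))

    i, k = 3, 7                        # hypothetical input and output word indices
    h_I = W[:, i]
    scores = V @ h_I                   # s_{i,l} for every candidate output word l
    y_hat = np.exp(scores) / np.sum(np.exp(scores))   # softmax-normalized prediction
    y = np.zeros(N)
    y[k] = 1.0                         # one-hot target y_O (t_k = 1 at index k)
    e = y - y_hat                      # e_l = y_Ol - \hat{y}_Ol; e[k] is the error in equation (5)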
Computing $\frac{\partial s_{ik}}{\partial V_{kj}}$ is simpler:

$$\frac{\partial s_{ik}}{\partial V_{kj}} = \frac{\partial \sum_l V_{kl} W_{li}}{\partial V_{kj}} = W_{ji} = h_{Ij} \qquad (6)$$
so that $\frac{\partial E_{ik}}{\partial V_{kj}} = e_k h_{Ij} = e_k W_{ji}$. This gives us the update rule for elements of $V$, with a learning rate of $\eta$:

$$V_{kj}^{(n+1)} = V_{kj}^{(n)} + \eta e_k W_{ji} \qquad (7)$$

Or, in terms of $g_O$:

$$g_O^{(n+1)} = g_O^{(n)} + \eta e_O h_I \qquad (8)$$

where $e_O$ is the error for $w_O$.
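A sketch of the output-layer update of equations (7)-(8) follows, again with toy values; the learning rate and indices are assumptions. Only the row of $V$ belonging to the observed output word is updated here, in keeping with the sampled treatment of the denominator mentioned earlier.

    import numpy as np

    N, D, eta = 10, 4, 0.05            # toy sizes and an assumed learning rate
    rng = np.random.default_rng(0)
    W, V = rng.normal(size=(D, N)), rng.normal(size=(N, D))

    i, k = 3, 7                        # hypothetical input and output word indices
    h_I = W[:, i]
    y_hat = np.exp(V @ h_I)
    y_hat /= y_hat.sum()               # softmax prediction
    y = np.zeros(N)
    y[k] = 1.0
    e = y - y_hat                      # errors as in equation (5)

    V[k, :] += eta * e[k] * h_I        # equation (8): g_O <- g_O + eta * e_O * h_I
    # With the full softmax, every row would be updated: V += eta * np.outer(e, h_I)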
The derivative of the objective with respect to the input layer follows the same chain rule. Writing $W_{mi}$ for the element of $W$ in the $m$-th hidden dimension and the $i$-th (input word) column:

$$\frac{\partial E_{ik}}{\partial W_{mi}} = \frac{\partial E_{ik}}{\partial s_{ik}} \frac{\partial s_{ik}}{\partial W_{mi}} \qquad (9)$$

The first factor is the output error from equation (5):

$$\frac{\partial E_{ik}}{\partial W_{mi}} = e_k \frac{\partial s_{ik}}{\partial W_{mi}} \qquad (10)$$
The derivative with respect to $W_{mi}$ is straightforward:

$$\frac{\partial s_{ik}}{\partial W_{mi}} = \frac{\partial \sum_l h_{Il} g_{Ol}}{\partial h_{Im}} = \frac{\partial \sum_l V_{kl} W_{li}}{\partial W_{mi}} = V_{km} = g_{Om} \qquad (11)$$

which gives us $\frac{\partial E_{ik}}{\partial W_{mi}} = e_k V_{km} = e_O g_{Om}$, and the update rules for $W_{mi}$ and $h_I$:

$$W_{mi}^{(n+1)} = W_{mi}^{(n)} + \eta e_O V_{km} \qquad (12)$$

$$h_I^{(n+1)} = h_I^{(n)} + \eta e_O g_O \qquad (13)$$
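Putting the two update rules together gives one CBOW training step on a single $(w_I, w_O)$ pair. This is a sketch of gradient ascent on equation (3) under the same toy assumptions as before, not production word2vec, which adds negative sampling, averaging over a context window, and other refinements.

    import numpy as np

    N, D, eta = 10, 4, 0.05
    rng = np.random.default_rng(0)
    W, V = rng.normal(size=(D, N)), rng.normal(size=(N, D))

    def cbow_step(i, k):
        """One ascent step on E_IO for input word i and output word k."""
        h_I = W[:, i].copy()               # hidden vector, equation (1)
        y_hat = np.exp(V @ h_I)
        y_hat /= y_hat.sum()
        e_O = 1.0 - y_hat[k]               # error for the observed output word, equation (5)
        g_O = V[k, :].copy()
        V[k, :] += eta * e_O * h_I         # equation (8)
        W[:, i] += eta * e_O * g_O         # equation (13), using g_O from before its update
        return e_O

    cbow_step(3, 7)                        # hypothetical (w_I, w_O) index pair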
2 Skip-Gram Model
In the skip-gram model, we want to predict the context from the word: we have the same layers $V$ and $W$, but we want to predict the $L$ words $w_{O_l}$ that occur before and/or after $w_I$. We maximize, using Softmax:

$$p(w_{O_1}, w_{O_2}, \ldots, w_{O_L} | w_I) = \prod_{l=1}^{L} \frac{\exp(h_I^T g_{O_l})}{\sum_{\hat{O}} \exp(h_I^T g_{\hat{O}})} \qquad (14)$$
Taking the logarithm gives our objective, where $O$ now represents the whole context rather than a single output word:

$$E_{IO} = \sum_{l=1}^{L} h_I^T g_{O_l} - L \log \sum_{\hat{O}} \exp(h_I^T g_{\hat{O}}) \qquad (15)$$
We use the chain rule as in the first section to compute the derivative for the output layer, but we sum the errors over the $L$ context positions. Let $e_{l,k}$ be the error for output word $k$ at context position $l$, defined as in equation (5) with $t_{l,k}$ indicating whether word $k$ is the word observed at position $l$. Then

$$\frac{\partial E_{IO}}{\partial V_{kj}} = \sum_{l=1}^{L} e_{l,k} \frac{\partial s_{ik}}{\partial V_{kj}} = \sum_{l=1}^{L} e_{l,k} h_{Ij} \qquad (16)$$
giving us the update rules:

$$V_{kj}^{(n+1)} = V_{kj}^{(n)} + \eta \sum_{l=1}^{L} e_{l,k} h_{Ij} \qquad (17)$$

$$g_O^{(n+1)} = g_O^{(n)} + \eta \Big( \sum_{l=1}^{L} e_{l,O} \Big) h_I \qquad (18)$$

Update rules for the input layer are derived in the same way as in the previous section, but with the errors summed over the context. Writing $e_{O_l} = e_{l,O_l}$ for the error of the word actually observed at position $l$:

$$W_{mi}^{(n+1)} = W_{mi}^{(n)} + \eta \sum_{l=1}^{L} e_{O_l} V_{O_l m} \qquad (19)$$

$$h_I^{(n+1)} = h_I^{(n)} + \eta \sum_{l=1}^{L} e_{O_l} g_{O_l} \qquad (20)$$
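Finally, a corresponding sketch of one skip-gram step implementing equations (18) and (20), with a hypothetical input word and context; as before, the sizes, indices, and learning rate are assumptions, and the full softmax stands in for the sampled denominator used in practice.

    import numpy as np

    N, D, eta = 10, 4, 0.05
    rng = np.random.default_rng(0)
    W, V = rng.normal(size=(D, N)), rng.normal(size=(N, D))

    def skipgram_step(i, context):
        """One ascent step for input word i and a list of L context word indices."""
        h_I = W[:, i].copy()
        y_hat = np.exp(V @ h_I)
        y_hat /= y_hat.sum()               # full softmax; real code samples the denominator
        grad_h = np.zeros(D)
        for k in context:                  # sum the errors over the L context positions
            e = 1.0 - y_hat[k]             # e_{O_l}: error of the word observed at this position
            grad_h += e * V[k, :]          # accumulates sum_l e_{O_l} g_{O_l}
            V[k, :] += eta * e * h_I       # equation (18), applied one position at a time
        W[:, i] += eta * grad_h            # equation (20)

    skipgram_step(3, [2, 7, 9])            # hypothetical input word and context indices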