word2vec Summary
Bennett Bullock
October 1, 2016
word2vec is a neural model that learns word vectors by predicting a word from its context (or a context from its word). It has two implementations, the Continuous Bag of Words (CBOW) model and the Skip-Gram model. This summary is a rewrite of “How exactly does word2vec work?” by David Meyer.
1 CBOW Model
The CBOW model attempts to maximize the probability of an output word $w_O$ given an input word $w_I$. For example, if “dog loves food” is the sentence, the $(w_O, w_I)$ pairs are (“loves”, “dog”) and (“food”, “loves”). CBOW maximizes the probability $p(w_O|w_I)$.

$N$ is the number of words in the vocabulary, and $x_I$ is an $N$-dimensional one-hot input vector representing $w_I$. $y_O$ is an $N$-dimensional one-hot output vector representing $w_O$. $D < N$ is the dimensionality of the hidden layer, and also the dimensionality of the final word2vec vectors.
word2vec computes a 2-layer neural network, where $W$ is the $D \times N$ input layer, $V$ is the $N \times D$ output layer, and $\hat{y}_O$ is the predicted vector. The network's prediction is

$$\hat{y}_O = V W x_I = V h_I \qquad (1)$$

where $h_I$ is a $D$-dimensional vector. Since $x_I$ is one-hot, with the $i$-th entry equal to 1, $h_I$ is the $i$-th column of $W$. This $D$-dimensional vector is the final vector used in word2vec applications. If $k$ is the index of the output word, $g_O$ is the $k$-th row of $V$.
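As an illustration of equation (1), the forward pass can be written in a few lines of NumPy. This is a minimal sketch, not a reference implementation; the vocabulary size, dimensionality, word index, and random initialization below are toy values assumed here for concreteness.

    import numpy as np

    N, D = 10, 4                       # toy vocabulary size and hidden dimensionality
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, N))        # input layer: column i holds the word vector h_i
    V = rng.normal(size=(N, D))        # output layer: row k holds the context vector g_k

    i = 3                              # index of the input word w_I (hypothetical)
    x_I = np.zeros(N)
    x_I[i] = 1.0                       # one-hot input vector

    h_I = W @ x_I                      # equals W[:, i], the i-th column of W
    y_hat = V @ h_I                    # \hat{y}_O = V W x_I = V h_I, equation (1)
    assert np.allclose(h_I, W[:, i])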
How do we compute $W$ and $V$? First, we frame the probability using Softmax:

$$p(w_O|w_I) = \frac{\exp(h_I^T g_O)}{\sum_{\hat{O}} \exp(h_I^T g_{\hat{O}})} \qquad (2)$$
where $\hat{O}$ ranges over the other words in the vocabulary. It is important to note that the sum in the denominator is almost never computed explicitly; in practice it is approximated by randomly sampling words. An intuitive explanation of this equation is that $g_O$ is a vector which characterizes the contexts in which $w_I$ appears. Maximizing $h_I^T g_O$ with respect to $h_I$ ensures that words appearing in similar contexts will have similar $h_I$ vectors. Minimizing the sum in the denominator ensures that contexts where $w_I$ does not appear will not have vectors $g_{\hat{O}}$ similar to $h_I$.
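A sketch of equation (2) with the full denominator follows; it is only practical for toy vocabularies, and real implementations replace the sum with sampled words as noted above. All names and values here are illustrative assumptions, not part of the original text.

    import numpy as np

    N, D = 10, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, N))        # input layer
    V = rng.normal(size=(N, D))        # output layer

    i, k = 3, 7                        # hypothetical indices of w_I and w_O
    h_I = W[:, i]                      # hidden vector for the input word
    scores = V @ h_I                   # scores[l] = h_I^T g_l for every word l
    p = np.exp(scores[k]) / np.sum(np.exp(scores))   # p(w_O | w_I), equation (2)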
We maximize the logarithm of this probability. For convenience, we define $s_{IO} = h_I^T g_O$. For an input word $w_I$ and output word $w_O$, the objective we maximize is

$$E_{IO} = s_{IO} - \log \sum_{\hat{O}} \exp(h_I^T g_{\hat{O}}) \qquad (3)$$
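Equation (3) is just the logarithm of equation (2); as a quick numeric check, under the same toy setup and assumed indices as above:

    import numpy as np

    N, D = 10, 4
    rng = np.random.default_rng(0)
    W, V = rng.normal(size=(D, N)), rng.normal(size=(N, D))

    i, k = 3, 7                        # hypothetical input and output word indices
    scores = V @ W[:, i]               # s_{I,l} for every candidate output word l
    E = scores[k] - np.log(np.sum(np.exp(scores)))   # E_IO, equation (3)
    p = np.exp(scores[k]) / np.sum(np.exp(scores))
    assert np.isclose(E, np.log(p))    # E_IO equals log p(w_O | w_I)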
To maximize this function with backpropagation and stochastic gradient ascent, we begin by taking the derivative of the objective with respect to the output layer, $\frac{\partial E_{IO}}{\partial V_{kj}}$, where $V_{kj}$ is the element of $V$ corresponding to the $k$-th output word and the $j$-th hidden dimension. With $i$ the index of the input word and $k$ the index of the output word, we restate $E_{IO}$ as $E_{ik}$ and $s_{IO}$ as $s_{ik}$. Using the chain rule, we can state this in terms of two easily computable derivatives:

$$\frac{\partial E_{ik}}{\partial V_{kj}} = \frac{\partial E_{ik}}{\partial s_{ik}} \frac{\partial s_{ik}}{\partial V_{kj}} \qquad (4)$$
Computing the first derivative on the right-hand side:

$$\frac{\partial E_{ik}}{\partial s_{ik}} = \frac{\partial s_{ik}}{\partial s_{ik}} - \frac{\partial \log \sum_{\hat{O}} \exp(h_I^T g_{\hat{O}})}{\partial s_{ik}} = t_k - \frac{\exp(s_{ik})}{\sum_l \exp(s_{il})} \qquad (5)$$

Here $t_k$ is an indicator for whether the $k$-th element of $y_O$ is one. The second term on the right-hand side is the softmax-normalized $k$-th element of the predicted output $\hat{y}_O = V W x_I$, so this derivative is the error of the output, $e_k = y_{Ok} - \hat{y}_{Ok}$.
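In code, the error of equation (5) is simply the one-hot target minus the softmax of the scores. A toy sketch with assumed indices:

    import numpy as np

    N, D = 10, 4
    rng = np.random.default_rng(0)
    W, V = rng.normal(size=(D, N)), rng.normal(size=(N, D))

    i, k = 3, 7                        # hypothetical input and output word indices
    h_I = W[:, i]
    scores = V @ h_I                   # s_{i,l} for every candidate output word l
    y_hat = np.exp(scores) / np.sum(np.exp(scores))   # softmax-normalized prediction
    y = np.zeros(N)
    y[k] = 1.0                         # one-hot target y_O (t_k = 1 at index k)
    e = y - y_hat                      # e_l = y_Ol - \hat{y}_Ol; e[k] is the error in equation (5)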
Computing $\frac{\partial s_{ik}}{\partial V_{kj}}$ is simpler:

$$\frac{\partial s_{ik}}{\partial V_{kj}} = \frac{\partial \sum_l V_{kl} W_{li}}{\partial V_{kj}} = W_{ji} = h_{Ij} \qquad (6)$$
so that $\frac{\partial E_{ik}}{\partial V_{kj}} = e_k h_{Ij} = e_k W_{ji}$. This gives us the update rule for elements of $V$, with a learning rate of $\eta$:

$$V_{kj}^{(n+1)} = V_{kj}^{(n)} + \eta e_k W_{ji} \qquad (7)$$

Or, in terms of $g_O$:

$$g_O^{(n+1)} = g_O^{(n)} + \eta e_O h_I \qquad (8)$$

where $e_O$ is the error for $w_O$.
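A sketch of the output-layer update of equations (7)-(8) follows, again with toy values; the learning rate and indices are assumptions. Only the row of $V$ belonging to the observed output word is updated here, in keeping with the sampled treatment of the denominator mentioned earlier.

    import numpy as np

    N, D, eta = 10, 4, 0.05            # toy sizes and an assumed learning rate
    rng = np.random.default_rng(0)
    W, V = rng.normal(size=(D, N)), rng.normal(size=(N, D))

    i, k = 3, 7                        # hypothetical input and output word indices
    h_I = W[:, i]
    y_hat = np.exp(V @ h_I)
    y_hat /= y_hat.sum()               # softmax prediction
    y = np.zeros(N)
    y[k] = 1.0
    e = y - y_hat                      # errors as in equation (5)

    V[k, :] += eta * e[k] * h_I        # equation (8): g_O <- g_O + eta * e_O * h_I
    # With the full softmax, every row would be updated: V += eta * np.outer(e, h_I)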
The derivative of the objective with respect to the input layer follows the same chain rule. Writing $W_{mi}$ for the element of $W$ in the $m$-th hidden dimension and the $i$-th (input word) column:

$$\frac{\partial E_{ik}}{\partial W_{mi}} = \frac{\partial E_{ik}}{\partial s_{ik}} \frac{\partial s_{ik}}{\partial W_{mi}} \qquad (9)$$

The first factor is the output error from equation (5):

$$\frac{\partial E_{ik}}{\partial W_{mi}} = e_k \frac{\partial s_{ik}}{\partial W_{mi}} \qquad (10)$$
The derivative with respect to $W_{mi}$ is straightforward:

$$\frac{\partial s_{ik}}{\partial W_{mi}} = \frac{\partial \sum_l h_{Il} g_{Ol}}{\partial h_{Im}} = \frac{\partial \sum_l V_{kl} W_{li}}{\partial W_{mi}} = V_{km} = g_{Om} \qquad (11)$$

which gives us $\frac{\partial E_{ik}}{\partial W_{mi}} = e_k V_{km} = e_O g_{Om}$, and the update rules for $W_{mi}$ and $h_I$:

$$W_{mi}^{(n+1)} = W_{mi}^{(n)} + \eta e_O V_{km} \qquad (12)$$

$$h_I^{(n+1)} = h_I^{(n)} + \eta e_O g_O \qquad (13)$$
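Putting the two update rules together gives one CBOW training step on a single $(w_I, w_O)$ pair. This is a sketch of gradient ascent on equation (3) under the same toy assumptions as before, not production word2vec, which adds negative sampling, averaging over a context window, and other refinements.

    import numpy as np

    N, D, eta = 10, 4, 0.05
    rng = np.random.default_rng(0)
    W, V = rng.normal(size=(D, N)), rng.normal(size=(N, D))

    def cbow_step(i, k):
        """One ascent step on E_IO for input word i and output word k."""
        h_I = W[:, i].copy()               # hidden vector, equation (1)
        y_hat = np.exp(V @ h_I)
        y_hat /= y_hat.sum()
        e_O = 1.0 - y_hat[k]               # error for the observed output word, equation (5)
        g_O = V[k, :].copy()
        V[k, :] += eta * e_O * h_I         # equation (8)
        W[:, i] += eta * e_O * g_O         # equation (13), using g_O from before its update
        return e_O

    cbow_step(3, 7)                        # hypothetical (w_I, w_O) index pair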
2 Skip-Gram Model
In the skip-gram model, we want to predict the context from the word: we have the same layers $V$ and $W$, but we want to predict the $L$ words $w_{O_l}$ that occur before and/or after $w_I$. We maximize, using Softmax:

$$p(w_{O_1}, w_{O_2}, \ldots, w_{O_L} | w_I) = \prod_{l=1}^{L} \frac{\exp(h_I^T g_{O_l})}{\sum_{\hat{O}} \exp(h_I^T g_{\hat{O}})} \qquad (14)$$
Taking the logarithm gives our objective, where $O$ now represents the whole context rather than a single output word:

$$E_{IO} = \sum_{l=1}^{L} h_I^T g_{O_l} - L \log \sum_{\hat{O}} \exp(h_I^T g_{\hat{O}}) \qquad (15)$$
We use the chain rule as in the first section to compute the derivative for the output layer, but we sum the errors over the $L$ context positions. Let $e_{l,k}$ be the error for output word $k$ at context position $l$, defined as in equation (5) with $t_{l,k}$ indicating whether word $k$ is the word observed at position $l$. Then

$$\frac{\partial E_{IO}}{\partial V_{kj}} = \sum_{l=1}^{L} e_{l,k} \frac{\partial s_{ik}}{\partial V_{kj}} = \sum_{l=1}^{L} e_{l,k} h_{Ij} \qquad (16)$$
giving us the update rules:

$$V_{kj}^{(n+1)} = V_{kj}^{(n)} + \eta \sum_{l=1}^{L} e_{l,k} h_{Ij} \qquad (17)$$

$$g_O^{(n+1)} = g_O^{(n)} + \eta \Big( \sum_{l=1}^{L} e_{l,O} \Big) h_I \qquad (18)$$

Update rules for the input layer are derived in the same way as in the previous section, but with the errors summed over the context. Writing $e_{O_l} = e_{l,O_l}$ for the error of the word actually observed at position $l$:

$$W_{mi}^{(n+1)} = W_{mi}^{(n)} + \eta \sum_{l=1}^{L} e_{O_l} V_{O_l m} \qquad (19)$$

$$h_I^{(n+1)} = h_I^{(n)} + \eta \sum_{l=1}^{L} e_{O_l} g_{O_l} \qquad (20)$$
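Finally, a corresponding sketch of one skip-gram step implementing equations (18) and (20), with a hypothetical input word and context; as before, the sizes, indices, and learning rate are assumptions, and the full softmax stands in for the sampled denominator used in practice.

    import numpy as np

    N, D, eta = 10, 4, 0.05
    rng = np.random.default_rng(0)
    W, V = rng.normal(size=(D, N)), rng.normal(size=(N, D))

    def skipgram_step(i, context):
        """One ascent step for input word i and a list of L context word indices."""
        h_I = W[:, i].copy()
        y_hat = np.exp(V @ h_I)
        y_hat /= y_hat.sum()               # full softmax; real code samples the denominator
        grad_h = np.zeros(D)
        for k in context:                  # sum the errors over the L context positions
            e = 1.0 - y_hat[k]             # e_{O_l}: error of the word observed at this position
            grad_h += e * V[k, :]          # accumulates sum_l e_{O_l} g_{O_l}
            V[k, :] += eta * e * h_I       # equation (18), applied one position at a time
        W[:, i] += eta * grad_h            # equation (20)

    skipgram_step(3, [2, 7, 9])            # hypothetical input word and context indices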