Word2vec from scratch
11/10/2015
Jinpyo Lee
KAIST
Contents
• Introduction
• Previous Methods for Representing Words
• Word2Vec
• Extensions of skip-gram model / Learning Phrases / Additive
Compositionality & Evaluation
• Conclusion
• Demo
• Discussions
• References
Introduction
• Example of NLP processing
• EASY
• Spell Chekcing (Checking)
• Keyword Search (Ctrl+F)
• Finding Synonyms
• MEDIUM
• Parsing information from documents, the web, etc.
• HARD
• Machine Translation (e.g. Translate Korean to English)
• Semantic Analysis (e.g. What is the meaning of this query?)
• Co-reference (e.g. What does “it” refer to in this sentence?)
• Question Answering (e.g. IBM Watson)
Introduction
• BUT, the most important thing is
how we represent words
as input for all the NLP tasks.
Introduction
• BUT, the most important thing is
how we represent the meaning of words
as input for all the NLP tasks.
• At first, most NLP treated words as ATOMIC symbols
• A notion of similarity & difference was needed
• So,
• WordNet: a taxonomy with hypernym (is-a)
relationships and synonym sets
* Simple example of WordNet showing synonyms and antonyms
Prev. Methods for Representing Words
- Discrete Representation
• COOL! (see also: the Semantic Web)
• Great resource, but missing nuances
Is Expert == Good? Usually?
→ Probably NOT!
* Synonym set of “good” using the nltk lib (CS224d lecture note)
What about new words?
e.g. wicked, ace, wizard, genius, ninja
- Discrete Representation
Prev. Methods for Representing Words
• COOL! (see also: the Semantic Web)
• Great resource, but missing nuances
* Synonym set of “good” using the nltk lib (CS224d lecture note)
Disadvantages
• Hard to keep up to date
• Requires human labor
• Subjective
• Hard to compute accurate word similarity
- Discrete Representation
Prev. Methods for Representing Words
• Another problem of the discrete representation
• Can’t give similarity
• Too sparse
e.g. Horse = [ 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ]
Zebra = [ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ]
→ “one-hot” representation: the typical, simple representation.
All 0s with a single 1
Horse ∩ Zebra
= [ 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ] ∩ [ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ]
= 0 (nothing) (But we know both are mammals)
- Discrete Representation
Prev. Methods for Representing Words
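To make the sparsity problem concrete, here is a tiny illustrative sketch (the toy vocabulary and indices are arbitrary assumptions) showing that any two distinct one-hot vectors have zero overlap:

# Toy "one-hot" vectors: any two distinct words have zero overlap, however related.
import numpy as np

vocab = ["horse", "zebra", "mammal", "car"]            # arbitrary toy vocabulary
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["horse"])                                # [1. 0. 0. 0.]
print(one_hot["zebra"])                                # [0. 1. 0. 0.]
print(one_hot["horse"] @ one_hot["zebra"])             # 0.0 -> nothing in common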
• Use neighbors to represent words! (Co-occurrence)
• Conjecture: words that are related will often appear in the same
documents.
→ A window allows capturing both syntactic and semantic info.
e.g. corpus: “I enjoy baseball.”, “I like NLP.”, “I like deep learning.”
* Co-occurrence matrix with window size = 1 (CS224d lecture note)
(“like” co-occurs beside “I” 2 times)
Prev. Methods for Representing Words
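A small sketch of how such a window-1 co-occurrence matrix can be built from the three-sentence toy corpus above; the tokenization is deliberately naive:

# Build the window-1 co-occurrence matrix for the toy corpus above.
import numpy as np

corpus = ["I enjoy baseball", "I like NLP", "I like deep learning"]
tokens = [s.lower().split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

print(vocab)
print(X[idx["i"], idx["like"]])    # 2: "like" appears next to "i" twice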
• Use this matrix for word embedding (via SVD)
• Apply Singular Value Decomposition:
SVD: X (co-occurrence matrix) = U · S · Vᵀ
(details can be found in any linear algebra textbook)
• Select the first k columns of U as k-dimensional word vectors
Prev. Methods for Representing Words
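Continuing the sketch above (reusing X, vocab, idx, and numpy from it), a minimal SVD-based embedding with an arbitrarily small k:

# Reduce the co-occurrence matrix X to k-dimensional word vectors via SVD.
U, S, Vt = np.linalg.svd(X.astype(float))
k = 2
word_vectors = U[:, :k]            # one k-dimensional vector per vocabulary word
for w in vocab:
    print(w, word_vectors[idx[w]])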
• Results of the SVD-based model
* Plots of the resulting word vectors for K = 2 and K = 3 (figure)
Prev. Methods for Representing Words
• Disadvantages
• The co-occurrence matrix is extremely sparse
• Very high dimensional
• Quadratic cost to train (i.e. performing the SVD)
• Needs hacks for the imbalance in word frequency
(e.g. “it”, “the”, “has”, etc.)
• Some solutions exist for these problems, but they are not intrinsic
Prev. Methods for Representing Words
Contents
• Introduction
• Previous Methods for Representing Words
• Word2Vec
• Extensions of skip-gram model / Learning Phrases / Additive
Compositionality & Evaluation
• Conclusion
• Demo
• Discussions
• References
Word2vec (related paper)
• Then how?
Directly learn low-dimensional word vectors, iteratively (one step at a time)!
→ Go back to 1986
• Learning representations by back-propagating errors
(Rumelhart et al. 1986)
• A neural probabilistic language model (Bengio et al., 2003)
• NLP from Scratch (Collobert & Weston, 2008)
• Word2Vec (Mikolov et al. 2013)
• Efficient Estimation of Word Representation in Vector Space
• Distributed Representations of words and phrases and their
compositionality
7/31
Efficient Estimation of Word
Representation in Vector Space
• Introduces the initial architecture of word2vec (2013)
• Two new models: Continuous Bag-of-Words (CBOW) and Skip-gram
• Empirically shows that these word models have better syntactic and semantic
representations than other models
• Comparison of the two models
• The Skip-gram model works well on semantic tasks, but training is slower.
• The CBOW model works well on syntactic tasks, and training is faster.
(P)Review
8/31
Word2vec (profile)
• Distributed Representations of words and phrases
and their compositionality
• NIPS 2013 (Submitted on 16 Oct 2013)
• Tomas Mikolov (at Facebook since 2014) et al.
• Includes additional work beyond “Efficient Estimation of Word
Representation in Vector Space”.
9/31
Word2vec (Contents)
• This paper includes,
• Extensions of skip-gram model (fast & accurate)
• Method
• Hierarchical soft-max
• NEG
• Subsampling
• Ability to learn phrases
• Finding additive compositionality
• Conclusion
10/31
• Skip-gram model
• The objective of the skip-gram model is to “find word representations
useful for predicting context words in a sentence.”
• Softmax function (see the reconstruction below)
Extension of Skip-Gram
T: number of training steps (words in the corpus)
c: size of the training context window
w_t, w_{t+j}: the current step’s word and its j-th context word
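The objective formula on this slide was an image and did not survive extraction; reconstructed from the paper using the definitions above (the skip-gram model maximizes the average log probability of the context words):

\[
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p\left(w_{t+j} \mid w_t\right)
\]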
BUT, without understanding the
original model, we are
going to fall asleep...
11/31
Example
13/31
CBOW (Original)
• Continuous Bag-of-Words model
• Idea: using the context words, we can predict the center word
i.e. Probability( “It is ( ? ) to finish” → “time” )
• Represent a word as a distributed vector of probabilities → low dimension
• Goal: train the weight matrices (W) to satisfy the objective below
• Loss function (using the cross-entropy method):
i.e. find W that minimizes the gap between the one-hot vector for “time”
and softmax( p(“time” | it, is, to, finish); W )
* softmax(): maps a K-dim real vector x to a K-dim vector with entries in (0, 1)
E = − log p(w_t | w_{t−C} .. w_{t+C})
(context words, window_size = 2)
12/31
CBOW (Original)
• Continuous Bag-of-Words model
• Input
• “one-hot” word vectors
• Removes the nonlinear hidden layer
• Back-propagates the error from the
output layer to the weight matrices
(adjusting the W’s)
* Architecture sketch (figure): the context words (It, is, to, finish) enter as one-hot vectors x_i
→ hidden vector h = average of W_in · x_i over the context   ( [N×V]·[V×1] → [N×1] )
→ predicted ŷ = W_out^T · h                                   ( [V×N]·[N×1] → [V×1] )
→ compare ŷ with the true one-hot vector y (“time”) and back-propagate to minimize the error:
W_in(old), W_out(old) → W_in(new), W_out(new)
W_in, W_out ∈ ℝ^{n×|V|}: input and output weight matrices, n = word-embedding dimension
x_i, y_i: input and output (one-hot) word vectors from vocabulary V
h: hidden vector, the average of W_in · x (the one-hot vectors are the initial input, not results)
14/31
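A minimal numpy sketch of one CBOW training step on the toy example above. This is an illustrative reimplementation, not the authors’ C code; the toy vocabulary, embedding dimension, and learning rate are arbitrary assumptions.

# One CBOW step: predict the center word "time" from the context (it, is, to, finish).
import numpy as np

vocab = ["it", "is", "time", "to", "finish"]
V, N = len(vocab), 3                          # vocabulary size, embedding dimension
idx = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(N, V))     # input weight matrix  W_in  (N x V)
W_out = rng.normal(scale=0.1, size=(N, V))    # output weight matrix W_out (N x V)

def one_hot(word):
    x = np.zeros(V)
    x[idx[word]] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

context, center = ["it", "is", "to", "finish"], "time"

# Forward pass: h is the average of W_in @ x over the context words.
h = np.mean([W_in @ one_hot(w) for w in context], axis=0)    # (N,)
y_pred = softmax(W_out.T @ h)                                # (V,)
y_true = one_hot(center)
loss = -np.log(y_pred[idx[center]])           # E = -log p(center | context)

# Backward pass: gradients of softmax + cross-entropy, then update both matrices.
lr = 0.1
e = y_pred - y_true                           # (V,)
grad_h = W_out @ e                            # error back-propagated into the hidden vector
W_out -= lr * np.outer(h, e)                  # dE/dW_out = h outer e
for w in context:                             # each context column shares 1/|C| of grad_h
    W_in[:, idx[w]] -= lr * grad_h / len(context)
print(loss)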
• Skip-gram model
• Idea: with the center word,
we can predict the context words
• Mirror of CBOW (and vice versa)
i.e. Probability( “time” → “It is ( ? ) to finish” )
• Loss function:
Skip-Gram (Original)
E = − log p(w_{t−C} .. w_{t+C} | w_t)
* Architecture sketch (figure): the center word “time” enters as a one-hot vector x_i
→ hidden vector h = W_in · x_i            ( [N×V]·[V×1] → [N×1] )
→ predicted context words y_i (It, is, to, finish) via W_out   ( [V×N]·[N×1] → [V×1] )
→ back-propagate: W_in(old), W_out(old) → W_in(new), W_out(new)
(compare CBOW: E = − log p(w_t | w_{t−C} .. w_{t+C}))
15/31
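Mirroring the CBOW sketch above, a self-contained toy sketch of the skip-gram forward pass and loss (again illustrative only; vocabulary and dimensions are arbitrary):

# Toy skip-gram forward pass: the center word "time" predicts each context word.
import numpy as np

vocab = ["it", "is", "time", "to", "finish"]
V, N = len(vocab), 3
idx = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(N, V)), rng.normal(size=(N, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

center, context = "time", ["it", "is", "to", "finish"]
h = W_in[:, idx[center]]              # hidden vector = the column of W_in for "time"
p = softmax(W_out.T @ h)              # one distribution over the vocabulary

# E = -log p(w_{t-C} .. w_{t+C} | w_t) = sum of -log p(context word | center)
loss = -sum(np.log(p[idx[w]]) for w in context)
print(loss)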
• Hierarchical Softmax function
• To train the weight matrices at every step, we need to pass the
computed vector into the loss function
• Softmax function
• Before computing the loss, the computed vector should be
normalized to real numbers in (0, 1)
Extension of Skip-Gram (1)
T: number of training steps
c: size of the training context window
w_t, w_{t+j}: the current step’s word and its j-th context word
(E = − log p(w_{t−C} .. w_{t+C} | w_t))
16/31
• Hierarchical Softmax function (cont.)
• Softmax function
(I have already calculated it, and it’s boring ...)
Extension of Skip-Gram (1)
The original softmax function
of the skip-gram model (reconstructed below)
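The softmax formula itself was an image on the slide; as given in the paper:

\[
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
\]

where v_w and v'_w are the “input” and “output” vector representations of w, and W is the vocabulary size.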
17/31
• Hierarchical Softmax function (cont.)
• Since V is quite large, computing log p(w_O | w_I) costs too much
• Idea: construct a binary Huffman tree over the words
→ Cost: from O(|V|) down to O(log |V|)
• Can train faster!
• Assigning
• Each word w is a leaf node, reached by a path of length L(w) from the root;
the tree defines a random walk that assigns probabilities to words
(* details in “Hierarchical Probabilistic Neural Network Language Model”)
Extension of Skip-Gram (1)
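The hierarchical-softmax probability (its formula image is missing from the slide) as defined in the paper, where n(w, j) is the j-th node on the path from the root to the leaf w, L(w) is the path length, ch(n) is an arbitrary fixed child of n, and [[x]] is 1 if x is true and −1 otherwise:

\[
p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\left( [[\, n(w, j{+}1) = \mathrm{ch}(n(w, j)) \,]] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right)
\]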
18/31
• Negative Sampling (similar to NCE, Noise Contrastive Estimation)
• The vocabulary is computationally huge! → slow to train
• Idea: just sample several negative examples!
• Do not loop over the full vocabulary, only use the negative samples → fast
• Learn to distinguish the target word from negative samples
→ more accurate
• Objective function (replaces each log p(w_O | w_I) term; see below)
Extension of Skip-Gram (2)
i.e. “Stock boil fish is toy” ???? → a negative sample
(skip-gram loss: E = − log p(w_{t−C} .. w_{t+C} | w_t))
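The negative-sampling objective (the formula image is missing) as defined in the paper; it replaces every log p(w_O | w_I) term of the skip-gram objective, with k negative samples drawn from a noise distribution P_n(w) (the paper uses the unigram distribution raised to the 3/4 power):

\[
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
\]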
19/31
• Subsampling
• (“Korea”, ”Seoul”) is a helpful pair, but (“Korea”, ”the”) isn’t
• Idea: vectors of frequent words (e.g. “the”) do not change
significantly after training on several million examples
• Each word w_i in the training set is discarded with the probability
given below
• It aggressively subsamples frequent words while preserving the
ranking of the frequencies
• But, this formula was chosen heuristically...
Extension of Skip-Gram (3)
f(w_i): frequency of word w_i
t: a chosen threshold, around 10^-5
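The discard probability (its formula image is missing) as given in the paper:

\[
P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
\]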
20/31
• Evaluation
• Task: analogical reasoning
• Accuracy test using cosine similarity to determine whether the model
answers correctly
i.e. vec(X) = vec(“Berlin”) – vec(“Germany”) + vec(“France”)
Correct if vec(“Paris”) is the closest vector to vec(X) (by cosine similarity)
• Model: skip-gram (word-embedding dimension = 300)
• Data set: news articles (Google dataset with 1 billion words)
• Compared methods (with or without 10^-5 subsampling)
• NEG(Negative Sampling)-5, 15
• Hierarchical Softmax-Huffman
• NCE-5(Noise Contrastive Estimation)
Extension of Skip-Gram
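As a concrete illustration, a small sketch of this analogy check using the gensim library (the rare-technologies tutorial in the references is built on it). The pretrained-vector file name is an assumption; point it at whatever word2vec-format model you have:

from gensim.models import KeyedVectors

# Load pretrained vectors (file name is an assumption, adjust to your copy).
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# vec(X) = vec("Berlin") - vec("Germany") + vec("France"); correct if "Paris" is nearest.
print(vectors.most_similar(positive=["Berlin", "France"], negative=["Germany"], topn=1))
print(vectors.similarity("Paris", "France"))    # plain cosine similarity between two words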
21/31
• Empirical results
• The model with NEG outperforms HS on the analogical reasoning task
(and is even slightly better than NCE)
• Subsampling improves the training speed several times over and
makes the word representations more accurate
Extension of Skip-Gram
22/31
• Word-based models cannot represent idiomatic phrases
• e.g. “New York Times”, “Larry Page”
• Simple data-driven approach
• Phrases are formed based on unigram and bigram counts, using the score below
• Word pairs with a high score are treated as meaningful phrases
Learning Phrases
δ: discounting coefficient
(prevents too many phrases consisting of infrequent words)
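The phrase score (its formula image is missing) as given in the paper:

\[
\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}
\]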
23/31
• Evaluation
• Task: analogical reasoning with phrases
• Accuracy test using cosine similarity to determine whether the model
answers correctly with phrases
• i.e. vec(X) = vec(“Microsoft”) – vec(“Steve Ballmer”) + vec(“Larry Page”)
Correct if vec(“Google”) is the closest vector to vec(X)
• Model: skip-gram (word-embedding dimension = 300)
• Data set: news articles (Google dataset with 1 billion words)
• Compared methods (with or without 10^-5 subsampling)
• NEG-5
• NEG-15
• HS-Huffman
Learning Phrases
24/31
• Empirical results
• NEG-15 achieves better performance than NEG-5
• HS becomes the best performing method when subsampling is used
• This shows that subsampling can result in faster training and can
also improve accuracy, at least in some cases
• With a 33-billion-word training set and d = 1000 → 72% accuracy (6B → 66%)
• The amount of training data is crucial!
Learning Phrases
25/31
• Simple vector addition (on the skip-gram model)
• Previous experiments showed analogical reasoning (A + B − C)
• The vector values are related logarithmically to the probabilities
→ the sum of two vectors is related to the product of the two context distributions
• Interesting!
Additive Compositionality
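A quick way to play with this, reusing the `vectors` object from the earlier analogy sketch; the example pairs follow the paper, and the exact neighbors depend on the model used:

# Additive compositionality: nearest neighbors of a sum of two word vectors.
for pair in (["Czech", "currency"], ["Vietnam", "capital"], ["Russian", "river"]):
    print(pair, vectors.most_similar(positive=pair, topn=3))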
26/31
• Contributions
• Showed the detailed process of training distributed
representations of words and phrases
• Sub-sampling makes the model more accurate and faster to train
than the previous word2vec model
• Negative Sampling: extremely simple and accurate for
frequent words (for infrequent tokens such as phrases, HS was better)
• Word vectors can be combined meaningfully by simple vector addition
• Released the code and dataset as an open-source project
Conclusion
27/31
• Comparison to other neural network models
<Finding the most similar word>
• The skip-gram model trained on a large corpus outperforms all
models from the other papers.
Conclusion
28/31
• Very interesting model
• Simple, short paper
• Easy to read
• Hard to understand the details
• In HS, the way the tree is constructed
• Several heuristic methods
• Pre-processing such as eliminating stop-words
Speaker’s Opinion
29/31
• Papers
• Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint
arXiv:1301.3781 (2013).
• Morin, Frederic, and Yoshua Bengio. "Hierarchical probabilistic neural network language model." Proceedings of
the international workshop on artificial intelligence and statistics. 2005.
• Guthrie, David, et al. "A closer look at skip-gram modelling." Proceedings of the 5th international Conference on
Language Resources and Evaluation (LREC-2006). 2006.
• Rong, Xin. "word2vec Parameter Learning Explained." arXiv preprint arXiv:1411.2731 (2014).
• Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-
embedding method." arXiv preprint arXiv:1402.3722(2014).
• Collobert, Ronan, et al. "Natural language processing (almost) from scratch." The Journal of Machine Learning
Research 12 (2011): 2493-2537.
• Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003):
1137-1155.
• Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating
errors." Cognitive modeling 5 (1988): 3.
• Websites & Courses
• Richard Socher, CS224d: Deep Learning for Natural Language Processing (http://cs224d.stanford.edu/)
• http://alexminnaar.com/word2vec-tutorial-part-i-the-skip-gram-model.html
• http://nohhj.blogspot.kr/2015/08/word-embedding.html
• https://yinwenpeng.wordpress.com/category/deep-learning-in-nlp/
• http://rare-technologies.com/word2vec-tutorial/
• https://code.google.com/p/word2vec/source/browse/trunk/word2vec.c?spec=svn42&r=42#482
• https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors
References
30/31
? = Word2vec(“Slide” + “End”)
End
31/31

Editor's Notes

  • #16 It seems to propose a new notion, but it is an old one: use the chain rule and evaluate errors at every iteration → show that a probabilistic model scales to large data sets → deep learning can be used for various NLP tasks.
  • #28 Instead of looping over the entire vocabulary, just sample several negative examples! A good model can distinguish bad samples. Build a new objective function that tries to maximize the probability that a (word, context) pair is in the corpus data if it indeed is, and to maximize the probability that it is not in the corpus data if it indeed is not.