WORD2VEC
M.Javad Hasani
Outline
• Goal
• History
• Word Embedding
• Introduction to Word2Vec
• CBOW
• Skip-Gram
• Parameters
• Implementations
• Other use cases
When? Who?
• Word2vec was created in 2013 by a team of researchers led by Tomas Mikolov at Google.
• Embedding vectors created with the Word2vec algorithm have many advantages compared to earlier algorithms such as latent semantic analysis.
Goal: Reconstruct linguistic contexts of words
• CBOW: context words → Word2Vec → target word
• Skip-gram: target word → Word2Vec → context words
Tasks
(WATER − WET) + FIRE = FLAMES
(PARIS − FRANCE) + ITALY = ROME
(WINTER − COLD) + SUMMER = WARM
(KING − MAN) + WOMAN = QUEEN
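The analogies above can be sketched as vector arithmetic followed by a nearest-neighbour search. The toy 2-D embeddings below are hand-picked assumptions for illustration; real word2vec vectors are learned from a corpus:

```python
import numpy as np

# Toy 2-D embeddings, chosen by hand so the analogy works
# (these vectors are assumptions, not learned values).
emb = {
    "king":  np.array([0.9, 0.8]),
    "man":   np.array([0.5, 0.9]),
    "woman": np.array([0.5, 0.1]),
    "queen": np.array([0.9, 0.0]),
}

def analogy(a, b, c, emb):
    """Return the word closest to (a - b) + c by cosine similarity."""
    target = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -2.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target) + 1e-12)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("king", "man", "woman", emb))  # queen
```

With trained embeddings the same arithmetic recovers the other analogies on this slide.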
Why vector space?
Words with similar distributions have similar meanings (the distributional hypothesis).
Vector space: word embeddings
Word embedding
A technique for turning words into numbers that can be used by many machine learning algorithms.
One-hot vector
A simple word representation:
• Vector length is equal to the dictionary size
• Each vector has exactly one non-zero element
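A minimal sketch of one-hot encoding, assuming a hypothetical five-word dictionary:

```python
import numpy as np

# Hypothetical 5-word dictionary for illustration.
vocab = ["fire", "water", "king", "queen", "rome"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector length equals the dictionary size; exactly one element is 1."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("king"))  # [0. 0. 1. 0. 0.]
```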
Types of Word Embeddings
• Frequency based
  – Count Vector
  – TF-IDF Vector
  – Co-Occurrence Vector
• Prediction based
  – CBOW
  – Skip-Gram
What is word2vec?
• Word2vec is a combination of two techniques:
  – CBOW (Continuous Bag of Words)
  – Skip-gram
• Both of these map word(s) to word(s).
• Both learn weights which act as word vector representations.
How it works?
1. Both the input word wi and the output word wj are one-hot encoded into binary vectors x and y of size V.
2. First, multiplying the binary vector x by the word embedding matrix W of size V×N gives us the embedding vector of the input word wi: the i-th row of the matrix W.
3. Multiplying the hidden layer by the word context matrix W′ of size N×V produces a score vector of size V; a softmax over these scores gives the predicted output y.
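The three steps above can be sketched in numpy. The sizes V = 5 and N = 3 and the random matrices are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3                        # vocabulary size and embedding dim (assumptions)
W  = rng.standard_normal((V, N))   # word embedding matrix, V x N
Wp = rng.standard_normal((N, V))   # word context matrix W', N x V

i = 2                              # index of the input word w_i
x = np.zeros(V); x[i] = 1.0        # step 1: one-hot encode the input

h = x @ W                          # step 2: equals the i-th row of W
assert np.allclose(h, W[i])

scores = h @ Wp                    # step 3: scores over the vocabulary
y = np.exp(scores) / np.exp(scores).sum()  # softmax -> probabilities
print(y.sum())                     # approximately 1
```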
Embedding matrix
x × W = v: the one-hot vector x selects one row of W, which is the embedding vector v.
Training Samples
Generated by a sliding window over the text.
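A sketch of generating skip-gram style (target, context) training pairs with a sliding window; the sentence and window size are illustrative assumptions:

```python
def training_pairs(tokens, window=2):
    """Skip-gram style (target, context) pairs from a sliding window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

sent = "the quick brown fox".split()
print(training_pairs(sent, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

For CBOW the same pairs are grouped the other way: all context words in the window predict the single target.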
CBOW (Continuous Bag of Words) vs. Skip-gram
CBOW captures syntactic relations better; skip-gram captures semantic relations better.
Loss Functions
• Full Softmax
• Hierarchical Softmax
• Cross Entropy
• Noise Contrastive Estimation (NCE)
• Negative Sampling (NEG)
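As a sketch of the last item, negative sampling replaces the full softmax with a few binary classifications: maximize σ(v_pos·h) for the true context vector and σ(−v_k·h) for k sampled noise vectors. The vectors below are random assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_loss(h, v_pos, v_negs):
    """Negative-sampling loss for one (target, context) pair:
    -log sigma(v_pos . h) - sum_k log sigma(-v_k . h)."""
    loss = -np.log(sigmoid(v_pos @ h))
    for v_k in v_negs:
        loss -= np.log(sigmoid(-(v_k @ h)))
    return loss

rng = np.random.default_rng(1)
h = rng.standard_normal(3)            # hidden (embedding) vector
v_pos = rng.standard_normal(3)        # true context vector
v_negs = rng.standard_normal((5, 3))  # 5 sampled noise vectors
print(neg_loss(h, v_pos, v_negs))     # a positive scalar
```

Only the positive word and the k noise words get gradient updates, instead of the whole vocabulary.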
Softmax: Full vs. Hierarchical
Parametrization
• Sub-sampling
  – High-frequency words often provide little information.
• Dimensionality
  – Quality of word embeddings increases with higher dimensionality.
  – But after reaching some point, the marginal gain diminishes.
  – Typically, the dimensionality of the vectors is set between 100 and 1,000.
• Context window
  – The recommended value is 10 for skip-gram and 5 for CBOW.
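The sub-sampling bullet can be made concrete with Mikolov et al.'s discard formula, P(w) = 1 − sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a threshold (commonly around 1e-5):

```python
import math

def discard_prob(freq, t=1e-5):
    """Probability of discarding a word with relative frequency `freq`
    (sub-sampling formula: P = 1 - sqrt(t / f), clipped at 0)."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

print(discard_prob(0.01))   # a very frequent word: discarded ~96.8% of the time
print(discard_prob(1e-6))   # a rare word: never discarded (0.0)
```

Frequent words like "the" are aggressively dropped, while rare words are always kept.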
Result
https://ronxin.github.io/wevi/
Variant model classes
• Doc2vec – maps documents to vector space.
• Tweet2vec – handles noisy text and informal language structure (e.g., tweets).
• Item2vec – dealing with item and user similarity is at the heart of many recommendation algorithms.
• Lda2vec – tries to marry the best of both worlds: word2vec and LDA.
Implementations
Thanks