Hyunyoung Lee
Seminar for NLP labs
Word2Vec
Agenda
1. Word Embedding
- Vectorization of Image and Text
2. Word2Vec
- One-hot vector and Co-occurrence matrix for word vector
3. Fundamental
- Basic component of word embedding in a neural net
4. Word Vector in a neural net
5. Word2Vec, CBOW and skip-gram
- Comparing image processing and word vectors in terms of vector representation
6. Glove
1. Word Embedding
- Image vector representation
- The RGB values of every pixel, laid out as Height * Width * RGB, can be written as one row of values,
so it is easy to turn an image into a vector in some space, i.e. RGB space.
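As a minimal sketch (using NumPy; the toy image size is arbitrary), flattening an RGB image into a single vector:

```python
import numpy as np

# A toy 2x2 RGB image: Height * Width * RGB = 2 * 2 * 3 values
image = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.uint8)

# Flatten Height x Width x RGB into a single row vector of length 12
image_vector = image.reshape(-1)
print(image_vector.shape)  # (12,)
```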
1. Word Embedding
- What is a word embedding?
- In NLP tasks, before neural nets,
word vectors were represented by word frequency, e.g. TF-IDF and similar schemes.
With neural nets, there have been several approaches to word vector representation:
- Language modeling and word embedding modeling
2. Word2Vec
One-hot representation
Dim = |V| (|V| is the size of the vocabulary)
- motel
- hotel
If you search for the keyword [Seattle motel], we want the search engine to also match web pages containing
"Seattle hotel".
But Similarity(motel, hotel) = 0, since motel^T * hotel = 0.
If we take the inner product of the two one-hot vectors above, we cannot recover any similarity between the words.
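A minimal NumPy sketch of the problem; the toy vocabulary here is invented for illustration:

```python
import numpy as np

vocab = ["seattle", "motel", "hotel", "cheap", "book", "room"]  # toy vocabulary
V = len(vocab)

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for a word."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

motel, hotel = one_hot("motel"), one_hot("hotel")
print(np.dot(motel, hotel))  # 0.0 -> one-hot vectors carry no notion of similarity
```

Every pair of distinct one-hot vectors is orthogonal, so "motel" looks exactly as unrelated to "hotel" as to any other word.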
2. Word2Vec
Co-occurrence matrix
Let's look at a window-based co-occurrence matrix (window size = 1).
- Example corpus:
- I like deep learning.
- I like NLP.
- I enjoy flying.
Total vocabulary size (|V|) = 8
Vector("I") = [0, 2, 1, 0, 0, 0, 0, 0]
Vector("like") = [2, 0, 0, 1, 0, 1, 0, 0] …
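A minimal sketch that builds such a window-based co-occurrence matrix (window size 1) for the example corpus; the row ordering simply follows first appearance, so it may differ from the ordering used on the slide:

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1

# Build the vocabulary in order of first appearance
tokens = [tok for sent in corpus for tok in sent.split()]
vocab = list(dict.fromkeys(tokens))
index = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within the window, sentence by sentence
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[index[w], index[words[j]]] += 1

print(vocab)             # 8 word types -> |V| = 8
print(cooc[index["I"]])  # row for "I": co-occurs with "like" twice and "enjoy" once
```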
2. Word2Vec
Co-occurrence with SVD
With SVD (Singular Value Decomposition):
- This calculation is expensive and not efficient. For example, for an m * n matrix the cost is O(mn^2).
- SVD-based methods don't scale well to big matrices, and it is hard to incorporate new words
or documents.
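Continuing the toy example, a sketch of reducing the 8 x 8 co-occurrence matrix to dense low-dimensional word vectors with SVD (the matrix below is the window-size-1 co-occurrence matrix of the example corpus; k = 2 is an arbitrary choice):

```python
import numpy as np

# Rows/cols: I, like, enjoy, deep, learning, NLP, flying, .
cooc = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

# Full SVD: cooc = U @ diag(S) @ Vt; its cost grows as O(mn^2), which is the scaling problem above
U, S, Vt = np.linalg.svd(cooc)

# Keep only the top-k singular directions as k-dimensional word vectors
k = 2
word_vectors = U[:, :k] * S[:k]
print(word_vectors.shape)  # (8, 2): one dense 2-d vector per word
```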
3. Fundamental
Feedforward Neural Network (basic neural network)
The output layer's values can be regarded as:
- scores
- probabilities
Backpropagation drives those values toward a maximum or
minimum, depending on the objective.
3. Fundamental
- Embedding layer (an inner product, i.e. a lookup of the input word's vector)
- Intermediate layer(s)
One or more layers that produce an intermediate representation of the input,
for example a hidden layer with a tanh or sigmoid activation function, or an RNN (LSTM, GRU) as in
state-of-the-art neural language models.
- Softmax layer
The final layer, which computes a probability distribution over the words of the whole vocabulary.
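A minimal NumPy sketch of these three pieces; all sizes are made up, and the single-word input is a simplification rather than the exact architecture from the slides:

```python
import numpy as np

V, N, H = 8, 5, 16          # vocabulary size, embedding dim, hidden dim (made-up sizes)
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(V, N))    # embedding layer: one N-dim row per word
W_hidden = rng.normal(size=(N, H))   # intermediate layer weights
W_out = rng.normal(size=(H, V))      # softmax layer weights

def forward(word_id):
    x = W_embed[word_id]                   # embedding lookup (inner product with a one-hot vector)
    h = np.tanh(x @ W_hidden)              # intermediate representation
    scores = h @ W_out                     # one score per vocabulary word
    probs = np.exp(scores - scores.max())  # softmax over the whole vocabulary
    return probs / probs.sum()

print(forward(word_id=2).sum())  # 1.0: a probability distribution over V words
```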
4. Word Vector in a neural net
Language model and word embedding model with a neural net
- The main purpose of a language model is to compute the probability of a sentence or sequence of
words, and the probability of an upcoming word.
The probability of a sequence of m words {W1, … , Wm} is denoted as P(W1, … , Wm).
In practice, each word is conditioned on a window of the n previous words: P(Wt | Wt-1, … , Wt-n+1).
i.e. the probability of a sentence or sequence of words:
P(W1, … , Wm) = P(W1) * P(W2 | W1) * … * P(Wm | W1, … , Wm-1)
The probability of an upcoming word:
P(Wt | W1, … , Wt-1)
- So, a model that computes either of these probabilities is called a language model (LM).
- The chain rule is applied to compute the joint probability of the words in a sentence.
For example, P("its water is so transparent") =
P(its) * P(water | its) * P(is | its water) * P(so | its water is) * P(transparent | its water is so)
Markov assumption:
By the Markov assumption, the probability of the above sentence is approximated as
P(W1, … , Wm) ≈ ∏ P(Wi | Wi-n+1, … , Wi-1)
OR, equivalently for each upcoming word, P(Wi | W1, … , Wi-1) ≈ P(Wi | Wi-n+1, … , Wi-1).
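As a toy illustration of the bigram (n = 2) Markov assumption, with completely invented conditional probabilities:

```python
# Hypothetical bigram probabilities P(word | previous word); values are invented for illustration
bigram_p = {
    ("<s>", "its"): 0.2, ("its", "water"): 0.1, ("water", "is"): 0.3,
    ("is", "so"): 0.05, ("so", "transparent"): 0.02,
}

sentence = ["<s>", "its", "water", "is", "so", "transparent"]

# P(sentence) ~= product of P(Wi | Wi-1) under the Markov assumption
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= bigram_p[(prev, word)]
print(prob)  # 6e-06
```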
4. Word Vector in a neural net
Language model and word embedding model with a neural net
- How do we estimate these probabilities?
In an N-gram based language model, they are estimated from counts.
For example, bigram: P(Wi | Wi-1) = count(Wi-1, Wi) / count(Wi-1)
trigram: P(Wi | Wi-2, Wi-1) = count(Wi-2, Wi-1, Wi) / count(Wi-2, Wi-1)
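A minimal sketch of the count-based bigram estimate, reusing the toy corpus from the co-occurrence example:

```python
from collections import Counter

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
tokens = [tok for sent in corpus for tok in sent.split()]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("like", "I"))  # 2/3: "I" is followed by "like" in two of its three occurrences
```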
4. Word Vector in a neural net
Language model and word embedding model with a neural net
- The first deep neural network architecture for NLP was presented by Bengio et al. (2003) to predict
P(Wt | Wt-1, … , Wt-n+1).
- This model is the prototype of what we now refer to as a word embedding.
There are some issues:
- the softmax layer (a full softmax over the vocabulary is expensive)
- computing power
Classic neural language model (Bengio et al. 2003)
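A simplified NumPy sketch of such an n-gram neural language model: concatenate the embeddings of the n-1 previous words, pass them through a tanh hidden layer, and apply a softmax over the vocabulary. It omits parts of the actual Bengio et al. architecture (direct input-to-output connections, bias terms), and all sizes are made up:

```python
import numpy as np

V, N, H, n = 8, 5, 16, 4                 # vocab size, embedding dim, hidden dim, n-gram order (made up)
rng = np.random.default_rng(0)
C = rng.normal(size=(V, N))              # shared word embedding table
W_h = rng.normal(size=((n - 1) * N, H))  # hidden layer over the concatenated context embeddings
W_o = rng.normal(size=(H, V))            # output layer over the vocabulary

def next_word_probs(context_ids):
    """P(Wt | Wt-1, ..., Wt-n+1) from the n-1 previous word ids."""
    x = C[context_ids].reshape(-1)       # concatenate the n-1 context embeddings
    h = np.tanh(x @ W_h)
    scores = h @ W_o
    e = np.exp(scores - scores.max())
    return e / e.sum()                   # this full-vocabulary softmax is the expensive part

print(next_word_probs([0, 1, 2]).argmax())  # index of the most probable next word
```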
4. Word Vector in a neural net
Language model and word embedding model with a neural net
- A slightly more advanced model than Bengio et al. is the C&W model (Collobert & Weston, 2011).
There are some variations, for example:
- changing the cost function (C&W use a ranking objective instead of a cross-entropy loss)
The C&W model without the ranking objective (Collobert et al. 2011)
4. Word Vector in a neural net
Language model and word embedding model with a neural net
- Another way to get word vectors in a neural net:
- In NLP, pretrained word2vec embeddings are the usual form of transfer learning, but sometimes
we can also train task-specific word vectors inside a neural net built for that task.
5. Word2Vec, CBOW, skip-gram
Distributional similarity based representations
You can get a lot of value by representing a word by means of its neighbors.
This is one of the most successful ideas of modern statistical NLP.
(Example: the word "banking" represented by the words that appear around it.)
5. Word2Vec, CBOW, skip-gram
Google's Word2Vec – CBOW, skip-gram
Goal: a simple (shallow) neural network model,
learning from a corpus of billions of words,
that predicts a word from its neighbors (or vice versa) within
a fixed-size context window.
1. Skip-gram
2. CBOW (continuous bag-of-words)
5. Word2Vec, CBOW, skip-gram
Skip-gram (Mikolov et al. 2013)
Method: predict the neighbors Wt+j given the word Wt.
Maximizes the following average log probability:
(1/T) * Σ_{t=1..T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(Wt+j | Wt)
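A minimal sketch of how (center word, neighbor) training pairs fall out of a fixed-size window; window size 2 is an arbitrary choice:

```python
corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 2

# Skip-gram training pairs: for each center word Wt, emit (Wt, Wt+j) for each neighbor in the window
pairs = []
for sentence in corpus:
    words = sentence.split()
    for t, center in enumerate(words):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(words):
                pairs.append((center, words[t + j]))

print(pairs[:4])  # [('I', 'like'), ('I', 'deep'), ('like', 'I'), ('like', 'deep')]
```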
5. Word2Vec, CBOW, skip-gram
CBOW (Mikolov et al. 2013)
Method: predict the word given its bag of neighbors.
Loss function = -log p(Wt | Wt-c, … , Wt-1, Wt+1, … , Wt+c)
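A minimal NumPy sketch of one CBOW forward pass and its loss; the two matrices correspond to the W_IN / W_OUT embedding layers described just below, and the sizes are made up:

```python
import numpy as np

V, N = 8, 5                              # vocabulary size and embedding dimension (illustrative)
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))           # W_IN: V x N input embedding matrix
W_out = rng.normal(size=(N, V))          # W_OUT: N x V output embedding matrix

def cbow_loss(context_ids, target_id):
    """-log p(target | context), with the context averaged into one hidden vector."""
    h = W_in[context_ids].mean(axis=0)   # average the context word embeddings
    scores = h @ W_out                   # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                 # softmax over the vocabulary
    return -np.log(probs[target_id])

print(cbow_loss(context_ids=[0, 2, 3, 4], target_id=1))
```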
5. Word2Vec, CBOW, skip-gram
Skip-gram & CBOW
W of size V*N (W_IN) and W' of size N*V (W_OUT) are the embedding layers.
N, the shared dimension of these embedding layers, is the dimensionality of the word2vec vectors.
5. Word2Vec, CBOW, skip-gram
Let's see an example of skip-gram.
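A toy end-to-end sketch of skip-gram training with a plain softmax (no negative sampling, hierarchical softmax, or subsampling, unlike the real word2vec tool); the dimensions, learning rate, and single pass over the corpus are all arbitrary:

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = list(dict.fromkeys(tok for s in corpus for tok in s.split()))
index = {w: i for i, w in enumerate(vocab)}
V, N, lr, window = len(vocab), 5, 0.05, 2

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))    # input (center word) embeddings: V x N
W_out = rng.normal(scale=0.1, size=(V, N))   # output (context word) embeddings: V x N

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One pass of plain-softmax skip-gram training, purely illustrative
for sentence in corpus:
    words = sentence.split()
    for t, center in enumerate(words):
        for j in range(-window, window + 1):
            if j == 0 or not (0 <= t + j < len(words)):
                continue
            c, o = index[center], index[words[t + j]]
            h = W_in[c]                      # hidden layer = the center word's vector
            probs = softmax(W_out @ h)       # predicted distribution over context words
            grad = probs.copy()
            grad[o] -= 1.0                   # gradient of -log p(o | c) w.r.t. the scores
            grad_h = W_out.T @ grad          # gradient w.r.t. the center word vector
            W_out -= lr * np.outer(grad, h)  # update output embeddings
            W_in[c] -= lr * grad_h           # update the center word embedding

print(W_in[index["like"]])                   # the learned 5-dimensional vector for "like"
```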
5. Word2Vec, CBOW, skip-gram
Word Analogies with Word2Vec
[king] – [man] + [woman] ≈ [queen]
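A quick way to reproduce this analogy, assuming gensim is installed and the pretrained GoogleNews vectors (about 1.6 GB) can be downloaded:

```python
import gensim.downloader as api

# Load pretrained 300-dimensional word2vec vectors trained on Google News
model = api.load("word2vec-google-news-300")

# vector(king) - vector(man) + vector(woman) should land near vector(queen)
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.71...)]
```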
6. Glove
Global statistics of co-occurrence probability
GloVe visualizations: Company – CEO; Superlatives
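For context, a sketch of how GloVe turns those global statistics into a training signal: the per-pair weighted least-squares term from Pennington et al. (2014), which pushes the dot product of two word vectors toward the log of their global co-occurrence count (this objective comes from the GloVe paper, not from these slides):

```python
import numpy as np

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted least-squares term for one word pair with co-occurrence count x_ij."""
    weight = min((x_ij / x_max) ** alpha, 1.0)    # f(X_ij): damps rare pairs, caps frequent ones
    diff = w_i @ w_j + b_i + b_j - np.log(x_ij)   # w_i . w_j + b_i + b_j - log X_ij
    return weight * diff ** 2

rng = np.random.default_rng(0)
print(glove_pair_loss(rng.normal(size=5), rng.normal(size=5), 0.0, 0.0, x_ij=10.0))
```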
6. Glove
Word2Vec vs GloVe
Reference
Stanford lectures (online)
CS224n: Natural Language Processing with Deep Learning
- Lecture note 1: http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes1.pdf
- Lecture note 2: http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes2.pdf
- Lecture note 5: http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes5.pdf
- Lecture slide 2: http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture2.pdf
CS231n: Convolutional Neural Networks for Visual Recognition
- Lecture note: Neural Networks Part 1: Setting up the Architecture, http://cs231n.github.io/neural-networks-1/
- Lecture note: Linear Classification: Support Vector Machine, Softmax, http://cs231n.github.io/linear-classify/
Blogs
- Sebastian Ruder's blog: http://ruder.io/word-embeddings-1/index.html#fn:2
- Colah's blog: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Papers
- Neural Text Embeddings for Information Retrieval, WSDM 2017 (ACM International Conference on Web Search and Data Mining), Microsoft.
- Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS, 1–9.
