Hyunyoung Lee
Seminar for NLP labs
Word2Vec
Agenda
1. Word Embedding
- Vectorization of Image and Text
2. Word2Vec
- One-hot vector and Co-occurrence matrix for word vector
3. Fundamental
- Basic component of word embedding in a neural net
4. Word Vector in a neural net
5. Word2Vec, CBOW and skip-gram
- Comparing image processing and word vectors in terms of vector representation
6. Glove
1. Word Embedding
- Image vector representation
- The RGB values of every pixel, laid out as Height * Width * RGB, can be written as one row of values,
so it is easy to turn an image into a vector in some space, i.e. RGB space.
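As a minimal sketch (using NumPy; the toy image size is arbitrary), flattening an RGB image into a single vector:

```python
import numpy as np

# A toy 2x2 RGB image: Height * Width * RGB = 2 * 2 * 3 values
image = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.uint8)

# Flatten Height x Width x RGB into a single row vector of length 12
image_vector = image.reshape(-1)
print(image_vector.shape)  # (12,)
```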
1. Word Embedding
- What is a word embedding?
- In NLP tasks, before neural nets,
word vectors were represented by word frequency, e.g. TF-IDF and similar schemes.
With neural nets, there have been several approaches to word vector representation:
- Language modeling and word embedding modeling
2. Word2Vec
One-hot representation
Dim = |V| (|V| is the size of the vocabulary)
- motel
- hotel
If you search for the keyword [Seattle motel], we want the search engine to also match web pages containing
"Seattle hotel".
But Similarity(motel, hotel) = 0, since motel^T * hotel = 0.
If we take the inner product of the two one-hot vectors above, we cannot recover any similarity between the words.
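A minimal NumPy sketch of the problem; the toy vocabulary here is invented for illustration:

```python
import numpy as np

vocab = ["seattle", "motel", "hotel", "cheap", "book", "room"]  # toy vocabulary
V = len(vocab)

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for a word."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

motel, hotel = one_hot("motel"), one_hot("hotel")
print(np.dot(motel, hotel))  # 0.0 -> one-hot vectors carry no notion of similarity
```

Every pair of distinct one-hot vectors is orthogonal, so "motel" looks exactly as unrelated to "hotel" as to any other word.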
2. Word2Vec
Co-occurrence matrix
Let's look at a window-based co-occurrence matrix (window size = 1).
- Example corpus:
- I like deep learning.
- I like NLP.
- I enjoy flying.
Total vocabulary size (|V|) = 8
Vector("I") = [0, 2, 1, 0, 0, 0, 0, 0]
Vector("like") = [2, 0, 0, 1, 0, 1, 0, 0] …
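A minimal sketch that builds such a window-based co-occurrence matrix (window size 1) for the example corpus; the row ordering simply follows first appearance, so it may differ from the ordering used on the slide:

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1

# Build the vocabulary in order of first appearance
tokens = [tok for sent in corpus for tok in sent.split()]
vocab = list(dict.fromkeys(tokens))
index = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within the window, sentence by sentence
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[index[w], index[words[j]]] += 1

print(vocab)             # 8 word types -> |V| = 8
print(cooc[index["I"]])  # row for "I": co-occurs with "like" twice and "enjoy" once
```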
2. Word2Vec
Co-occurrence with SVD
With SVD (Singular Value Decomposition):
- This calculation is expensive and not efficient. For example, for an m * n matrix the cost is O(mn^2).
- SVD-based methods don't scale well to big matrices, and it is hard to incorporate new words
or documents.
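Continuing the toy example, a sketch of reducing the 8 x 8 co-occurrence matrix to dense low-dimensional word vectors with SVD (the matrix below is the window-size-1 co-occurrence matrix of the example corpus; k = 2 is an arbitrary choice):

```python
import numpy as np

# Rows/cols: I, like, enjoy, deep, learning, NLP, flying, .
cooc = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

# Full SVD: cooc = U @ diag(S) @ Vt; its cost grows as O(mn^2), which is the scaling problem above
U, S, Vt = np.linalg.svd(cooc)

# Keep only the top-k singular directions as k-dimensional word vectors
k = 2
word_vectors = U[:, :k] * S[:k]
print(word_vectors.shape)  # (8, 2): one dense 2-d vector per word
```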
3. Fundamental
Feedforward Neural Network (basic neural network)
The output layer's values can be regarded as:
- scores
- probabilities
Backpropagation drives those values toward a maximum or
minimum, depending on the objective.
3. Fundamental
- Embedding layer (an inner product, i.e. a lookup of the input word's vector)
- Intermediate layer(s)
One or more layers that produce an intermediate representation of the input,
for example a hidden layer with a tanh or sigmoid activation function, or an RNN (LSTM, GRU) as in
state-of-the-art neural language models.
- Softmax layer
The final layer, which computes a probability distribution over the words of the whole vocabulary.
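A minimal NumPy sketch of these three pieces; all sizes are made up, and the single-word input is a simplification rather than the exact architecture from the slides:

```python
import numpy as np

V, N, H = 8, 5, 16          # vocabulary size, embedding dim, hidden dim (made-up sizes)
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(V, N))    # embedding layer: one N-dim row per word
W_hidden = rng.normal(size=(N, H))   # intermediate layer weights
W_out = rng.normal(size=(H, V))      # softmax layer weights

def forward(word_id):
    x = W_embed[word_id]                   # embedding lookup (inner product with a one-hot vector)
    h = np.tanh(x @ W_hidden)              # intermediate representation
    scores = h @ W_out                     # one score per vocabulary word
    probs = np.exp(scores - scores.max())  # softmax over the whole vocabulary
    return probs / probs.sum()

print(forward(word_id=2).sum())  # 1.0: a probability distribution over V words
```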
4. Word Vector in a neural net
Language model and word embedding model with a neural net
- The main purpose of a language model is to compute the probability of a sentence or sequence of
words, and the probability of an upcoming word.
The probability of a sequence of m words {W1, … , Wm} is denoted as P(W1, … , Wm).
In practice, each word is conditioned on a window of the n previous words: P(Wt | Wt-1, … , Wt-n+1).
i.e. the probability of a sentence or sequence of words:
P(W1, … , Wm) = P(W1) * P(W2 | W1) * … * P(Wm | W1, … , Wm-1)
The probability of an upcoming word:
P(Wt | W1, … , Wt-1)
- So, a model that computes either of these probabilities is called a language model (LM).
- The chain rule is applied to compute the joint probability of the words in a sentence.
For example, P("its water is so transparent") =
P(its) * P(water | its) * P(is | its water) * P(so | its water is) * P(transparent | its water is so)
Markov assumption:
By the Markov assumption, the probability of the above sentence is approximated as
P(W1, … , Wm) ≈ ∏ P(Wi | Wi-n+1, … , Wi-1)
OR, equivalently for each upcoming word, P(Wi | W1, … , Wi-1) ≈ P(Wi | Wi-n+1, … , Wi-1).
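As a toy illustration of the bigram (n = 2) Markov assumption, with completely invented conditional probabilities:

```python
# Hypothetical bigram probabilities P(word | previous word); values are invented for illustration
bigram_p = {
    ("<s>", "its"): 0.2, ("its", "water"): 0.1, ("water", "is"): 0.3,
    ("is", "so"): 0.05, ("so", "transparent"): 0.02,
}

sentence = ["<s>", "its", "water", "is", "so", "transparent"]

# P(sentence) ~= product of P(Wi | Wi-1) under the Markov assumption
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= bigram_p[(prev, word)]
print(prob)  # 6e-06
```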
4. Word Vector in a neural net
Language model and word embedding model with a neural net
- How do we estimate these probabilities?
In an N-gram based language model, they are estimated from counts.
For example, bigram: P(Wi | Wi-1) = count(Wi-1, Wi) / count(Wi-1)
trigram: P(Wi | Wi-2, Wi-1) = count(Wi-2, Wi-1, Wi) / count(Wi-2, Wi-1)
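A minimal sketch of the count-based bigram estimate, reusing the toy corpus from the co-occurrence example:

```python
from collections import Counter

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
tokens = [tok for sent in corpus for tok in sent.split()]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("like", "I"))  # 2/3: "I" is followed by "like" in two of its three occurrences
```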
4. Word Vector in a neural net
Language model and word embedding model with a neural net
- The first deep neural network architecture for NLP was presented by Bengio et al. (2003) to predict
P(Wt | Wt-1, … , Wt-n+1).
- This model is the prototype of what we now refer to as a word embedding.
There are some issues:
- the softmax layer (a full softmax over the vocabulary is expensive)
- computing power
Classic neural language model (Bengio et al. 2003)
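A simplified NumPy sketch of such an n-gram neural language model: concatenate the embeddings of the n-1 previous words, pass them through a tanh hidden layer, and apply a softmax over the vocabulary. It omits parts of the actual Bengio et al. architecture (direct input-to-output connections, bias terms), and all sizes are made up:

```python
import numpy as np

V, N, H, n = 8, 5, 16, 4                 # vocab size, embedding dim, hidden dim, n-gram order (made up)
rng = np.random.default_rng(0)
C = rng.normal(size=(V, N))              # shared word embedding table
W_h = rng.normal(size=((n - 1) * N, H))  # hidden layer over the concatenated context embeddings
W_o = rng.normal(size=(H, V))            # output layer over the vocabulary

def next_word_probs(context_ids):
    """P(Wt | Wt-1, ..., Wt-n+1) from the n-1 previous word ids."""
    x = C[context_ids].reshape(-1)       # concatenate the n-1 context embeddings
    h = np.tanh(x @ W_h)
    scores = h @ W_o
    e = np.exp(scores - scores.max())
    return e / e.sum()                   # this full-vocabulary softmax is the expensive part

print(next_word_probs([0, 1, 2]).argmax())  # index of the most probable next word
```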
4. Word Vector in a neural net
Language model and word embedding model with a neural net
- A slightly more advanced model than Bengio et al. is the C&W model (Collobert & Weston, 2011).
There are some variations, for example:
- changing the cost function (C&W use a ranking objective instead of a cross-entropy loss)
The C&W model without the ranking objective (Collobert et al. 2011)
4. Word Vector in a neural net
Language model and word embedding model with a neural net
- Another way to get word vectors in a neural net:
- In NLP, pretrained word2vec embeddings are the usual form of transfer learning, but sometimes
we can also train task-specific word vectors inside a neural net built for that task.
5. Word2Vec, CBOW, skip-gram
Distributional similarity based representations
You can get a lot of value by representing a word by means of its neighbors.
This is one of the most successful ideas of modern statistical NLP.
(Example: the word "banking" represented by the words that appear around it.)
5. Word2Vec, CBOW, skip-gram
Google's Word2Vec – CBOW, skip-gram
Goal: a simple (shallow) neural network model,
learning from a corpus of billions of words,
that predicts a word from its neighbors (or vice versa) within
a fixed-size context window.
1. Skip-gram
2. CBOW (continuous bag-of-words)
5. Word2Vec, CBOW, skip-gram
Skip-gram (Mikolov et al. 2013)
Method: predict the neighbors Wt+j given the word Wt.
Maximizes the following average log probability:
(1/T) * Σ_{t=1..T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(Wt+j | Wt)
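A minimal sketch of how (center word, neighbor) training pairs fall out of a fixed-size window; window size 2 is an arbitrary choice:

```python
corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 2

# Skip-gram training pairs: for each center word Wt, emit (Wt, Wt+j) for each neighbor in the window
pairs = []
for sentence in corpus:
    words = sentence.split()
    for t, center in enumerate(words):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(words):
                pairs.append((center, words[t + j]))

print(pairs[:4])  # [('I', 'like'), ('I', 'deep'), ('like', 'I'), ('like', 'deep')]
```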
5. Word2Vec, CBOW, skip-gram
CBOW (Mikolov et al. 2013)
Method: predict the word given its bag of neighbors.
Loss function = -log p(Wt | Wt-c, … , Wt-1, Wt+1, … , Wt+c)
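A minimal NumPy sketch of one CBOW forward pass and its loss; the two matrices correspond to the W_IN / W_OUT embedding layers described just below, and the sizes are made up:

```python
import numpy as np

V, N = 8, 5                              # vocabulary size and embedding dimension (illustrative)
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))           # W_IN: V x N input embedding matrix
W_out = rng.normal(size=(N, V))          # W_OUT: N x V output embedding matrix

def cbow_loss(context_ids, target_id):
    """-log p(target | context), with the context averaged into one hidden vector."""
    h = W_in[context_ids].mean(axis=0)   # average the context word embeddings
    scores = h @ W_out                   # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                 # softmax over the vocabulary
    return -np.log(probs[target_id])

print(cbow_loss(context_ids=[0, 2, 3, 4], target_id=1))
```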
5. Word2Vec, CBOW, skip-gram
Skip-gram & CBOW
W of size V*N (W_IN) and W' of size N*V (W_OUT) are the embedding layers.
N, the shared dimension of these embedding layers, is the dimensionality of the word2vec vectors.
5. Word2Vec, CBOW, skip-gram
Let's see an example of skip-gram.
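A toy end-to-end sketch of skip-gram training with a plain softmax (no negative sampling, hierarchical softmax, or subsampling, unlike the real word2vec tool); the dimensions, learning rate, and single pass over the corpus are all arbitrary:

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = list(dict.fromkeys(tok for s in corpus for tok in s.split()))
index = {w: i for i, w in enumerate(vocab)}
V, N, lr, window = len(vocab), 5, 0.05, 2

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))    # input (center word) embeddings: V x N
W_out = rng.normal(scale=0.1, size=(V, N))   # output (context word) embeddings: V x N

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One pass of plain-softmax skip-gram training, purely illustrative
for sentence in corpus:
    words = sentence.split()
    for t, center in enumerate(words):
        for j in range(-window, window + 1):
            if j == 0 or not (0 <= t + j < len(words)):
                continue
            c, o = index[center], index[words[t + j]]
            h = W_in[c]                      # hidden layer = the center word's vector
            probs = softmax(W_out @ h)       # predicted distribution over context words
            grad = probs.copy()
            grad[o] -= 1.0                   # gradient of -log p(o | c) w.r.t. the scores
            grad_h = W_out.T @ grad          # gradient w.r.t. the center word vector
            W_out -= lr * np.outer(grad, h)  # update output embeddings
            W_in[c] -= lr * grad_h           # update the center word embedding

print(W_in[index["like"]])                   # the learned 5-dimensional vector for "like"
```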
5. Word2Vec, CBOW, skip-gram
Word Analogies with Word2Vec
[king] – [man] + [woman] ≈ [queen]
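A quick way to reproduce this analogy, assuming gensim is installed and the pretrained GoogleNews vectors (about 1.6 GB) can be downloaded:

```python
import gensim.downloader as api

# Load pretrained 300-dimensional word2vec vectors trained on Google News
model = api.load("word2vec-google-news-300")

# vector(king) - vector(man) + vector(woman) should land near vector(queen)
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.71...)]
```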
6. Glove
Global statistics of co-occurrence probability
GloVe visualizations: Company – CEO; Superlatives
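For context, a sketch of how GloVe turns those global statistics into a training signal: the per-pair weighted least-squares term from Pennington et al. (2014), which pushes the dot product of two word vectors toward the log of their global co-occurrence count (this objective comes from the GloVe paper, not from these slides):

```python
import numpy as np

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted least-squares term for one word pair with co-occurrence count x_ij."""
    weight = min((x_ij / x_max) ** alpha, 1.0)    # f(X_ij): damps rare pairs, caps frequent ones
    diff = w_i @ w_j + b_i + b_j - np.log(x_ij)   # w_i . w_j + b_i + b_j - log X_ij
    return weight * diff ** 2

rng = np.random.default_rng(0)
print(glove_pair_loss(rng.normal(size=5), rng.normal(size=5), 0.0, 0.0, x_ij=10.0))
```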
6. Glove
Word2Vec vs GloVe
Reference
Stanford lectures (online)
CS224n: Natural Language Processing with Deep Learning
- Lecture note 1: http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes1.pdf
- Lecture note 2: http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes2.pdf
- Lecture note 5: http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes5.pdf
- Lecture slide 2: http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture2.pdf
CS231n: Convolutional Neural Networks for Visual Recognition
- Lecture note: Neural Networks Part 1: Setting up the Architecture, http://cs231n.github.io/neural-networks-1/
- Lecture note: Linear Classification: Support Vector Machine, Softmax, http://cs231n.github.io/linear-classify/
Blogs
- Sebastian Ruder's blog: http://ruder.io/word-embeddings-1/index.html#fn:2
- Colah's blog: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Papers
- Neural Text Embeddings for Information Retrieval, WSDM 2017 (ACM International Conference on Web Search and Data Mining), Microsoft.
- Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS, 1–9.
