- The document discusses neural word embeddings, which represent words as dense real-valued vectors in a continuous vector space. This allows words with similar meanings to have similar vector representations.
- It describes how neural network language models like skip-gram and CBOW can be used to efficiently learn these word embeddings from unlabeled text data in an unsupervised manner. Techniques like hierarchical softmax and negative sampling help reduce computational complexity.
- The learned word embeddings exhibit meaningful syntactic and semantic relationships between words and support analogy and similarity tasks, even though no such supervision was provided during training.
4. Neural Word Embedding
● Continuous vector space representation
  o Words are represented as dense real-valued vectors in R^d.
● Distributed word representation ↔ word embedding
  o Embed an entire vocabulary into a relatively low-dimensional linear space whose dimensions are latent continuous features.
● A classical n-gram model works in terms of discrete units
  o There is no inherent relationship between n-grams.
● In contrast, word embeddings capture regularities and relationships between words (illustrated in the sketch below).
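For illustration, a minimal numpy sketch with invented 4-dimensional vectors (real embeddings are learned from text and typically have 100-1000 dimensions); related words end up with high cosine similarity:

```python
import numpy as np

# Toy dense word vectors (invented values; a real model learns these).
vec = {
    "king":  np.array([0.8, 0.3, 0.1, 0.9]),
    "queen": np.array([0.7, 0.4, 0.1, 0.8]),
    "apple": np.array([0.1, 0.9, 0.7, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: near 1.0 for similar directions, lower otherwise."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec["king"], vec["queen"]))  # high: related words
print(cosine(vec["king"], vec["apple"]))  # lower: unrelated words
```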
5. Syntactic & Semantic Relationships
Regularities are observed as a constant offset vector between pairs of words sharing the same relationship (probed in the sketch below).
● Gender relation: KING − QUEEN ≈ MAN − WOMAN
● Singular/plural relation: KING − KINGS ≈ QUEEN − QUEENS
● Other relations:
  o Language: France − French ≈ Spain − Spanish
  o Past tense: Go − Went ≈ Capture − Captured
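The offset regularity can be probed with simple vector arithmetic; a sketch, assuming `vec` is a trained embedding lookup such as the toy dict above:

```python
import numpy as np

def analogy(vec, a, b, c):
    """Solve a : b :: c : ? by the nearest neighbor of vec[b] - vec[a] + vec[c]."""
    target = vec[b] - vec[a] + vec[c]
    candidates = [w for w in vec if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: (vec[w] @ target) /
                             (np.linalg.norm(vec[w]) * np.linalg.norm(target)))

# With well-trained embeddings: analogy(vec, "man", "king", "woman") -> "queen"
```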
8. Language Models (LM)
● Different models exist for estimating continuous representations of words:
  ○ Latent Semantic Analysis (LSA)
  ○ Latent Dirichlet Allocation (LDA)
  ○ Neural Network Language Model (NNLM)
9. Feed-Forward NNLM
● Consists of input, projection, hidden and output layers.
● The N previous words are encoded using 1-of-V coding, where V is the size of the vocabulary. Ex: A = (1,0,...,0), B = (0,1,...,0), ..., Z = (0,0,...,1) in R^26 (see the sketch below).
● The NNLM becomes computationally complex between the projection (P) and hidden (H) layers
  ○ For N = 10, size of P = 500-2000, size of H = 500-1000.
  ○ The hidden layer is used to compute a probability distribution over all the words in the vocabulary V.
● Hierarchical softmax to the rescue.
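In code, 1-of-V coding followed by the projection layer is just a row lookup in the projection matrix; a sketch with illustrative sizes:

```python
import numpy as np

V, d = 26, 5                      # vocabulary size, embedding dimensionality
P = np.random.randn(V, d)         # projection matrix (one d-dim row per word)

def one_hot(i, V):
    """1-of-V coding: all zeros except a single 1 at position i."""
    x = np.zeros(V)
    x[i] = 1.0
    return x

i = 2                             # index of some word in the vocabulary
assert np.allclose(one_hot(i, V) @ P, P[i])  # 1-of-V coding == row lookup
```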
10. Recurrent NNLM
● No projection layer; consists of input, hidden and output layers only.
● No need to specify the context length, unlike the feed-forward NNLM.
● What is special in the RNN model?
  ○ A recurrent matrix that connects the hidden layer to itself.
  ○ This allows the network to form a short-term memory
    ■ Information from the past is represented by the hidden layer.
● RNN-embedded vectors achieved state-of-the-art results on a relational similarity identification task.
[Figure: the RNN model]
11. Recurrent NNLM
● w(t): input word at time t
● y(t): output layer; produces a probability distribution over words
● s(t): hidden layer
● U: each column represents a word
[Figure: four-gram neural net language model architecture (Bengio 2001)]
● The RNN is trained with SGD and backpropagation to maximize the log likelihood (the recurrence is sketched below).
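A minimal numpy sketch of the recurrence (the names U, W, V loosely follow the slide; the exact nonlinearities vary across implementations):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, h = 26, 10                       # vocabulary size, hidden size
U = rng.normal(size=(h, V_size))         # input-to-hidden (columns ~ words)
W = rng.normal(size=(h, h))              # recurrent matrix: hidden -> hidden
V = rng.normal(size=(V_size, h))         # hidden-to-output

def step(w_t, s_prev):
    """One RNN-LM step: new hidden state and distribution over next words."""
    s_t = 1.0 / (1.0 + np.exp(-(U @ w_t + W @ s_prev)))    # sigmoid
    z = V @ s_t
    y_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # softmax
    return s_t, y_t

s = np.zeros(h)                          # empty short-term memory
w = np.zeros(V_size)
w[3] = 1.0                               # 1-of-V input word
s, y = step(w, s)                        # y: prob. dist. over the vocabulary
```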
12. Bringing Efficiency
● The computational complexity of NNLMs is high.
● Removing the hidden layer yields up to a 1000x speed-up:
  ○ Continuous bag-of-words model
  ○ Continuous skip-gram model
● The full softmax can be replaced by:
  ○ Hierarchical softmax (Morin and Bengio)
  ○ Hinge loss (Collobert and Weston)
  ○ Noise contrastive estimation (Mnih et al.)
13. Continuous Bag-of-Words Model (CBOW)
● Predicts the current word based on the context.
● The non-linear hidden layer is removed.
● The projection layer is shared for all words (not just the projection matrix).
● All words get projected into the same position (their vectors are averaged).
● Naming reason: the order of words in the history does not influence the projection.
● Best performance was obtained with a log-linear classifier taking four future and four history words at the input (see the sketch below).
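A sketch of the CBOW forward pass under these simplifications (toy sizes; a full softmax is used here for clarity, even though the real model replaces it for efficiency):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 1000, 100
P = rng.normal(size=(V, d))       # shared input projection (one row per word)
O = rng.normal(size=(V, d))       # output word vectors

def cbow_probs(context_ids):
    """Predict the current word from averaged context vectors."""
    h = P[context_ids].mean(axis=0)   # all context words share one position
    z = O @ h
    return np.exp(z - z.max()) / np.exp(z - z.max()).sum()

p = cbow_probs([5, 17, 42, 99])   # four history + four future words in practice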
14. Continuous Skip-gram Model
● Predicts the surrounding words given the current word.
● Objective: maximize classification of a word based on another word in the same sentence, i.e. maximize the average log probability over the training words.
● Defines p(w_{t+j} | w_t) using the softmax function (both formulas are given below).
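The two formulas, as given in Mikolov et al. (2013), are the average log probability

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

and the basic skip-gram softmax

$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}$$

where $v_w$ and $v'_w$ are the input and output vector representations of $w$, $c$ is the context window size, and $W$ is the number of words in the vocabulary.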
16. Hierarchical Softmax for Efficient Computation
● The full-softmax formulation is impractical because the cost of computing ∇ log p(w_O | w_I) is proportional to W, which is often large (10^5 to 10^7 terms).
● With hierarchical softmax, the cost is reduced to roughly log(W) evaluations per word.
17. Hierarchical Softmax
● Uses a binary tree (Huffman code) representation of the output layer, with the W words as its leaves.
  o A random walk assigns probabilities to words.
● Instead of evaluating W output nodes, only about log(W) nodes are evaluated to calculate the probability distribution.
● Each word w can be reached by an appropriate path from the root of the tree (the resulting probability is given below):
  o n(w, j): the j-th node on the path from the root to w
  o L(w): the length of this path
  o n(w, 1) = root and n(w, L(w)) = w
  o ch(n): an arbitrary fixed child of an inner node n
  o [x] = 1 if x is true and [x] = −1 otherwise
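With this notation, the hierarchical softmax of the paper defines

$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\Big( \big[ n(w, j{+}1) = \mathrm{ch}(n(w,j)) \big] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \Big), \qquad \sigma(x) = \frac{1}{1+e^{-x}}$$

so each prediction costs one sigmoid per node on the path from the root to w.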
18. Negative Sampling
● Noise contrastive estimation (NCE)
  o A good model should be able to differentiate data from noise by means of logistic regression.
  o An alternative to the hierarchical softmax.
  o Introduced by Gutmann and Hyvärinen and applied to language modeling by Mnih and Teh.
● NCE approximates the log probability of the softmax.
● Negative sampling is defined by an objective that replaces log p(w_O | w_I) in the skip-gram formulation (given below).
● Task: distinguish the target word w_O from draws from the noise distribution.
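The negative-sampling objective that replaces each log p(w_O | w_I) term, as defined in the paper, is

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$

with k negative samples drawn from a noise distribution P_n(w); the paper reports that the unigram distribution raised to the 3/4 power works best.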
19. Subsampling of Frequent Words
● The most frequent words provide less information than rare words:
  o Co-occurrences of "France" and "Paris" are informative.
  o Co-occurrences of "France" and "the" are less informative.
● A simple subsampling approach counters this imbalance: each word w_i in the training set is discarded with probability

  P(w_i) = 1 − sqrt(t / f(w_i)),

  where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically around 10^−5 (sketched in code below).
● This aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies.
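A sketch of the discard rule (the relative frequencies are invented for illustration):

```python
import math
import random

t = 1e-5
freq = {"the": 0.05, "france": 1e-4, "paris": 5e-5}   # toy relative frequencies

def keep(word):
    """Discard with probability 1 - sqrt(t / f(w)); words with f(w) <= t always survive."""
    p_discard = max(0.0, 1.0 - math.sqrt(t / freq[word]))
    return random.random() >= p_discard

tokens = ["the", "france", "paris", "the", "the"]
kept = [w for w in tokens if keep(w)]   # "the" is dropped most of the time
```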
21. Automatic Learning by the Skip-gram Model
● No supervised information about what a capital city means is given.
● But the model is still capable of:
  o Automatic organization of concepts
  o Learning implicit relationships
[Figure: PCA projection of 100-dimensional skip-gram vectors]
23. Learning Phrases
● To learn phrase vectors:
  o First find words that appear frequently together, and infrequently in other contexts.
  o Replace them with unique tokens. Ex: "New York Times" -> New_York_Times
● Phrases are formed based on the unigram and bigram counts; the discounting coefficient δ prevents forming too many phrases consisting of very infrequent words (the score is given below).
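The score used to form phrases, as given in the paper, is

$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$

Bigrams scoring above a chosen threshold are merged into single tokens, and the paper runs 2-4 passes over the data with decreasing thresholds so that longer phrases can form.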
25. Phrase Skip-gram Results
● Accuracies of the skip-gram models on the phrase analogy dataset:
  o Using different hyperparameters.
  o Models trained on approximately one billion words from the news dataset.
● The size of the training data matters:
  o HS-Huffman (dimensionality = 1000) trained on 33 billion words reaches an accuracy of 72%.
26. Additive Compositionality
● It is possible to meaningfully combine words by an element-wise addition of their vector representations.
  ○ A word vector represents the distribution of the contexts in which the word appears.
● Vector values are related logarithmically to the probabilities computed by the output layer.
  ○ The sum of two word vectors is therefore related to the product of the two context distributions.
29. Comments
● The reduction of computational complexity is impressive.
● Works with unsupervised/unlabelled data.
● The vector representation can be extended to larger pieces of text:
  o Paragraph Vector (Le and Mikolov, 2014)
● Applicable to many NLP tasks:
  o Tagging
  o Named entity recognition
  o Translation
  o Paraphrasing
Note: Neg-k denotes negative sampling with k negative samples.
The vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as an AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability. Thus, if "Volga River" appears frequently in the same sentence together with the words "Russian" and "river", the sum of these two word vectors will result in a feature vector that is close to the vector of "Volga River".
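A minimal sketch of this additive composition (assuming `vec` is a trained embedding lookup such as the toy dict used earlier, and that a phrase token like Volga_River exists in the vocabulary):

```python
import numpy as np

def nearest(vec, query, exclude=()):
    """Word whose vector has the highest cosine similarity to `query`."""
    return max((w for w in vec if w not in exclude),
               key=lambda w: (vec[w] @ query) /
                             (np.linalg.norm(vec[w]) * np.linalg.norm(query)))

# With trained phrase embeddings, the expectation is:
# nearest(vec, vec["russian"] + vec["river"]) -> "Volga_River"
```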