Science in Data Science
Science in Text Mining
by
Tanay Chowdhury
Data Scientist, Zurich North America
Word2vec model
• Creates a vector representation of every word, starting from a one-hot encoding of the vocabulary.
• The displacement vector (the vector between two word vectors) describes the relation between the two words.
• Comparing displacement vectors helps find similarity, e.g. v_queen − v_king ≈ v_woman − v_man.
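A minimal sketch of this analogy test using the gensim library (assuming gensim >= 4.x; the toy corpus and variable names are illustrative only, and a corpus this small will not produce a reliable analogy):

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences (illustrative only)
sentences = [
    ["the", "king", "rules", "the", "land"],
    ["the", "queen", "rules", "the", "land"],
    ["a", "man", "and", "a", "woman", "walk"],
]

# train a small skip-gram model (sg=1); gensim >= 4 uses vector_size
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# analogy query: vector(king) - vector(man) + vector(woman) should be closest to "queen"
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))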
Skip gram model
It’s a simple neural network with a single hidden layer.
• For a training sequence of words w_1, w_2, . . . , w_T, the correct output word w_O for an input word w_I is found by minimizing a loss function (sketched below).
• The weights are updated with stochastic gradient descent.
• The error is then backpropagated to the input vectors using the learning rate.
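A sketch of the standard formulation (notation assumed here: h = v_{w_I} is the hidden layer, v'_w the output vector of word w, V the vocabulary size, η the learning rate):

$$E = -\log p(w_O \mid w_I) = -\mathbf{v}^{\prime\top}_{w_O}\mathbf{h} + \log\sum_{w=1}^{V}\exp\big(\mathbf{v}^{\prime\top}_{w}\mathbf{h}\big)$$

The stochastic-gradient update for each output vector, with prediction error e_w = y_w − t_w (y_w the softmax output, t_w = 1 only for w = w_O), is

$$\mathbf{v}'_{w} \leftarrow \mathbf{v}'_{w} - \eta\, e_w\, \mathbf{h},$$

and backpropagating to the input vector gives

$$\mathbf{v}_{w_I} \leftarrow \mathbf{v}_{w_I} - \eta \sum_{w=1}^{V} e_w\, \mathbf{v}'_{w}.$$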
Skip gram model
If we now re-compute the probability of each word given the input word queen, we see that the probability of king given queen goes up.
(Figure: softmax regression, weight update, and backpropagation in the skip-gram network.)
Example available at : https://github.com/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb
CBOW model
• If the context C comprises multiple words, instead of copying a single input vector we take the mean of the context words' input vectors as our hidden layer (see the sketch below).
• The update functions remain the same, except that the input-vector update must be applied to each word in the context C.
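A minimal sketch of the CBOW hidden layer in numpy (the matrix shapes and context indices are hypothetical):

import numpy as np

# hypothetical setup: V words in the vocabulary, N-dimensional embeddings
V, N = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))   # input (word) vectors, one row per word

# indices of the context words C around the target word
context = [2, 5, 7]

# CBOW hidden layer: the mean of the context words' input vectors
h = W_in[context].mean(axis=0)   # shape (N,)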
Softmax
• Softmax regression (or multinomial logistic regression) is a generalization
of logistic regression to the case where we want to handle multiple
classes.
• In text classification, every word in the vocabulary is a label, making it a multiclass problem.
• Hence, using softmax regression we can compute the posterior probability P(w_O | w_I), as in the formula below.
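Concretely, with input vector v_{w_I}, output vectors v'_w, and vocabulary size V, the posterior is the usual softmax:

$$P(w_O \mid w_I) = \frac{\exp\big(\mathbf{v}^{\prime\top}_{w_O}\mathbf{v}_{w_I}\big)}{\sum_{w=1}^{V}\exp\big(\mathbf{v}^{\prime\top}_{w}\mathbf{v}_{w_I}\big)}$$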
Hierarchical Softmax
• It consists of a binary Huffman tree built from word frequencies, with each word w_O as a leaf.
• The probability of a word is the product of the probabilities along the path from the root to its leaf, so the result is normalized by construction.
• One N-way normalization is replaced by a sequence of O(log N) local (binary) normalizations, as it is all about traversing the subtree with the higher probability.
• The ⟦·⟧ term in the path formula (shown below) makes the result +1 or −1, depending on which child is taken at each inner node.
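The standard path-probability formula (following Mikolov et al. 2013, with n(w, j) the j-th node on the path to w, L(w) the path length, ch(n) a fixed child of n, and σ the sigmoid):

$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\Big([\![\, n(w, j{+}1) = \mathrm{ch}(n(w, j)) \,]\!]\cdot \mathbf{v}^{\prime\top}_{n(w,j)}\mathbf{v}_{w_I}\Big),$$

where ⟦x⟧ is +1 if x is true and −1 otherwise.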
Negative Sampling
• The objective is to distinguish the target word from draws from a noise distribution using logistic regression, with k negative samples for each data sample (the objective is sketched after the references below).
• References
• Gutmann and Hyvärinen (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.
• Mnih and Teh (2012). A fast and simple algorithm for training neural probabilistic language models.
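The standard negative-sampling objective (again following Mikolov et al. 2013, with P_n(w) the noise distribution and σ the sigmoid), maximized per training pair (w_I, w_O):

$$\log \sigma\big(\mathbf{v}^{\prime\top}_{w_O}\mathbf{v}_{w_I}\big) + \sum_{i=1}^{k}\mathbb{E}_{w_i \sim P_n(w)}\Big[\log \sigma\big(-\mathbf{v}^{\prime\top}_{w_i}\mathbf{v}_{w_I}\big)\Big]$$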
Vectorizing paragraphs
• The Paragraph Vector model attempts to learn fixed-length continuous representations from variable-length pieces of text.
• Some of the attempts made in that direction are paragraph2vec, vector averaging, and clustering.
• Reference : Le & Mikolov (2014)
Paragraph2vec
• At each step, h is computed by concatenating or averaging a paragraph vector d with a context of word vectors C.
• The weight update functions are the same as in Word2Vec, except that the paragraph vectors also get updated.
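A minimal sketch with gensim's Doc2Vec implementation (assuming gensim >= 4.x; the toy documents and tags are illustrative only):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# each paragraph gets a tag so that it receives its own paragraph vector
docs = [
    TaggedDocument(words=["the", "king", "rules", "the", "land"], tags=["d0"]),
    TaggedDocument(words=["a", "man", "and", "a", "woman", "walk"], tags=["d1"]),
]

# dm=1 selects the distributed-memory model (paragraph vector combined with context word vectors)
model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=50)

print(model.dv["d0"])                          # paragraph vector learned during training
print(model.infer_vector(["queen", "rules"]))  # vector inferred for an unseen paragraph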
Paragraph2vec
• Vector representation of a paragraph with sentences of different lengths.
(Figure: an example paragraph containing 17 different words.)
Paragraph2vec
(Figure: a paragraph vector of dimension 17×3, a document vector of dimension 5×3, softmax regression, and weight update.)
Vector Averaging
• Vector averaging is the simplest technique to get a vector representation of a sentence.
• It sums all the word vectors and divides by the number of words.
• It is considered the fastest approach, but not very accurate.
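A minimal sketch of vector averaging (wv can be any word-to-vector mapping, e.g. a trained model's keyed vectors; names are illustrative):

import numpy as np

def average_vector(words, wv):
    # keep only words that actually have a vector
    vectors = [wv[w] for w in words if w in wv]
    if not vectors:
        return None                      # no known words in this sentence
    return np.mean(vectors, axis=0)      # sum of word vectors / number of words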
Clustering method
• It groups the words into a predetermined number of clusters.
• Synonymous words (mainly adjectives) end up in the same cluster.
• Rather than a bag of words, it then uses a bag of centroids (counting the occurrences of a document's words in each cluster), as in the sketch below.
• Reference : https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-3-more-fun-with-word-vectors
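A minimal bag-of-centroids sketch using scikit-learn's KMeans (function and variable names are illustrative, not taken from the referenced tutorial):

import numpy as np
from sklearn.cluster import KMeans

def bag_of_centroids(documents, words, word_vectors, n_clusters=10):
    # cluster the word vectors; word_vectors is a (V, N) array aligned with the list `words`
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(word_vectors)
    word_to_cluster = dict(zip(words, labels))

    # one feature vector per document: how many of its words fall into each cluster
    features = np.zeros((len(documents), n_clusters))
    for i, doc in enumerate(documents):          # doc is a list of tokens
        for token in doc:
            if token in word_to_cluster:
                features[i, word_to_cluster[token]] += 1
    return features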
Performance comparison
• Paragraph2vec is generally the most accurate when used as input to a classification algorithm.
• Vector averaging is the fastest, while clustering is considered a decent balance of accuracy and speed.
Questions
