Science in Data Science
Science in Text Mining
by
Tanay Chowdhury
Data Scientist, Zurich North America
Word2vec model
• Creates a vector representation of every word, starting from a one-hot encoding of the vocabulary.
• The displacement vector (the vector between two word vectors) describes the relation between the two words.
• Comparing displacement vectors helps find similarity, e.g. v_queen − v_king ≈ v_woman − v_man.
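A minimal sketch of this analogy test using the gensim library (assuming gensim >= 4.x; the toy corpus and variable names are illustrative only, and a corpus this small will not produce a reliable analogy):

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences (illustrative only)
sentences = [
    ["the", "king", "rules", "the", "land"],
    ["the", "queen", "rules", "the", "land"],
    ["a", "man", "and", "a", "woman", "walk"],
]

# train a small skip-gram model (sg=1); gensim >= 4 uses vector_size
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# analogy query: vector(king) - vector(man) + vector(woman) should be closest to "queen"
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))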
Skip gram model
It’s a simple neural network with a single hidden layer.
• For a training sequence of words w_1, w_2, . . . , w_T, the correct output word w_O for an input word w_I is found by minimizing a loss function (sketched below).
• The weights are updated with stochastic gradient descent.
• The error is then backpropagated to the input vectors using the learning rate.
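A sketch of the standard formulation (notation assumed here: h = v_{w_I} is the hidden layer, v'_w the output vector of word w, V the vocabulary size, η the learning rate):

$$E = -\log p(w_O \mid w_I) = -\mathbf{v}^{\prime\top}_{w_O}\mathbf{h} + \log\sum_{w=1}^{V}\exp\big(\mathbf{v}^{\prime\top}_{w}\mathbf{h}\big)$$

The stochastic-gradient update for each output vector, with prediction error e_w = y_w − t_w (y_w the softmax output, t_w = 1 only for w = w_O), is

$$\mathbf{v}'_{w} \leftarrow \mathbf{v}'_{w} - \eta\, e_w\, \mathbf{h},$$

and backpropagating to the input vector gives

$$\mathbf{v}_{w_I} \leftarrow \mathbf{v}_{w_I} - \eta \sum_{w=1}^{V} e_w\, \mathbf{v}'_{w}.$$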
Skip gram model
If we now re-compute the probability of each word given the input word queen, we see that the probability of king given queen goes up.
(Figure: softmax regression, weight update, and backpropagation in the skip-gram network.)
Example available at : https://github.com/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb
CBOW model
• If the context C comprises multiple words, instead of copying a single input vector we take the mean of the context words' input vectors as our hidden layer (see the sketch below).
• The update functions remain the same, except that the input-vector update must be applied to each word in the context C.
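A minimal sketch of the CBOW hidden layer in numpy (the matrix shapes and context indices are hypothetical):

import numpy as np

# hypothetical setup: V words in the vocabulary, N-dimensional embeddings
V, N = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))   # input (word) vectors, one row per word

# indices of the context words C around the target word
context = [2, 5, 7]

# CBOW hidden layer: the mean of the context words' input vectors
h = W_in[context].mean(axis=0)   # shape (N,)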
Softmax
• Softmax regression (or multinomial logistic regression) is a generalization
of logistic regression to the case where we want to handle multiple
classes.
• In text classification, every word in the vocabulary is a label, making it a multiclass problem.
• Hence, using softmax regression we can compute the posterior probability P(w_O | w_I), as in the formula below.
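Concretely, with input vector v_{w_I}, output vectors v'_w, and vocabulary size V, the posterior is the usual softmax:

$$P(w_O \mid w_I) = \frac{\exp\big(\mathbf{v}^{\prime\top}_{w_O}\mathbf{v}_{w_I}\big)}{\sum_{w=1}^{V}\exp\big(\mathbf{v}^{\prime\top}_{w}\mathbf{v}_{w_I}\big)}$$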
Hierarchical Softmax
• It consists of a binary Huffman tree built from word frequencies, with each word w_O as a leaf.
• The probability of a word is the product of the probabilities along the path from the root to its leaf, so the result is normalized by construction.
• One N-way normalization is replaced by a sequence of O(log N) local (binary) normalizations, as it is all about traversing the subtree with the higher probability.
• The ⟦·⟧ term in the path formula (shown below) makes the result +1 or −1, depending on which child is taken at each inner node.
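The standard path-probability formula (following Mikolov et al. 2013, with n(w, j) the j-th node on the path to w, L(w) the path length, ch(n) a fixed child of n, and σ the sigmoid):

$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\Big([\![\, n(w, j{+}1) = \mathrm{ch}(n(w, j)) \,]\!]\cdot \mathbf{v}^{\prime\top}_{n(w,j)}\mathbf{v}_{w_I}\Big),$$

where ⟦x⟧ is +1 if x is true and −1 otherwise.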
Negative Sampling
• The objective is to distinguish the target word from draws from a noise distribution using logistic regression, with k negative samples for each data sample (the objective is sketched after the references below).
• References
• Gutmann and Hyvärinen (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.
• Mnih and Teh (2012). A fast and simple algorithm for training neural probabilistic language models.
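The standard negative-sampling objective (again following Mikolov et al. 2013, with P_n(w) the noise distribution and σ the sigmoid), maximized per training pair (w_I, w_O):

$$\log \sigma\big(\mathbf{v}^{\prime\top}_{w_O}\mathbf{v}_{w_I}\big) + \sum_{i=1}^{k}\mathbb{E}_{w_i \sim P_n(w)}\Big[\log \sigma\big(-\mathbf{v}^{\prime\top}_{w_i}\mathbf{v}_{w_I}\big)\Big]$$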
Vectorizing paragraphs
• The Paragraph Vector model attempts to learn fixed-length continuous representations from variable-length pieces of text.
• Some of the attempts made in that direction are paragraph2vec, vector averaging, and clustering.
• Reference : Le & Mikolov (2014)
Paragraph2vec
• At each step, h is computed by concatenating or averaging a paragraph vector d with a context of word vectors C.
• The weight update functions are the same as in Word2Vec, except that the paragraph vectors also get updated.
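A minimal sketch with gensim's Doc2Vec implementation (assuming gensim >= 4.x; the toy documents and tags are illustrative only):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# each paragraph gets a tag so that it receives its own paragraph vector
docs = [
    TaggedDocument(words=["the", "king", "rules", "the", "land"], tags=["d0"]),
    TaggedDocument(words=["a", "man", "and", "a", "woman", "walk"], tags=["d1"]),
]

# dm=1 selects the distributed-memory model (paragraph vector combined with context word vectors)
model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=50)

print(model.dv["d0"])                          # paragraph vector learned during training
print(model.infer_vector(["queen", "rules"]))  # vector inferred for an unseen paragraph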
Paragraph2vec
• Vector representation of a paragraph with sentences of different lengths.
(Figure: an example paragraph containing 17 different words.)
Paragraph2vec
(Figure: a paragraph vector of dimension 17×3, a document vector of dimension 5×3, softmax regression, and weight update.)
Vector Averaging
• Vector averaging is the simplest technique to get a vector representation of a sentence.
• It sums all the word vectors and divides by the number of words.
• It is considered the fastest approach, but not very accurate.
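A minimal sketch of vector averaging (wv can be any word-to-vector mapping, e.g. a trained model's keyed vectors; names are illustrative):

import numpy as np

def average_vector(words, wv):
    # keep only words that actually have a vector
    vectors = [wv[w] for w in words if w in wv]
    if not vectors:
        return None                      # no known words in this sentence
    return np.mean(vectors, axis=0)      # sum of word vectors / number of words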
Clustering method
• It groups the words into a predetermined number of clusters.
• Synonymous words (mainly adjectives) end up in the same cluster.
• Rather than a bag of words, it then uses a bag of centroids (counting the occurrences of a document's words in each cluster), as in the sketch below.
• Reference : https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-3-more-fun-with-word-vectors
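A minimal bag-of-centroids sketch using scikit-learn's KMeans (function and variable names are illustrative, not taken from the referenced tutorial):

import numpy as np
from sklearn.cluster import KMeans

def bag_of_centroids(documents, words, word_vectors, n_clusters=10):
    # cluster the word vectors; word_vectors is a (V, N) array aligned with the list `words`
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(word_vectors)
    word_to_cluster = dict(zip(words, labels))

    # one feature vector per document: how many of its words fall into each cluster
    features = np.zeros((len(documents), n_clusters))
    for i, doc in enumerate(documents):          # doc is a list of tokens
        for token in doc:
            if token in word_to_cluster:
                features[i, word_to_cluster[token]] += 1
    return features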
Performance comparison
• Paragraph2vec is generally the most accurate when used as input to a classification algorithm.
• Vector averaging is the fastest, while clustering is considered a decent balance of accuracy and speed.
Questions
