1. A Document Descriptor using Covariance of Word Vectors
Presented by: DORRA EL MEKKI
Marwan Torki, A Document Descriptor using Covariance of Word Vectors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 527–532, Melbourne, Australia, July 15–20, 2018.
4. State-of-the-art methods
Bag-of-Words (BOW)
The bag-of-words model is a simplifying representation used in natural language processing. In this model, a text is
represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.
Latent Semantic Indexing (LSI)
Latent semantic indexing, sometimes referred to as latent semantic analysis, is a mathematical method that finds the hidden relationships between words in order to improve information understanding.
Deep learning methods
The introduction of neural language models using deep learning made it possible to learn word vector representations.
5. The added value
Instead of the interrelationship of words in the text, DoCoV captures the interrelationship between the dimensions of the word embedding, via the covariance matrix elements.
10. The IMDB movie review dataset
25% labelled training instances, 25% labelled test instances, 50% unlabelled training instances.
11. Objectives of the experiment
Objective 1
The DoCoV descriptor can be used with different alternatives for word representations.
Objective 2
Pre-trained models give the best results. This alleviates the need to compute a problem-specific word embedding.
12. Before Training
Step 1
Using the training and unlabelled subsets of the IMDB dataset to obtain different embeddings by setting the number of dimensions to 100, 200 and 300.
Step 2
Using pre-trained GloVe models trained on
wikipedia2014 and Gigaword5.
Step 3
Using the pre-trained word2vec model trained on Google News. We call it Gnews.
13. [Table 1: error-rate performance when changing word-vector dimensionality]
14. Observations
Observation 1: The DoCoV consistently outperforms the Mean vector for different dimensionalities of the word embedding.
Observation 2: The best-performing feature concatenation is DoCoV+BOW. This shows that the concatenation is in fact benefiting from both representations.
Observation 3: In general, the best results are achieved using the available 300-dimensional Gnews word embedding.
16. Conclusion
Generic: which makes it useful for different supervised and unsupervised tasks.
Fixed-length property: which makes it useful for different learning algorithms.
Better performance: against other state-of-the-art methods.
Minimal training: we do not require an encoder-decoder model or gradient descent iterations to be computed.
In computing, a data descriptor is a structure containing information that describes data.
In probability theory and statistics, covariance is a measure of the joint variability of two random variables. When the greater values of one variable mainly correspond to the greater values of the other, the covariance is positive; in the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other (i.e., the variables tend to show opposite behavior), the covariance is negative.
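As a minimal illustration (mine, not the paper's), the sign of the covariance reflects this joint variability:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x         # moves with x
y_down = 12 - 2 * x  # moves against x

print(np.cov(x, y_up)[0, 1])    # positive covariance
print(np.cov(x, y_down)[0, 1])  # negative covariance
```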
Word vectors, or word embeddings: they simply convert words into vectors.
For a word embedding to be good, we require that the vectors carry some meaning.
So if you put "hamburger" and "cheeseburger" into my model, I want those vectors to be close to each other, because they are very related words.
We also want the differences between vectors to carry some meaning.
For example:
king − man + woman ≈ queen
That is, starting from "king", the arithmetic should land near the vector of a related word, here "queen".
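As a toy illustration of this arithmetic (made-up 3-dimensional vectors, not a real embedding, which would have hundreds of dimensions):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

# king - man + woman should be closest to queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```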
In this paper, we will discuss how using the covariance of word vectors may be useful; to that end, I am going to pursue the following plan.
Retrieving documents that are similar to a query using vectors has a long history,
so the added value of this paper is not about representing words as vectors but about using the covariance of word vectors.
To better understand the topic, let's see how some earlier methods modeled documents and queries using vector spaces.
1/The bag-of-words model is a simplifying representation used in natural language processing. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
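A minimal bag-of-words sketch (the scikit-learn tooling is my choice for illustration, not the paper's):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was not great at all"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())  # one word-count vector per document; order is lost
```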
2/Latent semantic indexing, sometimes referred to as latent semantic analysis, is a mathematical method. It finds the hidden (latent) relationships between words (semantics) in order to improve information understanding (indexing).
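A minimal LSI sketch, assuming the usual formulation as a truncated SVD over the bag-of-words matrix (again, scikit-learn is used only for illustration):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cats purr", "kittens purr softly", "dogs bark", "puppies bark loudly"]
bow = CountVectorizer().fit_transform(docs)

# Truncated SVD of the term-document counts: documents about the same
# topic end up close together in the low-dimensional latent space.
lsi = TruncatedSVD(n_components=2, random_state=0)
print(lsi.fit_transform(bow))  # one 2-d latent vector per document
```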
3/The introduction of neural language models using deep learning made it possible to learn word vector representations (word embeddings for simplicity).
Instead of studying the interrelationship of words in the text, DoCoV obtains a fixed-length representation of the paragraph which captures the interrelationship between the dimensions of the word embedding via the covariance matrix elements.
We present our DoCoV descriptor.
First, we define a document observation matrix.
Second, we show how to extract our DoCoV descriptor.
Given a d-dimensional word embedding model and an n-term document, we can define a document observation matrix $O \in \mathbb{R}^{n \times d}$. In the matrix O, a row represents a term in the document and the columns hold the d-dimensional word embedding of that term.
Assume that we have observed n terms of a d-dimensional random variable, giving a data matrix $O \in \mathbb{R}^{n \times d}$:
The rows $x_i = (x_{i1}, x_{i2}, \cdots, x_{id})^T \in \mathbb{R}^d$ denote the i-th observation of a d-dimensional random variable $X \in \mathbb{R}^d$.
The "sample mean vector" of the n observations is the vector $\bar{x}$ of the means $\bar{x}_j$ of the d variables: $\bar{x} = (\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_d)^T \in \mathbb{R}^d$, with $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$.
Given an observation matrix O for a document, we compute the covariance matrix entry for every pair of dimensions (j, k). The matrix $C \in \mathbb{R}^{d \times d}$ is symmetric and is defined entrywise as $C_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$.
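Putting these definitions together, here is a minimal NumPy sketch of computing the DoCoV descriptor for one document. The `embed` lookup (word to d-dimensional vector) is assumed, and keeping the upper triangle of the symmetric matrix as the fixed-length vector is the natural reading of the construction:

```python
import numpy as np

def docov(tokens, embed):
    """Minimal DoCoV sketch; `embed` maps a word to a d-dimensional vector."""
    # Document observation matrix O (n x d): one row per in-vocabulary term.
    O = np.array([embed[w] for w in tokens if w in embed])
    n, d = O.shape
    # Sample mean vector of the n observations.
    x_bar = O.mean(axis=0)
    # Covariance matrix C (d x d) from the centered observations.
    centered = O - x_bar
    C = centered.T @ centered / n
    # C is symmetric, so its d*(d+1)/2 distinct entries (the upper
    # triangle) give a fixed-length descriptor of the document.
    return C[np.triu_indices(d)]
```

For a 300-dimensional embedding this yields a 300·301/2 = 45150-dimensional descriptor, independent of the document length.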
Now we can move to the experimental evaluation. In this part, we show an extensive comparative evaluation of unsupervised paragraph representation approaches.
We evaluate classification performance on the IMDB movie review dataset, using the error rate as the evaluation measure.
The dataset consists of 100K IMDB movie reviews and each review has several sentences. The 100K reviews are divided into three datasets: 25% labelled training instances, 25% labelled test instances and 50% unlabelled training instances. Each review has one label representing the sentiment of it: Positive or Negative. These labels are balanced in both the training and the test set.
The objective is to show that the DoCoV descriptor can be used with different alternatives for word representations.
Also, the experiment shows that pre-trained models give the best results, namely the word2vec model built on Google News. This alleviates the need to compute a problem-specific word embedding.
In some cases there is no available data to construct the word embedding. To illustrate that, we tried different alternatives for word representation, as sketched after the steps below:
We used the training and unlabelled subsets of the IMDB dataset to obtain different embeddings by setting the number of dimensions to 100, 200 and 300.
We used pre-trained GloVe models trained on wikipedia2014 and Gigaword5.
We used the pre-trained word2vec model trained on Google News; we call it Gnews. This model provides word vectors of 300 dimensions for each word.
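A hedged gensim sketch of the three alternatives; the gensim-data model names (`glove-wiki-gigaword-300`, `word2vec-google-news-300`) are stand-ins for the exact files the authors used:

```python
import gensim.downloader
from gensim.models import Word2Vec

# Step 1: train word2vec on the IMDB training + unlabelled reviews.
# `imdb_sentences` (an iterable of tokenized reviews) is assumed here.
# own_wv = Word2Vec(imdb_sentences, vector_size=100).wv  # also 200, 300

# Step 2: pre-trained GloVe (Wikipedia 2014 + Gigaword 5), 300-d.
glove = gensim.downloader.load("glove-wiki-gigaword-300")

# Step 3: pre-trained word2vec on Google News ("Gnews"), 300-d.
gnews = gensim.downloader.load("word2vec-google-news-300")
```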
Word error rate (WER) is a common metric of the performance of a speech recognition or machine translation system.
WER = (S + D + I) / N
where
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
C is the number of correct words,
N is the number of words in the reference (N=S+D+C)
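A direct transcription of the formula into code:

```python
def word_error_rate(S, D, I, C):
    """WER from substitutions S, deletions D, insertions I, correct words C."""
    N = S + D + C          # words in the reference
    return (S + D + I) / N

print(word_error_rate(S=2, D=1, I=1, C=7))  # (2 + 1 + 1) / 10 = 0.4
```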
Error-rate performance when changing word-vector dimensionality.
Table 1 shows the results when using DoCoV computed at different dimensions of word embedding for classification. The table also compares the classification performance when using DoCoV to the performance when using the Mean of the word embeddings as a baseline. We also show the effect of fusing DoCoV with other feature sets. We mainly experiment with the following feature sets, alone and concatenated: DoCoV, Mean, and bag-of-words (BOW).
From the results we can observe the following:
We observe that the DoCoV consistently outperforms the Mean vector for different dimensionalities of the word embedding, regardless of the embedding source.
The best-performing feature concatenation is DoCoV+BOW. This shows that the concatenation is in fact benefiting from both representations (see the sketch below).
In general, the best results are achieved using the available 300-dimensional Gnews word embedding. In the subsequent experiments we will use that embedding, so that we do not need to build a different word embedding for every task at hand.
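A hedged sketch of the DoCoV+BOW experiment: `docov` is the function sketched earlier, `gnews` the Gnews embedding, `train_texts`/`train_labels`/`test_texts`/`test_labels` are the assumed IMDB splits, and the linear SVM is my choice of classifier, not necessarily the authors' setup:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

vectorizer = CountVectorizer()

def features(texts, fit=False):
    # BOW block (sparse) concatenated with the dense DoCoV block.
    bow = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    dcv = csr_matrix(np.array([docov(t.split(), gnews) for t in texts]))
    return hstack([bow, dcv])

clf = LinearSVC().fit(features(train_texts, fit=True), train_labels)
error_rate = 1.0 - clf.score(features(test_texts), test_labels)
print(error_rate)
```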
We presented a novel descriptor to represent text at any level, such as sentences, paragraphs or documents.
Our representation is generic which makes it useful for different supervised and unsupervised tasks.
It has fixed-length property which makes it useful for different learning algorithms.
Also, our descriptor requires minimal training: we do not require an encoder-decoder model or gradient descent iterations to be computed.
Empirically we showed the effectiveness of the descriptor in different tasks. We showed better performance against other state-of-the-art methods in both supervised and unsupervised settings.