Word Embedding In IR

Master of Science in Informatics at Grenoble
Master Mathématiques Informatique - spécialité Informatique
option Artificial Intelligence and Web
Word Embeddings for Information
Retrieval
Bhaskar Chatterjee
24/06/2016
Research project performed at MRIM, GETALP, LIG Lab
Under the supervision of:
Jean Pierre Chevallet and Christophe Servan, LIG
Defended before a jury composed of:
Prof. James L. Crowley
Prof. Edmond Boyer
Prof. Dominique Vaufreydaz
Prof. Jean-Sebastien Franco
Prof. Laurence Nigay
Prof. Thomas Ropars
Prof. Cyril Labbe
June 2016

Abstract
Recent research in word embeddings learned by deep neural networks has
gained a lot of attention in the natural language processing domain. These word
embeddings not only provide a good word representation but also capture rich sim-
ilarities between words based on their context. This work presents a state of the
art word embedding learning technique word2vec and how to improve textual re-
trieval effectiveness by using rich semantic similarities of words. In addition, we
discuss the usage of word embeddings in language models a state of the art ap-
proach for matching queries and documents and propose some tests to validate the
effectiveness of word embeddings in textual information retrieval.

Contents
1 Introduction 1
2 State of the art 3
2.1 Techniques that attempt to tackle term mismatch . . . . . . . . . . . . . . . . . 4
2.2 Techniques for computation of term relations from raw data . . . . . . . . . . . 5
2.3 Language Modelling approach in IR . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Proposed approach 17
3.1 Retrieval toolkit and Collection used . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Retrieval models and results . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Results comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Conclusion and future work 31
Bibliography 33

1
Introduction
Information Retrieval (IR) concerns about finding the relevant documents from a collection
that matches the users needs according to the users queries. In IR, the relevance is defined as
how well a set of retrieved documents matches the information needs of the user. Relevance
can be concerned with retrieval time or novelty of the result. This type of relevance is called
user relevance. System relevance can be defined as the relevant documents which the retrieval
engine retrieves to satisfy users needs usually against some query. In IR, the fundamental goal
is to maximize the system relevance to match the user relevance. Classical retrieval models
retrieves document based on exact matching between queries and documents. Exact matching
means only the query terms will be matched against the document. Exact matching of the terms
between queries and document are not able capture semantic relationship between the terms.
When users use different query terms than the terms contained in the document which convey
the same meaning, the retrieval will suffer from the term-mismatch problem. For example, a
user is interested in knowing Obama’s visit in Europe and treaties signed. So, user fires the
query ”USA president visit in Europe and political consequences”, but french newspaper con-
sists of documents like ”Obama visits Paris and new laws for import are signed”. Similarly,
English (U.K) newspaper consists of documents like ”London and Washington signed new visa
reforms after Obama’s visit ”. So in both the cases, user will not be able to retrieve these
documents, since none of the query words occur in the document even though ”US President”
signifies ”Obama”, ”Paris, London” signifies Europe and ”Import laws, visa reforms” signifies
”political consequences”.
Several techniques have been proposed to overcome the problem of term mismatch. Some of
the notable techniques are relevance feedback [35], dimension reduction techniques like LDA
[36] and integrating term similarity into retrieval models. It is easier for humans to judge simi-
lar words after seeing the context. By context we mean the sentence or set of terms surrounding
the target term. But for retrieval engines it is a difficult task to guess or understand the context.
Semantic information about relatedness of words can be obtained from an external knowledge
source1. But there are some challenges with these external knowledge sources. Not only these
external knowledge are expensive both in terms of time and money, but also need constant
updating as language evolves over time. Another problem when choosing similar words from
these external resources (which are quite general purpose) sometimes brings noise (words can
be similar but do not relate in the context2). For example, if users use a query which means
1External knowledge can be a thesaurus or external lexical databases like Wordnet
2A context in text usually means the sentence or window of words surrounding the target word.

”health benefits of milk” the relevant documents should include health benefits of milk, maybe
other dairy products like yogurt,cheese etc. but not about ”cows”.
To overcome the short comings of the these external resources we propose to use word2vec
[6], which is a neural network based approach to learn word embeddings. Word embeddings
are language modelling techniques where words are mapped to real numbers in a low dimen-
sional space relative to the vocabulary size so that words show some kind of features (semantic
relatedness, word association etc.) in this low dimensional space. These learned word embed-
dings – as mentioned by the author [6] – contain the semantic similarities for words. Earlier
techniques for computing these word embeddings through dimensionality reduction techniques
like matrix factorization suffered some major problems, first they were really computationally
expensive and secondly even with good hardware (lets say 32 core system with 128 gigabyte of
memory) could take days to complete. Third problem was for very large corpus (roughly 5-10
gigabyte) it was impossible to compute the embeddings. These word embeddings proposed by
Thomas Mikolov [5] are fast to train, for us it took 5 hours to build the knowledge resource on
16 core system with 128 gigabyte of ram. We assume these word embeddings should reduce
the noise while choosing the similar words as these embeddings are trained in the collection
and are learned on context of words.
In this thesis we propose two approaches to exploit the semantic space of the word em-
beddings, one is to use this similarity in a classical probabilistic IR model and for the second
approach we propose a vector space model. Due to time constraints we are not able to build the
system for our second approach but provide our proposition. For our first approach, to integrate
the similarities in the retrieval, we have chosen the ALMasri and chevallet’s [18] probabilistic
language model. Their model is based on extensive document matching where each query term
will be matched against all the documents unlike to classical approach of matching query terms
only with the documents which contain the query term. This thesis will discuss in detail about
the impact of integrating semantic features of neural net based word embeddings in Informa-
tion retrieval. For this we propose a series of experiments to check the usefulness of semantic
space of the word embeddings.
2

2
State of the art
An information retrieval system in most bare bone terms gives a ranked list of documents given
a query from a list in the system that matches the query in best possible way. Queries are ei-
ther short or long depending upon the domain of retrieval (e.g medical and legal domains may
contain long queries whereas web searches contain really short queries). Usually the number
of words contained in a document is much higher than in a query, so queries provide shallow
information and documents contain a lot of information. The fundamental problem in infor-
mation retrieval is how to capture maximum information from this queries and match with
documents which contain the related information. Also in English language a lot of different
words convey similar meanings and most of the time words in queries are similar to different
words in the document. For example, the query :- ”Obama visits Europe” and documents con-
tains sentences like ”The president of united states of America arrived in Germany to discuss
rising environmental concerns”. Even though the query terms are semantically related to the
document, probabilistic methods for retrieval will fail to capture the similarities. This issue,
known as term mismatch, and has an impact on the recall of most information retrieval sys-
tems. Recent research formally deﬁned the term mismatch probability, and showed that on
average a query term mismatches (fails to appear in) 40% to 50% of the documents relevant
to the query [2]. This situation gets worse when there are multiple terms in the query which
did not match with the terms in the document, in this case the number of relevant documents
degrade quite fast. Even when search engines do not require all query terms to appear in result
documents, including a query term that is likely to mismatch relevant documents can still cause
the mismatch problem. The retrieval model will penalize the relevant documents that do not
contain the term, and at the same time favor documents (false positives) that happen to contain
the term but are irrelevant. Since the number of false positives is typically much larger than the
number of relevant documents for a topic, these false positives can appear throughout the rank
list, burying the true relevant results.
This thesis will discuss this vocabulary mismatch problems and our proposed way to overcome
them. There are various techniques that attempt to solve the problem of mismatch.

2.1 Techniques that attempt to tackle term mismatch
2.1.1 Relevance feedback and pseudo relevance feedback
In relevance feedback the user is involved in the retrieval process. The underlining idea of the
relevance feedback is that it is difficult to produce a good query if the user does not have an
idea of the collection. Relevance feedback is an iterative process where a user first gives an
initial query then judges the retrieved document and then reformulates the query for a better
retrieval. In such cases, relevance feedback can be effective in understanding user’s own in-
formation needs. After seeing some documents, the user can refine his own understanding of
the information which he/she is seeking. The Rocchio algorithm [28] is the classic algorithm
for implementing explicit feedback which enables the user to select the relevant documents in
order to reformulate the original query.
Relevance feedback is not effective in real world scenarios where users do not like to iterate
over writing queries and checking documents to have a good query formulation.
Pseudo relevance feedback is also called blind relevance feedback, since it attempts the iter-
ative process of query refinement from local analysis. By local we mean without the knowledge
of external resource or user’s involvement. The method first does an normal retrieval to find an
initial set of relevant documents then top k documents are selected. Then from these documents
terms are selected and query is reformulated. Then this new query is used to retrieve the rele-
vant document. Pseudo relevance feedback had shown to give good results in the past, TREC-3
adhoc competition was dominated by such expansion techniques where it performed quite well
[20]. There also lies some problem with this technique, if the top retrieved documents are not
relevant then the original query’s topic might change to an unintended direction.
2.1.2 Dimension reduction
The main idea behind dimension reduction is minimizing the semantic gap between the query
and document. Some of the notable mentions are stemming, Latent Semantic Indexing(LSI),
Conceptual Indexing. These techniques try to increase the chance that query and document
represent the same topic or concept even when they have different terms. Ahmed [29] performs
a stemming method according to the context of the query which helped them to improve the
accuracy and the performance of their retrieval. Stemming is a process of reducing a word
to its word root or base form. For example, browse or browsing becomes after stemming to
brows, etc. Sometimes after stemming, the original word is lost and sometimes different word
forms take same roots for example ”accusation” and ”accustom” both take the root ”accus”
after stemming done with porter stemmer.
Deerwester [9] proposed to solve the query mismatch by representing query and document
in the latent semantic space. In latent semantic space each term is grouped with its similar
terms. In latent semantic space similar words tend to share the same space. LSI uses singu-
lar value decomposition, a mathematical technique to do matrix factorization on a matrix of
terms and documents. This matrix is very huge, usually the size is number of words in the col-
lection times number of documents. This method is very computationally intensive and quite
impossible to perform on large collections.
The success of dimension reduction techniques depend on the application domain and the
characteristics of the studied collection. Also, reducing the dimension can also result in a very
4

simplified term space that may harm the expressiveness of the language and could relate in
incorrect classification of unrelated term.
2.1.3 Query expansion with external source
To improve the results which specify user’s best interest from the queries, a query is expanded
with additional relevant terms and re-weighting the terms in the expanded query. This way one
can obtain additional relevant documents. One way to get this additional terms is through a the-
saurus, lexical database like wordnet, automatically generated thesaurus, word embeddings etc.
There lies some problem of using a manual thesaurus or a lexical database, since these
vocabulary resources are very expensive to construct in both time and money. Also it is very
difficult to update them since there are new words which are invented every now and then.
Another problem is that it may not exist a vocabulary resource for certain languages. One way
to avoid this problem is by using fast computational methods by which we can infer the term
associations. Automatic generation of thesaurus and word embeddings are two of the ways to
compute the relationship between terms.
2.2 Techniques for computation of term relations
from raw data
2.2.1 Automatic generation of thesaurus
As an alternative to manual thesaurus we can compute an automatic thesaurus in a cost effective
way from the large document set. There are two ways to compute the thesaurus. One by simply
exploiting the word co-occurrence matrix by counting text statistics for similar words. Another
approach is exploiting the grammatical properties of the language by doing some grammatical
analysis to find grammatical relationships. The idea behind such an approach exploits the fact
that words which occur in similar context will have semantic similarities for example grass,
cattle’s, herbivores, milk, red meat etc. will all relate to cows. Thesaurus generated using
word co-occurrences is more robust while thesaurus generated using grammatical relationship
is more accurate. Simplest example for such an thesaurus would be a thesaurus containing a
count of words that follow the target words. This way we will have a probability of word pairs.
2.2.2 Latent semantic indexing and latent dirichlet allocation
Most of the retrieval methods in IR system are based on the assumption of term independence.
For example, the vector space model (VSM) assumes that the documents are embedded in a
mutually orthogonal term space, while probabilistic models, such as the BM25 or the language
model (LM) assume that the terms are sampled independently from documents. Standard ap-
proaches in IR take into account term association in two ways, one which involves a global
analysis over the whole collection of documents (e.g. independent of the queries), while the
other takes into account local co-occurrence information of terms in the top ranked documents
retrieved in response to a query. Our approach is mostly based on the global analysis over
the collection. LDA (Latent Dirichlet Allocation) [10] and LSI (Latent Semantic Indexing)

[9] are some of the approaches that allow us to compute term association over the collection
but they do so at the document level and do not take local context (can be sentence or just n
surrounding words) of words into account rather context in LSI and LDA is the document. In
latent semantic analysis documents are represented in a term space of reduced dimensionality
so as to take into inter term dependencies. In Simple terms LSI takes a term document matrix
or bag of words model and on this matrix does singular value decomposition(SVD) so as to
extract the term dependencies. LDA represents term dependencies by assuming that each term
is generated from a set of latent variables called the topics [10]. One of the issues of using
this kind of techniques is that they take into account term dependencies at document level but
with the word embeddings techniques we can have ﬁne grained relations between words which
taken into account the local window (contexts) of the words.
Another way of computing the thesaurus is by neural networks. This brings us to the ap-
proach of using word embeddings, recent ﬁndings in word embedding learning techniques
especially word2vec [5] [6], which is a shallow two-layer neural networks, that are trained to
reconstruct linguistic contexts of words: the network is shown a word, and must guess which
words occurred in adjacent positions in an input text.
2.2.3 Brief about word embeddings
Word embedding is a technique for representing a word in vector space to extract some fea-
tures by mapping words or phrases from the vocabulary to vectors of real numbers in a low-
dimensional space relative to the vocabulary size. There are various ways to learn this word
embeddings such as by neural networks, dimensionality reduction on the word co-occurrence
matrix etc. Now we will discuss a neural network based technique called word2vec to learn the
embeddings.
Word2vec Word2vec is a set of two algorithms – CBOW and Skipgram – that produces
vector representations of words in latent space of N dimension (N is the size of vectors). The
whole intuition behind word2vec is that words occurring in similar context will have similar
meaning. For example: ”There are a lot of x in park. x are eating grass.” In this sentences from
the context we can easily infer x can be cow, sheep etc. Vector representation of words are
here from a 2003 as proposed by Yoshua Bengio [24], what word2vec provides is a very fast
way to compute vectors which instates the contextual information of the words. Also, qualita-
tively speaking these vector representations according to the author [5] produced good results
on analogy tasks [6], with some linear vector operations the author states that we can get new
relationships for e.g
vector(”Paris”)−vector(”France”)+vector(”Italy”) = vector(”Rome”)
Word2vec uses a single hidden layer, fully connected neural network as shown below. The
neurons in the hidden layer are all linear neurons. The input layer is set to have as many
neurons as there are words in the vocabulary for training. The hidden layer size is set to the
dimensionality of the resulting word vectors. The size of the output layer is same as the input
layer. Figure 1.1 is representative of the neural architecture.
6

Figure 2.1 – A simple CBOW model with only one word in the context [8]
The idea behind the word2vec is that if the network is shown a word, it will try to predict
the next word. The input to the network is encoded using ”1-out of -V” representation where
V is the size of the vocabulary meaning that only one input line is set to one and rest of the
input lines are set to zero. There are two models word2vec use. One is CBOW and other
skipgram for the prediction. CBOW is trained to predict the target word t from the contextual
words that surround it, c, i.e. the goal is to maximize P(t,c) over the training set. Skipgram on
the other hand, predicts the contextual words from target words. Skipgram turns out to learn
ﬁner-grained vectors when one trains over more data. Below are the two models presented
Figure 2.2 – CBOW and Skipgram [5]
The main focus of Mikolov’s paper was the skip gram Model. In his model from a corpus
of words w and context c, the aim is to set the parameter θ in P(c|w;θ) where P(c|w) is the

conditional probability of the word given corpus to maximize the corpus probability
argmax
θ
∏
w∈corpus
∏
c∈C(w)
P(c|w : θ) (2.1)
Here C(w) is the context of the word w . We can also write this equation as
argmax
θ
∏
(w,c)∈D)
P(c|w : θ) (2.2)
where D is the set of all word context pairs from the text.
One way to learn this probability is by modelling using the softmax function which comes from
neural networks.
P(c|w : θ) =
evc.vw
∑c∈C evc.vw
(2.3)
here vc is the vector representation of the context word c and vw is the vector representation of
the word w. C is the set of all available context. Wights θ are vci ,vwi for w ∈ V(vocabulary)
c ∈ C, i ∈ 1,...p (a total of |C|× |V|× |p|). now we need to maximize this equation so taking
log and final equation looks this
maxθ ∏
(w,c∈P)
P(c|w : θ) = ∑
(w,c∈P)
logevc.vw
−log∑
c
evc.vw
(2.4)
from the author[5] equation 2,3 is, vw and vw are input and output vector of the word w. W
is the vocabulary
P(c|w : θ) =
evwo
T .vwI
∑W
w=1 evw
T .vwI
(2.5)
Maximizing the equation 2.4 should result in good vectors or embeddings for vw (∀w ∈ V) as-
suming similar words will have similar vectors.
Equation 2.4 is very computationally expensive as due to this part ∑loge
v
c
.vw
since this part
sums over all possible contexts and there can thousands of contexts for a word.
Mikolov [5] in his paper presented two different cost functions to make it computationally ef-
ficient. One is hierarchical softmax and other is negative sampling. Both are very different
approaches from each other.
In hierarchical softmax only log2 |V| is evaluated in the output layer instead of whole vocabu-
lary as in softmax. The hierarchical softmax uses a binary tree where all the vocabulary words
are at the leaves and every node defines a probability to visit its child node. Concretely each
word can be reached from the root of the tree. Let n(w, j) be the j−th node on the path from the
root to w and let L(w) be the length of this path where n(w,1) = root and n(w,L(w)) = w. For
inner node n, let ch(n) be an any fixed child of n and [x] be 1 if x is true and and -1 otherwise.
The hierarchical softmax defines P(wO|wI) as follows :
P(w|wI) =
L(w)−1
∏
j=1
σ([n(w, j +1) = ch(n(w, j))].vw,j
T
vwI) (2.6)
8

here σ(x) = 1
1+e−x .
Idea for negative sampling is more straight forward instead of updating all the output vec-
tors we update only a sample of them. With negative sampling they replaced the objective
function(equation 2.4) to a new one.
logσ(vwo
T
vwI)+
k
∑
i=1
Ewi Pn(w) logσ(−vwi
T
vwI) .....[5] (2.7)
In his paper Mikolov [5] recommends using the skip-gram model with negative sampling
(SGNS), as it outperformed the other variants on analogy tasks. There are more heuristics taken
into account for example they sub-sampled frequent words with the corresponding formula.
P(w) = 1− t/ f(w) (2.8)
P(wi) represents each word wi in the training set is discarded with probability computed by
the formula above. f(wi) is the frequency of the word and t is the threshold generally around
10−5 . According to the paper this method was chosen heuristically. We choose their method
of getting the term relationships because according to the author vectors contain semantic sim-
ilarities and we in our approach will try to exploit this similarity.
Things to keep in mind Word2vec doesn’t distinguishes between various meanings a word
can take. Each word has only one meaning according to the context and normalised accord-
ingly. We will not explore in this direction anymore since we are not exploring in term disam-
biguation way.
2.3 Language Modelling approach in IR
Language Modelling approach to information retrieval was proposed by Ponte and Croft [1].
This approach models the idea that a document is a good match to a query if the document
model is likely to generate the query, which happens if query words are contained in the doc-
ument. The language modelling approach builds a probabilistic language model Md for a doc-
ument d, and ranks document based on the probability of the model generating the query:
P(q|Md) where q is the query for which documents are to be retrieved.
2.3.1 Unigram, bigram, N-gram models
So what is the meaning of document model generating a query? A traditional generative model
of language, of the kind familiar from formal language theory can be used to recognize or
generate strings. For this we can build probabilities over terms in the document. These prob-
abilities can be independent or conditioned depending upon the model is chosen for eg. the
simplest model unigram throws away all conditioning contexts and computes probabilities of
each term independently. Unigram model for three terms in the query for the document
Punigram(t1t2t3) = P(t1)P(t2)P(t3) (2.9)

If term dependence is taken into account chain rule can be used to decompose the probabil-
ity of a sequence of events into the probability of each successive event conditioned on earlier
events. For example if we want to know how much each term is dependent on the next term
one can use bi-gram modeling
PBi−gram(t1t2t3) = P(t1)P(t2|t1)P(t3|t2) (2.10)
This way we can continue to n terms such an approach can be called as n gram modelling
using conditional probabilities. There are more complex language models based on grammar
of a language called probabilistic context free grammars.
2.3.2 The query likelihood model
In query likelihood model for each document d a language model Md is constructed. The
approach is to rank the document by P(d|q), here the probability of the document is estimated
with the likelihood of the document being relevant to the query.
P(d|q) =
P(q|d)P(d)
P(q)
(2.11)
The P(d|q) is decomposed with bayes rule. Here important thing to note P(d) and P(q) can
be computed before and can be treated as constant. The ranking of the document being relevant
to the query can be estimated as the likelihood that the document will generate that query. So
we can say that query likelihood model attempts to model the process of query generation and
in the process documents are ranked by the probability that a query would be observed as a
random sample from the document model. For this one can use multinomial unigram language
which is same as naive bayes model. With this model each document is considered as a class
which can be thought up like a separate language.
P(q|Md) = Kq ∏
t∈q
P(t|Md) (2.12)
Here Kq is the multinomial coefﬁcient for query q and is constant for the query.
2.3.3 Estimating the query from the document
The probability of estimating the query given the language model (LM) for a document d with
the maximum likelihood model with unigram assumption is
P(q|Md) = ∏
t∈q
Pmle(t|Md) = ∏
t∈q
t ft,d|Ld (2.13)
Here Md is the language model of the document d, t ft,d is the raw count of the term t in the
document and Ld is the length of the document or total number of terms in the document d. The
equation which is presented above is a very classical model for estimating the query generation
probability from the document. The problem with such an approach is that terms are sparse in
the document. It is possible that words that occur in the query might not be in the document
in such a case this model will calculate zero probability to the query estimation even if some
words occur in the document. Clearly such a model is a big problem in information retrieval
both for ranking and matching. We will only get a non-zero value only when all the terms are
10

present in the document, which is not always possible since user doesn’t have exact idea of
the distribution of the words, he/she only has idea of the concept of the document and so all
the query words may not be contained in the document. In order to solve the problem of zero
probability and signiﬁcance of words in the distribution there is a technique called smoothing
[11] which assigns the probability weights to the terms according to their distribution in the
collection.
Smoothing and Extension of language model
The approach for smoothing is that a non occurring term could be possible in the query but its
probability should be close but not be more than its likelihood of occurrence from the whole
the whole collection. That is if t ft,d = 0
P(t|Md) ≤ cft|T (2.14)
Here cft is the total count of the term in the collection and T is the total number of all terms in
the document. In 1980 Jelinek Frederick and Robert L. Mercer [4] proposed a mixture between
a document-speciﬁc multinomial distribution and a mutinomial distribution obtained from the
entire collection. They proposed
P(t|d) = λPmle(t|Md)+(1−λ)Pmle(t|Mc) (2.15)
where λ between 0 and 1 and Mc is the language model of the entire collection. The value
of lambda has a big impact in the performance of the model.
Another smoothing method is based on dirichlet prior’s [7], where a language model is built
from the entire collection as a prior distribution in a Bayesian updating process.
P(t|d) =
t ft,d + µPmle(t|Mc)
Ld + µ
(2.16)
Like the query likelihood model there is another language modelling technique called doc-
ument likelihood model where language model from the query is constructed and probability
of query generating the document is computed. Such an approach is less appalling since there
is lot less text in queries compared to documents so the language model of the query will be
worst computed, and it will need to be smoothed from other language models.
Another way is to make a language model from both document and query and calculate how
different are these two models from each other. John lafferty and zhai [16] in their paper de-
veloped a risk minimization principal for document retrieval. They suggested that the way
to model the risk of returning a document d as relevant to query q is to use the Kullback-
Leibler(KL) divergence between both the models. KL is a divergence measure which measures
how bad the probability Mq is at modeling Md.
R(d;q) = KL(Md|Mq) =
∑t∈V P(t|Mq)logP(t|Mq)
P(t|Md)
(2.17)
Lafferty and Zhai stated in their paper that the model comparison outperformed both query
likelihood and document likelihood models.

2.4 Related work
Language modelling techniques do not take into account the problem of synonymy. Various ap-
proaches have been proposed to tackle this problem by extending the language models. Fabio
Crestani [17] proposed certain frameworks to exploit the term mismatch problem. He pro-
posed that probabilistic retrieval models can be modeled as a dot product between query and
document.
RSV(d,q) = ∑
t∈q
wd(t)∗wq(t) (2.18)
where wd(t) is the weight of the term in document d and wq(t) is the weight of the term in query
q. Crestani used a similarity function Sim. Sim(ti,tj) = 1 that is if ti = tj that means similarity
is maximum(usually for the same term) if Sim(ti,tj) = 0 then ti = tj that means terms don’t
have semantic relation, he then added the function similarity to the above equation. Crestani
does this in two ways, first in the case of mismatch that is ti ∈ q and ti /∈ d then he finds the
term tj from the document which is closest to the word ti in the query that is the maximum
similarity. So extended RSV for document to query becomes
RSVmax(d,q) = ∑
t∈q
Sim(ti,tj)wd(tj)∗wq(ti) (2.19)
Also instead of just calculating the score of maximum similar term he proposed to compute
for all related terms from the document to a non matched query term.
RSVtot(d,q) = ∑
ti∈q
[ ∑
tj∈d
Sim(ti,tj)wd(tj)]∗wq(ti) (2.20)
Taking inspiration from crestani [17] work on similarity, ALMasri and Chevallet [18] pro-
posed a new extended language model to take into account the similarity of the query word to
document word in case of a mismatch. They proposed to modify the document index according
to the query and the external knowledge about the term relations. They expanded the document
d by the query terms which are semantically related to at-least one document term. The idea
here is to maximize the coordination of document and query and in a process maximize the
probability of retrieving the relevant documents for a given query. Below in the figure 2.3 is a
pictorial description of the process.
Formally they mentioned the modified document dq by
dq = d ∩F(q/d,K,d) (2.21)
where d is the original document, K is the knowledge source, F(q/d,K,d) is the transformation
of q/d. K provides the semantic similarity between the terms t and t that is Sim(t,t ). For
unmatched term t in the query they looked fro term t∗ in the document which has the maximum
similarity from the term t.
t∗
= argmaxt ∈dSim(t,t ) (2.22)
Then the occurrences of the query terms t in the modified document dq rely on the occur-
rences of the most similar term freq(t∗,d) then the pseudo occurrences of t is as follows
freq(t,dq) = freq(t∗
;d).Sim(t,t∗
) (2.23)
12

Figure 2.3 – Expand the document d using the knowledge K. [18]
this pseudo occurrences of the term t∗ are then included in the modified document dq. The
translation function F now becomes
F(q/d,K,d) = [t|t ∈ q/d,∃t∗
∈ d,t∗
= argmaxt ∈dSim(t,t ] (2.24)
So now dq becomes
dq = d ∩[t|t ∈ q/d,∃t∗
∈ d,t∗
= argmaxt ∈dSim(t,t ] (2.25)
The length of dq is calculated in following way
|dq| = |d|+ ∑
t∈q/d
freq(t∗
;d)Sim(t,t ) (2.26)
They then used this modified document dq instead of d. They assumed that this new proba-
bility estimation would be more accurate than ordinary language model.
Then they took two language models Dirichlet smoothing and Jelinek -Mercer Smoothing;
extended it with the similarity function and modified document.
Dirichlet Smoothing
Pµ(t|d) =
t ft,d +(µ)Pmle(t|Mc)
|d|+ µ
(2.27)
They extended the above equation to include similarity
Pµ(t|dq) =



t ft,d+(µ)Pmle(t|Mc)
|dq|+µ , if t ∈ d
t ft∗,d.Sim(t,t∗)+(µ)Pmle(t|Mc)
|dq|+µ , if t /∈ d
(2.28)

If all the query terms occur in the document then |dq| = |d|; so the probabilities Pµ(t|dq) =
Pµ(t|d) .
Similar to extending the dirichlet smoothing they extended the Jelinek-Mercer smoothing
model.
Jelinek-Mercer
Pλ (t|d) = (1−λ)P(t|d)+λP(t|C) (2.29)
Extended Jelinek-Mercer Smoothing
Pλ (t|dq) =



(1−λ)freq(t;d)
|dq| +λP(t|C), if t ∈ d
(1−λ)freq(t∗;d).Sim(t,t∗)
|dq| +λP(t|C), if t /∈ d
(2.30)
For the similarity function Sim(t,t ) ALMasri and Chevallet made an assumption that the
term t is semantically related to a term t , if t is a descendant of the term t in the term hier-
archy within an external knowledge K. For the term t from query and t from document from
vocabulary V, they deﬁned Sim(t,t ) as follows, Sim : V ∗V ⇒ [0,1] :
∀t,t ∈ V,0 ≤ Sim(t,t ) ≤ 1 (2.31)
1. Sim(t,t ) = 0, if t and t are not semantically related and t = t .
2. Sim(t,t ) ≤ 1, the t is a descendant of term t in term hierarchy of K and t = t .
3. Sim(t,t ) = 1 if t and t are same that is t = t
The similarity of terms is computed as the inverse of the distance between them.
Sim(t,t ) =
1
distance(t,t )
,distance(t,t ) > 0 (2.32)
For their experiments they used CLEF 1 corpora(medical domain corpora) with UMLS 2
as an external knowledge base. UMLS is a multi-source knowledge base for medical domain.
Instead of using words to index documents they used concepts. UMLS provided them with
concepts and document, query are mapped using MetaMap [14]. They experimented using
X-IOTA [19]. They compared their results with smoothed language models and statistical
translation model based on Jelinek and dirichlet smoothing. ALMasri and Chevallet stated that
their model performed better than the smoothened language models and translation models by
having considerable gain over other language models. Below we present their results
1http://www.clef-initiative.eu/
2 Uniﬁed Medical Language System (http://www.nlm.nih.gov/research/umls/).
14

Figure 2.4 – MAP of Extended Dirichlet smoothing and Extended Jelinek-Mercer smoothing
after integrating concept similarity. The gain is the improvement obtained by their approach
over ordinary language models. † indicates a statistically signiﬁcant improvement in over ordi-
nary language models using Fishers Randomization test with p < 0.05. [18]

3
Proposed approach
So taking motivation from the work of ALMasri and Chevallet [18], we decided to check the
validity of the semantic features of the embeddings in the information retrieval against their
model. We choose their language model as we assume that similarity can be best exploited at
document level instead of adding similar words to query i.e in case of non occurrence of query
words in the document, new word should be chosen from the document which best matches the
query term. This way every document can be evaluated even in case of complete mismatch of
document and query terms. We assume given the good word embeddings even in the case of
total mismatch i.e when none of the query terms occur in the document we can retrieve rele-
vant documents if the documents and query share semantic relatedness in the semantic space.
To put this hypothesis to test we proposed some experiments so to have an understanding of
usefulness of semantic features of the word embeddings in Information Retrieval.
Our second approach is based on the vector space model. In the classic vector space model
proposed by Salton [28], the weights of word in documents are a combination of document
frequency of the term and global(collection) frequency of the term. In the original paper the
documents are represented by vd = (w1,d,w2,d,..wN,d). The importance of each word is repre-
sented by its weight which is computed as
wt,d = t ft,d log
|D|
|d ∈ D|t ∈ d |
(3.1)
Here t ft,d is the term frequency in the document d, log |D|
|d ∈D|t∈d |
is the inverse document
frequency where D represents the collection and d is the document with t as the term. Similarity
of the query q to document dm is calculated by cosine similarity.
Sim(dm,q) =
dm.q
|dm||q|
=
∑N
i=1 wi,mwi,q
∑N
i=1 w2
i,m ∑N
i=1 w2
i,q
(3.2)
Here wi,q and wi,m are terms in query and document respectively.
In our approach we propose a simple weighting scheme where only document frequency are
taken, we choose to omit the id f value of the term as we assume since word vectors are trained
on the whole collection every term has information regarding its occurrence with other term
vectors. Also we propose a different document vector than salton in his paper. We represent
document vector as

dm =
m
∑
i=1
t fi.wi,m (3.3)
The assumption behind such an approach is in semantic space combination of words might
represent an concept which might be closer to the query vector if they represent similar con-
cepts. We can obtain the query vector in a similar way to document vector
qj =
j
∑
i=1
t fi.wi,j (3.4)
Similarity between query and document vector can be exploited in the same way as pre-
sented in Salton’s paper using cosine similarity.
Sim(dm,q) =
dm.qj
|dm||qj|
(3.5)
One of the key differences between the vector space model proposed by salton and ours is
that, in salton’s document vector space each term in the document was the dimension of the
document vector and in our case each dimension of the document vector is the weighted sum
of corresponding dimension of the word vectors. Our document vector length is not dependent
on the number of words contained in the document but rather the dimensions of the word
vector. We get the word embeddings from word2vec. With our model all the vectors whether
it is document or query will all be of same length.
Since we don’t have time to build the system we keep this approach as future proposition.
For our first approach we propose some experiments and the next sections are related to our
first approach only. We start explaining now the corpus we used, the task, the retrieval toolkit
we have used, existing baselines on the task. Then we discuss about our experimental setup.
Lastly the results obtained and conclusion.
3.1 Retrieval toolkit and Collection used
3.1.1 TREC collection
TREC(Text Retrieval Conference) collection consist of three parts: the documents, the ques-
tions or topics and the relevance judgments or ‘right answers’.
TREC documents are distributed on CD-ROMs with approximately 1 Gbyte of text on each ,
compressed to fit into disks. The documents are contained in 5 disks. Disk 1-3 is called the
tipster collection and disk 4-5 is the TREC collection. Below are the description of the docu-
ments in each disks1 .
1. Disk 1: Includes material from the Wall Street Journal (1987, 1988, 1989), the Federal Reg-
ister (1989), Associated Press (1989), Department of Energy abstracts, and Information from
the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis.
2. Disk 2: Includes material from the Wall Street Journal (1990, 1991, 1992), the Federal Reg-
ister (1988), Associated Press (1988) and Information from the Computer Select disks (1989,
1990) copyrighted by Ziff-Davis.
1http://www.nist.gov/tac/data/data desc.html
18

3. Disk 3: Includes material from the San Jose Mercury News (1991), the Associated Press
(1990), U.S. Patents (1983-1991), and Information from the Computer Select disks (1991,
1992) copyrighted by Ziff-Davis.
4. Disk 4: Includes material from the Financial Times Limited (1991, 1992, 1993, 1994), the
Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
5. Disk 5: Includes material from the Foreign Broadcast Information Service (1996) and the
Los-Angeles Times (1989, 1990).
Below are some document statistics
Figure 3.1 – Some data statistics from Disk 1-3 [20]
There is a range of document lengths in the collection, there are short documents like DOE
and very long documents like FR. The range of the document length also varies for example
AP is quite uniform(median terms and number of terms per record) whereas ZIFF,WSJ and FR
have signiﬁcantly wider variance in lengths. The documents are formatted in SGML format
with a DTD. Fig3.2 shows the document structure

Figure 3.2 – Document structure in the collection
The topics in TREC 1 and 2(topics 51-150) have a long and complex structure. These
topics were designed to mimic real users need and made by system who are real users of the
retrieval system. TREC 1 and 2 topics consist of concepts which adds to the information needs
of the users. Topics of TREC 3(topics 151-200) are much shorter than TREC 1 and 2, and
also concept ﬁeld is removed to mimic more a general user which gives system no information
besides the query itself. Below are the structure of topics from TREC 1,2 and TREC 3
20

(a) TREC 1 Topic example
(b) TREC 3 Topic example
Figure 3.3 – TREC Topic example
Relevance Judgement File : It is quite necessary to find a list of relevant and non rele-
vant documents to evaluate the performance of the retrieval system. Relevant judgement files
contains this information. So this list should be as comprehensive as possible. All the three
TREC’s have used a pooling method based on Jones and Rijsbergen work[21] to create the
relevance assessments. In this method a pool of relevant file is generated by taking a sample
of relevant top X documents from various participating systems. This sample is then shown to
human judges.
3.1.2 The Task
The adhoc task [12] investigates the performance of systems that search a static set of doc-
uments using new topics. This task is similar to how a researcher might use a library – the
collection is known but the questions likely to be asked are not known. Fig. 4.1 depicts how
the adhoc task is accomplished in TREC. Participants are given a document collection consist-
ing of approximately 2 Gbytes of text and 50 new topics. The set of relevant documents for
these topics in the document set is not known at the time the participants receive the topics.
Participants produce a new query set, Q3, from the adhoc topics and run those queries against
the adhoc documents. The output from this run is the test result for the adhoc task.

Figure 3.4 – The Ad-hoc task for retrieval in TREC [12]
3.1.3 Retrieval Toolkit
For the retrieval we are using an open source tool called Terrier(, Terabyte Retriever)2 [22][23]
[25] [26]. Terrier is written in Java, and is developed at the School of Computing Science,
University of Glasgow.
Terrier is designed as a tool to evaluate, test and compare models and ideas, and to build sys-
tems for large-scale IR. Since it is an open source platform we decided to test our methods
and experiments using terrier. Information retrieval in terrier is done in three stages. First the
collection for the experiment is indexed , then choosing a matching model which terrier uses
for retrieval. Third step is the evaluation.
For the indexing the corpus of documents is handled by a Collection plugin, which generates
a stream of Document objects. Each Document generates a stream of terms, which are trans-
formed by a series of Term Pipeline components, after which the Indexer writes to disk.
2http://terrier.org/
22

Figure 3.5 – Indexing Structure of Terrier [25]
For the retrieval The application communicates with the Manager, which in turn runs the
desired Matching module. Matching assigns scores to the documents using the combination
of weighting model and score modifiers. Terrier provides a lot of weighting models 3. We
extended the terrier platform to include the Jelinek-Mercer smoothing model, extended Jelinek-
Mercer smoothing model model and extended dirichlet smoothing model for our experiments.
Figure 3.6 – The retrieval architecture of Terrier. [25]
For the evaluation Terrier for TREC data takes the evaluation file provided by us and com-
putes the scores. More details in next sections.
3http://terrier.org/docs/v4.1/configure retrieval.html

3.2 Experimental setup
3.2.1 Word Embedding Parameters and Data Pre-processing
Our dataset contains of 741K documents which contains roughly 171M tokens. Since this data
is in raw format it needs to be tokenized first. For this we used moses [37] tokenizer as a
standard way to tokenize the raw dataset. Since the chunk of the useful data is only contained
in ”TEXT” tag we decided to process only that. Other tags in the document contains data set
does not contain semantically useful information. In the next step we stemmed the data so as
to have simpler word form and increase the number of context for each word. We learn the
word vectors with these processed dataset. This data is then fed individually through a neural
network with one hidden layer of size 200 which corresponds to the dimension of the output
vectors. The learning rate α is choosen as 0.05. We learned the word embeddings with skip
gram model with window size of 10 with negative sampling with negative sample for each data
as 5. Mikolov [5] described the optimal value for small training data set to be between 5-20
and for large training between 2-5. Also stopwords like ”the” , ”in” etc. doesn’t provide any
useful semantic meanings which occur quite frequently in the corpus, to limit the usage of this
stopwords we subsampled the frequent words where the subsampling threshold is choosen to
be 1e−4 which is 0.0001. Also misspelled or very rare words doesn’t provide much information
of their context so we did not train tokens whose frequency is less than 20. The window size is
chosen as 10 as we assume it to be of idle size to capture the semantic regularities.
3.2.2 Terrier
We use only ”TEXT” tags of the documents to build the indices for Terrier. From query only
”topic” tag is used to fetch the query terms. While building the indices for the documents stop-
words are removed and stemming is done. Terrier has some matching models like BM25,
DrichletLM,PL2 etc. but for our experiments we implemented some new models namely
Jelinek-Mercer, Extended Jelinek-Mercer and Extended DrichletLM. For our experiments we
have chosen the classical retrieval method including BM25,Jelinek Mercer Smoothened lan-
guage model and Drichlet language model. We choose these models to test whether including
similarity in the language models adds to the increase in retrieval performance of the system.
We tested with different varying parameters of these probabilistic models. We have put the
results and various parameter values in section 3.6.
3.3 Evaluation metrics
The classical retrieval evaluation metrics are Precision, Recall, Mean Average Precision and
Precision at index. Lets assume ReldocRetrieved signifies the total number of relevant doc-
uments retrieved by the system, ReldocNotRetrieved signifies the total number of relevant
documents not retrieved by the system, NonReldocRetrieved signifies non relevant document
retrieved by the system then Precision is defined by
ReldocRetrieved
ReldocRetrieved +NonReldocRetrieved
(3.6)
Similarly recall is defined by
24

ReldocRetrieved
ReldocRetrieved +ReldocNotRetrieved
(3.7)
Precision takes all relevant document into account but if we want to evaluate only the top
most relevant documents we generally use precision at index. These measure is called precision
at n where n is any real number. Generally most common is Precision at 10 or P@10 .
Mean Average Precision or MAP is the mean of the average precisions for the query. Formally
MAP is deﬁned as
∑
Q
q=1 AvgP(q)
Q
(3.8)
Where Q is the number of queries.
Recall oriented system are good but if their precision at higher rank is low then probably the
system might not perform well in real life scenarios. As user don’t have time to read hundred’s
of document rather user relevant document at higher ranks.
3.4 Retrieval models and results
For the baseline we have taken the baseline from the ofﬁcial TREC-3 adhoc competition [20].
Below we will provide a short description about the best performers in the competition, their
results and their methodology in brief.
Citya1 [27]: They used probabilistic term weighting scheme with topic expansion of upto 40
terms, with dynamic passage retrieval in additional to the whole document retrieval.
INQ101 [30] : They used probabilistic weighting with an inference net. They also used topic
expansion and passage retrieval in addition to whole document matching. They used an external
thesaurus built by them.
CrnlEA: Their method was based on vector space model with term weighting called smart.They
used Rocchio relevance feedback to expand terms. No topic expansion or phrase retrieval was
done.
westp1 [32]: Their method is based on lines with INQ101 with document and phrases being
used for ranking. But topic expansion was done on minimal lines. // pircs1 [33] : Thier methods
were based on spreading activation model on parts of document(550-words). Topic expansion
was done using top 6 documents with the terms in original topic. Then the top 30 topics were
choosen.
ETH002 [34] : They used a combination of vector space model, passage retrieval model using
Hidden markov’s chains and topic expansion using document links.

Figure 3.7 – Trec-3 Adhoc results [20]
For the Jelinek-Mercer Language Model and Extended Jelinek Mercer model we used the
smoothing parameter(λ) value as 0.15 as we tested with other values and found this value to
be optimum in terms of recall and average precision. For the Dirichlet and Extended Dirichlet
language models we found the optimal value smoothing parameter to be 2500. Since Extended
Jelinek Mercer and Extended Dirichlet language models rely on Extensive document matching
that is every document should be matched against the query and in case of mismatch would
match against the similar words, this process become expensive both in terms of computation
and time taken which is around 2 days. This is not feasible for both experiments and real life
scenarios, so we decide to follow some heuristic. To reduce the matching time we decided to
take only those documents which contains atleast one query word. Formally
qt ∪d,∀t ∈ query(q),∀document(d) (3.9)
This process cut our matching time from 48 hours to 12 hours. Also we had set off a sim-
ilarity limit of 0.7 for words to limit the noise occurring from too many similar words. Lastly
to gain more on the matching time we precomputed the similarity of words. Since keeping
similarity of all words in the memory would require roughly 1 terabyte of memory, we only
computed the similarity of the query terms to all other words in the collection this reduced the
memory space to 1 gigabyte.
Below are the retrieval results from our experiments. There were total of 9805 relevant doc-
uments for the 50 set of queries. For each query 1000 document were retrieved and map and
precision at 10(p@10) is computed accordingly.
26

Trec-3 adhoc run
Model MAP Total Relevant
Retrieved
Precision@10
JelinekMercer 0.2129 4823 0.4340
Extended Jelinek
Mercer
0.1623 3928 0.3720
Drichlet LM 0.2282 5091 0.4940
Extended Drich-
let LM
0.1898 4557 0.4520
Figure 3.8 – Trec-3 Adhoc results
3.5 Results comparison
After the results, none of our models beat the baseline. So, we inspected our method and found
a major problem with our Toolkit. Terrier internally tokenizes the words in a very peculiar
way. Terrier removes all the punctuation even from internal structure of the words and then
removes the terms if they are less than 3 characters in them. So it removes all abbreviations
which contain 2 characters. For example ”U.S.” becomes ”us” and then it is removed from
the indexing, as well as the queries. Most of our queries contain these abbreviation specially
the ”U.S.”, so we assume we do not gain anything from that term. Also we miss the context
of the country. So our retrieval can suffer from that. This can be a factor but just to be sure
we decided to dig deeper for queries which retrieved least relevant documents in comparison

to relevant document present in the collection. The query with query-id 152 contains ”Accu-
sations of Cheating by Contractors on U.S Defense Projects” did not produce much relevant
documents as it only retrieved 95 documents out of 538 with the Extended Dirichlet language
model. For the query the document number ”FR88928-0019” seemed quite relevant to us as it
contained information about employees of contractors who use illegal drugs in defense projects
and department of defense passing different policies to curb that. This document was deemed
irrelevant in the relevance ﬁle. Relevance seemed to be another factor as not all documents had
been judged by the human judges.
To check our hypothesis that removal of abbreviation hurts the performance, we removed
all the instances of the term ”U.S.” to ”America” and re-indexed the whole collection again.
Then we did the same for the query as mentioned above we found a totally different set of
results with the same Extended Dirichlet language model. The relevant document retrieved
increased to 178 instead of 95. It even beat the classical Dirichlet model both in terms of mean
average precision and total relevant retrieved.
Figure 3.9 – Trec-3 Adhoc results for query 152
Another important document (WSJ870715-0135) which was highly ranked in the retrieval
speaks about Japanese companies joining defense contractors, where one of the contractors
”Toshiba” building for Russia too and banned after that. These statement from the document
”These statements that accuse Japan of being a leaky sieve of high technology – that doesn’t
help the situation at all.” . It also seemed quite relevant to us what was mentioned irrelevant.
What was surprising is that all the top documents concerned about contractors role in defense
contracts and their inability to cope with the situation. All documents spoke negatively of the
28

contractors. All the irrelevant documents even were semantically close to the query which was
the goal. Relevancy seemed quite subjective to us.
We ran our test with the Extended Dirichlet Language model on another query on which
we could not gain anything. On closer inspection we found the problem. The query id 179
contained query ”U.S. Restaurants in Foreign Lands”. Top results retrieved included informa-
tion like Japanese companies buying stake in American companies and spoke negatively of
Japanese corporate invasion. The second ranked relevant document did not yiel any semantic
similarity either it contained information about foreign shipping practices of U.S federal mar-
itime commission. One intuitive feeling of such a query to work is to have term dependence
among query terms. For example, instead of finding similar words for foreign which can be
alien, different etc., foreign should be dependent on the term America and similar word for
America should be searched without changing the term America. We also ran our experiments
after necessary changes in the toolkit and datasets, but we could not finish the experiments at
the time of writing this thesis. We could only get retrieval results of the first 21 queries for
which we show the results for both ExtendedDirichlet language model and Dirichlet language
model.
Figure 3.10 – Retrieval results
We can see both retrieval models perform on similar lines, we do not gain much better
results.

4
Conclusion and future work
The results we presented did not yield the intended results. We might not have beat the base-
lines but our experiments show us some interesting facts about including the similarities of
embeddings in retrieval. We found that for the query 152, all the top documents were in the
same semantic space, they all got the same essence of the query i.e. all of the documents spoke
negatively about the contractors in U.S. defense projects. This solidifies our assumption that
by using embeddings we can capture some form of semantic similarities of documents with
respect to the queries. It is hard to find out one single reason why our retrieval did not produced
good results. One of the possible reasons can be that since all the documents are not judged by
human annotators, it is quite possible that some of the documents that are retrieved by our sys-
tem might be relevant but were marked irrelevant in the absence of reviews. Also, relevance is a
quite subjective parameter for us – some of the results retrieved were relevant but were marked
irrelevant in the relevance file. The removal of abbreviations on occasions also affected our
retrieval performance. On a careful analysis of query 179, we found a glaring loophole in our
assumption. We found that simply altering the individual query terms with similar terms would
not retrieve relevant documents. Through the analysis we found out that there is an inherent re-
lationship among terms in the query and exploiting this relationship with similarity might yield
better results. Another possible reason might be that similar words contained a lot of noise. For
instance, the terms ti t1, t2 t3 are the most similar in that order but in this context only t3. One
possible approach is to take n number of most similar terms for all query terms and do retrieval
for this expanded query, without calculating similarity of all words in the document. This way
we might be able to remove the noise.
The language models are based on the assumption that each term is independent from other
terms in the term space. We would like to explore more on lines of other works where terms
dependence is taken into account. One possible way to exploit it is by taking into account the
grammatical properties of the language. Secondly, to exploit the true nature of the semantic
space of the embeddings, we would like to test our vector space model as proposed in chapter
3. The language model we used only exploited the cosine similarity of the embeddings whereas
we assume the vector space model might truly exploit the embedding space.

Bibliography
[1] Jay M.Ponte and W.Bruce Croft A language modelling approach to information, SI-
GIR’98
[2] L. Zhao and J. Callan. Term necessity prediction. In Proceedings of the 19th ACM Con-
ference on Information and Knowledge Management (CIKM ’10). 2010.
[3] Zhao, L. and Callan, J., Automatic term mismatch diagnosis for selective query expansion,
SIGIR 2012
[4] Jelinek, Frederick and Robert L. Mercer. 1980. Interpolated estimation of Markov source
parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in
Practice, Amsterdam, The Netherlands: North-Holland, May
[5] Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Dis-
tributed representations of words and phrases and their compositionality. Advances in
Neural Information Processing Systems.
[6] Mikolov, Tomas; et al. ”Efﬁcient Estimation of Word Representations in Vector Space”
[7] MACKAY, D. AND PETO, L. 1995. A hierarchical Dirichlet language model. Natural
Language Engineering 1, 3, 289-307.
[8] word2vec Parameter Learning Explained
[9] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman.
Indexing by latent semantic analysis. JASIS, 41(6):391-407, 1990
[10] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine
Learning Research, 3:993-1022, March 2003.
[11] CHENGXIANG ZHAI and JOHN LAFFERTY, A Study of Smoothing Methods for Lan-
guage Models Applied to Information Retrieval April 2004
[12] Ellen M. Voorhees, Donna Harman Overview of the Sixth Text REtrieval Conference
(TREC-6)
[13] Rocchio, J. J. 1971. Relevance feedback in information retrieval.
[14] Alan R Aronson. Metamap. Mapping text to the umls meta-thesaurus. 2006

[15] Buckley, Chris, Gerard Salton, and James Allan. 1994b. The effect of adding relevance
information in a relevance feedback environment. Proc. SIGIR, pp. 292-300. ACM Press.
[16] John Lafferty, Chengxiang Zhai Document language models, query models, and risk min-
imization for information retrieval,SIGIR 2001
[17] Fabio Crestani, Exploiting the Similarity of Non-Matching Terms at Retrieval Time, Sept
1999
[18] Mohannad ALMasri, KianLam Tan, , Catherine Berrut, , Jean-Pierre Chevallet, and
Philippe Mulhem Integrating Semantic Term Relations into Information Retrieval Sys-
tems Based on Language Models
[19] Jean-Pierre Chevallet X-IOTA: an open XML framework for IR experimentation, Pro-
ceeding AIRS’04
[20] DK Harmon Overview of the Third Text REtrieval Conference (TREC-3), 1996
[21] Sparck Jones and Van Rijsbergen 1975 K. Sparck Jones and C. J. Van Rijsbergen. 1975.
Report on the need for and provision of an ideal information retrieval test collection.
Technical Report 5266, Computer Lab., Univ. Cambridge.
[22] Craig Macdonald, Richard McCreadie, Rodrygo Santos and Iadh Ounis. From Puppy to
Maturity: Experiences in Developing Terrier. In Proceedings of the SIGIR 2012 Work-
shop in Open Source Information Retrieval.
[23] Iadh Ounis, Christina Lioma, Craig Macdonald, and Vassilis Plachouras. Research Di-
rections in Terrier: a Search Engine for Advanced Retrieval on the Web. In Novat-
ica/UPGRADE Special Issue on Next Generation Web Search, 8(1):49–56, 2007.
[24] Yoshua Bengio,R ˜A c jean Ducharme,Pascal Vincent,Christian Jauvin A Neural Proba-
bilistic Language Model,(2003)
[25] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Christina
Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In
Proceedings of ACM SIGIR’06 Workshop on Open Source Information Retrieval (OSIR
2006). 10th August, 2006. Seattle, Washington, USA.
[26] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald and Douglas
Johnson. Terrier Information Retrieval Platform. In Proceedings of the 27th European
Conference on Information Retrieval (ECIR 05).
[27] S.E. Robertson , S. Walker , S. Jones , M.M. Hancock-Beaulieu , M. Gatford, Okapi at
TREC-3 (1996)
[28] G. Salton, A. Wong, and C. S. Yang A Vector Space Model for Automatic Index-
ing,Communications of the ACM,(1975) vol. 18, nr. 11, pages 613-620.
[29] Fuchun Peng,Nawaaz Ahmed, Xin Li,Yumao Lu Context sensitive stemming for web
search (2007)
34

[30] John Broglio , James P. Callan , W. Bruce Croft , Daniel W. Nachbar Document Retrieval
and Routing Using the INQUERY System (1994)
[31] Automatic Query Expansion Using SMART : TREC 3
[32] Paul Thompson , Bokyung Yang , James Flood TREC-3 Ad Hoc Retrieval and Routing
Experiments using the WIN System (1995)
[33] K. L. Kwok , L. Grunfeld , D. D. Lewis TREC-3 Ad-Hoc, Routing Retrieval and Thresh-
olding Experiments using PIRCS (1995)
[34] Daniel Knaus , Elke Mittendorf , Peter Schauble Improving a Basic Retrieval Method by
Links and Passage Level Evidence (1995)
[35] Victor Lavrenko. W. Bruce Croft Relevance based language models
[36] David M. Blei, Andrew Y. Ng, Michael I. Jordan Latent Dirichlet Allocation,Journal of
Machine Learning Research 3 (2003) 993-1022
[37] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico,
Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer,
Ondrej Bojar, Alexandra Constantin, Evan Herbst. (2007) ”Moses: Open Source Toolkit
for Statistical Machine Translation”. Annual Meeting of the Association for Computa-
tional Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.

Word Embedding In IR

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Word Embedding In IR

Similar to Word Embedding In IR (20)

Word Embedding In IR