Using Word Embedding for Automatic Query Expansion
Dwaipayan Roy, Debjyoti Paul, Mandar Mitra and Utpal Garain
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
Abstract
• Automatic query expansion (AQE) intuitively requires terms which are semantically similar to the query terms.
• Word2Vec, a neural word embedding, captures both semantic and syntactic regularities in the language.
• We explore the Word2Vec framework to find semantically similar terms for query expansion.
• Experimental results show a significant improvement over the baseline, but the method lags behind RM3.
Motivation
The Word2Vec framework generates word embeddings which capture semantic and syntactic regularities in the language.
It has shown improved performance in clinical decision support and cross-lingual retrieval.
We try to answer the following questions:
1. Does QE, using the nearest neighbours of query terms, improve retrieval effectiveness?
2. If yes, is it possible to characterize the queries for which this QE method does / does not work?
3. How does embedding-based QE perform compared to an established QE technique like RM3 [1]?
Our Contribution
We aim to improve retrieval performance by finding semantically similar terms for query expansion in ad-hoc retrieval.
Similar terms are found by computing the K nearest neighbours (K-NN) of the query terms.
Our contribution is two-fold:
(1) a composition function over multi-term queries for finding K-NN terms;
(2) an incremental K-NN algorithm to reduce query drift during expansion.
Query Expansion Methods
Pre-retrieval KNN: For a query $Q$ with terms $q_1, q_2, \ldots, q_n$, the set of candidate expansion terms is
$$C = \bigcup_{q \in Q} NN(q)$$
where $NN(q)$ is the set of $k$ terms closest to $q$ in the embedding space.
The mean similarity between a candidate expansion term $t$ and all terms in $Q$ is computed as
$$Sim(t, Q) = \frac{1}{|Q|} \sum_{q_i \in Q} \vec{t} \cdot \vec{q}_i$$
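As an illustration, a minimal sketch of this step in Python, assuming a trained gensim KeyedVectors model kv; the helper names knn_candidates and mean_similarity are ours, not from the poster:

import numpy as np
from gensim.models import KeyedVectors

def knn_candidates(kv: KeyedVectors, query_terms, k=10):
    # C: the union of the k nearest neighbours of every query term
    candidates = set()
    for q in query_terms:
        if q in kv:
            candidates.update(w for w, _ in kv.most_similar(q, topn=k))
    return candidates - set(query_terms)

def mean_similarity(kv: KeyedVectors, t, query_terms):
    # Sim(t, Q): mean dot product between t and the query term vectors
    vecs = [kv[q] for q in query_terms if q in kv]
    return float(np.mean([np.dot(kv[t], v) for v in vecs]))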
Post-retrieval KNN follows a similar approach, except that the search space for expansion terms is reduced to the pseudo-relevant documents.
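A sketch of the post-retrieval variant under the same assumptions; feedback_vocab, the vocabulary of the top-ranked feedback documents, is an assumed input:

def post_retrieval_knn(kv, query_terms, feedback_vocab, k=10):
    # Candidates are restricted to terms occurring in the
    # pseudo-relevant documents (feedback_vocab is an assumed input).
    pool = [t for t in feedback_vocab if t in kv and t not in query_terms]
    pool.sort(key=lambda t: mean_similarity(kv, t, query_terms), reverse=True)
    return pool[:k]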
Incremental KNN: Let the nearest neighbours of $q$, in decreasing order of similarity, be $t_1, t_2, \ldots, t_N$.
We prune the $K$ least similar neighbours to obtain $t_1, t_2, \ldots, t_{N-K}$.
Next, we consider $t_1$ and reorder the terms $t_2, \ldots, t_{N-K}$ in decreasing order of similarity with $t_1$.
Again, the $K$ least similar neighbours in the reordered list are pruned to obtain $t'_2, t'_3, \ldots, t'_{N-2K}$. Next, we pick $t'_2$ and repeat the same process.
This continues for $l$ iterations.
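A sketch of the incremental pruning loop as we read it from the description above; N, K and the iteration count l are free parameters, and the implementation details are our assumptions:

def incremental_knn(kv, q, N=100, K=10, l=5):
    # Start from the N nearest neighbours of q, ordered by similarity to q.
    terms = [w for w, _ in kv.most_similar(q, topn=N)]
    terms = terms[:-K]  # prune the K neighbours least similar to q
    for i in range(l):
        if len(terms) - (i + 1) <= K:  # too few terms left to prune again
            break
        pivot, rest = terms[i], terms[i + 1:]
        # Reorder the remaining terms by similarity to the current pivot ...
        rest.sort(key=lambda w: kv.similarity(pivot, w), reverse=True)
        # ... and prune the K least similar ones.
        terms = terms[:i + 1] + rest[:-K]
    return terms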
Composition of Query
Given a query $Q$ consisting of $m$ terms $\{q_1, \ldots, q_m\}$, we first construct $Q_c$, the set of query word bigrams:
$$Q_c = \{\langle q_1, q_2 \rangle, \langle q_2, q_3 \rangle, \ldots, \langle q_{m-1}, q_m \rangle\}$$
We define the embedding of a bigram $\langle q_i, q_{i+1} \rangle$ as simply $\vec{q}_i + \vec{q}_{i+1}$.
Next, we define the extended query term set (EQTS) $Q'$ as $Q' = Q \cup Q_c$.
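A sketch of the composition step, again assuming a gensim KeyedVectors model; extended_query_terms is a hypothetical helper name:

def extended_query_terms(kv, query_terms):
    # EQTS: Q' = Q ∪ Q_c, with each bigram embedded as a vector sum.
    vectors = {q: kv[q] for q in query_terms if q in kv}
    for a, b in zip(query_terms, query_terms[1:]):
        if a in kv and b in kv:
            vectors[(a, b)] = kv[a] + kv[b]  # <q_i, q_{i+1}> -> q_i + q_{i+1}
    return vectors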
Retrieval
The expanded query model interpolates the original query model with the normalized similarities of the expansion terms:
$$P(w \mid Q_{exp}) = \alpha \, P(w \mid Q) + (1 - \alpha) \, \frac{Sim(w, Q)}{\sum_{w' \in Q_{exp}} Sim(w', Q)}$$
Here $\alpha$ is the interpolation parameter used to combine the original, unexpanded query with the expansion terms, and the expansion terms are drawn from $Q_K$, the set of top $K$ terms from $C$, the set of candidate expansion terms.
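The interpolation itself reduces to a weighted combination; a sketch, where p_w_given_q is the original query language model, sims maps each expansion term to Sim(w, Q), and alpha = 0.4 is an assumed (not reported) value:

def expanded_query_model(p_w_given_q, sims, alpha=0.4):
    # P(w|Q_exp) = alpha * P(w|Q) + (1 - alpha) * Sim(w,Q) / sum of Sim over Q_exp
    z = sum(sims.values()) or 1.0  # guard against an empty/zero similarity mass
    return {w: alpha * p_w_given_q.get(w, 0.0) + (1 - alpha) * sims[w] / z
            for w in sims}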
Results
[Result tables not recovered from the poster: Experimental Setup; Dataset Overview; Use Composition or Not?; Benefit of Pairwise Composition.]
Discussion
The semantic and contextual information captured by the Word2Vec embedding is leveraged here.
Query expansion intuitively calls for finding terms which are similar to the query, and terms which occur frequently in the relevant documents.
In the proposed expansion method, terms similar to the query terms in the collection-level embedding space are considered for expansion.
For post-retrieval KNN, the search space for expansion terms is reduced to the pseudo-relevant feedback documents.
Incremental KNN reduces query drift beyond semantic similarity, which is not the case in the pre-retrieval or post-retrieval methods. This justifies the consistently better performance of the incremental method.
The experiments, in which RM3 performs better on the TREC ad-hoc and web collections, suggest that co-occurrence statistics are more powerful than similarity in the abstract embedding space.
Future work
The obvious next step is to incorporate collection-level co-occurrence statistics along with Word2Vec similarity.
Addressing the generalization effect introduced by the embedding should further improve the performance of the proposed method; local retraining of the word embedding might be one possibility.