Robust Query Expansion Based on Query-Drift
Prevention
Research Thesis
In Partial Fulfillment of The
Requirements for the Degree of
Master of Science in Information Management Engineering
Liron Zighelnic
Submitted to the Senate of the Technion - Israel Institute of
Technology
Haifa, Sivan 5770 (May 2010)
The Research Thesis Was Done Under The Supervision of Dr. Oren
Kurland in the Faculty of Industrial Engineering and Management.
The Helpful Comments Of The Reviewers Of SIGIR 2008 Are
Gratefully Acknowledged
The Generous Financial Help Of The Technion - Israel Institute of
Technology Is Gratefully Acknowledged
This paper is based upon work supported in part by Google’s and
IBM’s faculty research awards. Any opinions, findings and
conclusions or recommendations expressed are those of the authors
and do not necessarily reflect those of the sponsors.
ACKNOWLEDGMENTS
I would like to thank my advisor Dr. Oren Kurland for his guidance
I would like to thank my colleagues, especially Anna Shtok, Erez
Karpas, Nataly Soskin, Inna Gelfer Kalmanovich and Lior Meister
I would like to express a deep gratitude to my parents, Zahava and
Michael Zighelnic, for their love and support
I would like to thank my dear husband Ilan for his help, love and
support and for always standing by me
Contents
1 Introduction
2 Background
  2.1 LM - Introduction and Models
  2.2 Language Model Estimation
  2.3 Smoothing
  2.4 Measures
3 Query Expansion
  3.1 Pseudo-Feedback
  3.2 Query Expansion Models
  3.3 The Performance Robustness Problem
4 Query Drift Prevention
  4.1 Improving Robustness Using Fusion
  4.2 Score-Based Fusion Methods
    4.2.1 Symmetric Fusion Methods
    4.2.2 Re-ordering Methods
  4.3 Rank-Based Fusion Methods
5 Related Work
  5.1 Fusion
  5.2 Improving the Performance Robustness of Pseudo-Feedback-Based Retrieval
6 Experiments
  6.1 Experimental Setup
    6.1.1 Corpora
    6.1.2 Evaluation Metrics
    6.1.3 Parameter Tuning
  6.2 Experimental Results
    6.2.1 Score-Based Fusion Methods
      6.2.1.1 Best_MAP and Best_P@5 Settings
      6.2.1.2 Analyzing Robustness Improvement due to Score-based Fusion Methods
      6.2.1.3 The Effect of the Query-Anchoring Parameter Used in RM3 on the Performance of RM3 and CombMNZ [RM3]
      6.2.1.4 Comparison with a Cluster-Based Method
      6.2.1.5 Robust_MAP and Robust_P@5 Settings
      6.2.1.6 Re-ranking Methods
    6.2.2 Rank-Based Fusion Methods
    6.2.3 Score-Based and Rank-Based Fusion Methods Comparison
7 Conclusions and Future Work

List of Tables
1 TREC corpora used for experiments
2 Score-based fusion methods applied for Best_MAP and Best_P@5 on RM1 and RM3
3 Score-based fusion methods applied for Best_MAP and Best_P@5 on Rocchio1 and Rocchio3
4 Score-based methods vs. cluster-based method [44] applied for Best_MAP RM3
5 Score-based fusion methods applied for Robust_MAP and Robust_P@5 on RM1 and RM3
6 Score-based fusion methods applied for Robust_MAP and Robust_P@5 on Rocchio1 and Rocchio3
7 Rerank vs. Rev_rerank applied for Best_MAP and Best_P@5
8 Re-ranking methods applied for Best_MAP RM1
9 Rank-based fusion methods applied for Best_MAP and Best_P@5

List of Figures
1 The difference between the AP — average precision, an evaluation measurement of the general quality of ranking — per query of Dinit (initial ranking) and that of the expansion-based models (RM1 or RM3) over the ROBUST corpus (queries 301-350)
2 Score-based fusion methods applied for Best_MAP on RM1 and RM3 - a graph representation of Table 2
3 Score-based fusion methods applied for Best_MAP on Rocchio1 and Rocchio3 - a graph representation of Table 3
4 Robustness improvement posted by CombMNZ
5 Robustness improvement posted by Interpolation
6 RM3 and CombMNZ [RM3] MAP and robustness performance when varying RM3's query-anchoring parameter
7 Score-based vs. rank-based CombMNZ
8 Score-based vs. rank-based Interpolation
Abstract
Search engines (e.g., Google) have become a crucial means for find-
ing information in large corpora (repositories). The ad hoc retrieval
task is the core challenge that search engines have to cope with: find-
ing documents that pertain to an information need underlying a query.
Pseudo-feedback-based (query expansion) retrieval is the process
by which the documents most highly ranked by a search performed in
response to a query are used for forming an expanded query, which
is used for (re-)ranking the entire corpus. The underlying idea is to
automatically “enrich” the original query with (assumed) related terms
so as to bridge the “vocabulary gap” between short queries and rele-
vant documents. While this process improves retrieval effectiveness on
average, there are many queries for which the resultant effectiveness
is inferior to that obtained by using the original query. In this work
we address this performance robustness problem by tackling the query
drift problem — the shift in "intention" from the original query to its
expanded form.
One approach for ameliorating query drift that we present relies on
fusing the lists retrieved in response to the query and to its expanded
form so as to perform query-anchoring of the latter. While the fusion-
based approach relies on co-occurrence of documents in retrieved lists,
our second approach re-ranks the (top) retrieval results of one retrieval
method by the scores assigned by another. Both approaches are based
on the assumption that documents retrieved in response to the ex-
panded query, and which exhibit high query similarity, are less likely
to exhibit query drift.
We show empirically that our approach posts significantly better
performance than that of retrieval based only on the original query. Our
approach is also more robust than retrieval using the expanded query,
albeit with somewhat lower average performance. This performance-
robustness trade-off characterizes most of the methods that we have
examined.
1 Introduction
Search engines (e.g., Yahoo!, Google) have become a crucial tool for finding
information in large corpora (repositories) of digital information (e.g., the
Web). One of the most important tasks that a search engine has to perform
is to find the most relevant documents to an information need underlying a
user query; this task is also known as the ad hoc retrieval task [72].
It is known that users tend to use (very) short queries to describe their
information needs [77]. This reality poses significant challenges to a retrieval
system. For example, short queries might be interpreted in different ways,
each suggesting a different potential information need (a.k.a. the polysemy
problem). Another challenge in handling short queries is the vocabulary mis-
match problem [13, 87]: some (or even all) query terms might not appear in
relevant documents.
One approach for addressing the vocabulary mismatch problem is to ask
the user to provide “feedback” on documents returned by an initial search
(e.g., via marks indicating which documents are relevant) so as to help the
retrieval algorithm to better “focus” the search; this can be done, for example,
by using additional terms (from the relevant documents) to expand the query
with [70]. However, in the vast majority of cases, no such user feedback is
available. Therefore, researchers proposed to take a pseudo-feedback (a.k.a.
blind feedback) approach to query expansion, which treats the documents
most highly ranked by an initial search as relevant; then, information induced
from these documents is used for modifying (expanding) the original query
[13, 87].
Naturally, there are several inherent problems in pseudo-feedback-based
query expansion. The first is that not all (and in many cases, only very few)
of the documents “treated” as relevant are indeed relevant; thus, the newly
formed “query model” does not necessarily reflect the information need un-
derlying the original query. Furthermore, the initially retrieved document
list that serves as the feedback set may not manifest all query-related aspects [9],
and therefore, the new query model might not fully represent the
original information need. Indeed, it is known that while on average, pseudo-
feedback-based query expansion methods improve retrieval effectiveness over
that of retrieval using the original query alone, there are numerous queries
for which this is not true; that is, retrieval based on the original query alone
yields substantially better retrieval effectiveness than that of using an ex-
panded query form [20, 17]. This is known as the robustness problem of
pseudo-feedback-based query expansion retrieval methods.
The goal of the work presented here is to improve the robustness of
pseudo-feedback-based query expansion retrieval via the potential preven-
tion (amelioration) of query drift — the shift in “intention” from the original
query to its expanded form [57]. There is a strong trade-off between the
retrieval effectiveness (performance) and its robustness. This trade-off is due
to the fact that effectiveness is often measured as average over queries, and as
such, large improvement for few queries could significantly affect the average
result. Performance robustness (or lack thereof), on the other hand, is determined
by the percentage of queries for which retrieval effectiveness of using the ex-
panded form is inferior to that of retrieval based only on the original query. In
this work we aim to improve the robustness of pseudo-feedback-based query
expansion retrieval, while keeping its average effectiveness high.
Our first approach for preventing query drift is based on fusion; specif-
ically, we use fusion of document lists [27] that are retrieved by using the
original query and its expanded form. We "reward" documents that are
highly ranked both in response to the expanded form and to the original
query. We hypothesize that documents that are highly ranked in both re-
trieved lists are good candidates for being relevant since they constitute a
“good match” to both forms of the presumed information need. A document
ranked high by the initial retrieval can be assumed to have a high surface
level similarity to the original query (since the initial retrieval is often based
on similarity between the query and the document), while query expansion
can add aspects that were not in the original query but may be relevant
to the information need and may improve the retrieval. Hence, a document
that is ranked high by both the initial retrieval and the retrieval that is based
on the expanded form is assumed to be less likely to suffer from query drift.
Another advantage of using fusion for the drift prevention task is its per-
formance efficiency due to the minimal overhead it has, resulting from the
fact that no additional retrieval is required (in addition to the initial and the
expansion-based retrievals).
We experimented with a score-based fusion approach and a rank-based
fusion approach. The former uses the retrieval scores assigned to the doc-
ument by the retrieval method, while the latter uses the positioning of the
document in the retrieved lists. Empirical evaluation of some fusion meth-
ods shows the promise of this direction. Specifically, through an array of
experiments conducted over various TREC corpora [83], which are standard
information retrieval benchmarks, we show that such a fusion-based approach
can improve the robustness of pseudo-feedback-based query expansion meth-
ods, without substantially degrading their average performance. We also
show that a score-based fusion approach yields better performance for this
task than a rank-based fusion approach. We demonstrate the merits of our
methods with respect to a state-of-the-art pseudo-feedback-based method
that was designed to improve performance robustness.
The second approach that we examine is based on using re-ranking
methods to improve the robustness of pseudo-feedback-based query expan-
sion. While fusion-based approaches [27] rely on co-occurrence of documents
in retrieved lists, the re-ranking methods re-order the (top) retrieval results
of one retrieval method by the scores assigned by another retrieval method.
We show that the re-ranking approach can improve the robustness of pseudo-
feedback-based query expansion methods, with a minimal degrading of their
average performance.
Finally, we note that most previous approaches for improving the robustness
of pseudo-feedback-based retrieval use pre-retrieval query-anchoring —
anchoring at the model level, that “emphasizes” the query terms when con-
structing the expanded form [90, 1]. In contrast, our approach is a post-
retrieval query-anchoring paradigm, which performs query-anchoring via fu-
sion or re-ranking of retrieval results. We show that most of our methods
are effective when applied on expansion-based models, whether they perform
pre-retrieval query-anchoring or not, where the latter’s robustness improve-
ment is naturally larger than the former’s. Hence pre-retrieval and
post-retrieval query-anchoring can be viewed as complementary.
2 Background
The retrieval models we use in this thesis utilize statistical language models
(LM). This chapter presents the concept of LM and lays down the definitions
and notations that will be used throughout this thesis.
2.1 LM - Introduction and Models
In order to rank documents we define p(d|q) - the probability that document
d is relevant to the query q. Using Bayes’ rule we get that

    p(d|q) = \frac{p(q|d)\,p(d)}{p(q)};

p(q) is document-independent. We assume that every d has the same prior
probability of relevance; i.e., p(d) is assumed to be uniformly distributed over
the corpus. Under these assumptions we score d by p(q|d), which is rank
equivalent to p(d|q). We use pd(q) as an estimate for p(q|d) [62, 56, 76, 24, 91].
In this thesis we use the term statistical language model to refer to a
probability distribution that models the generation of strings in a given lan-
guage. Various language models were developed for a variety of language
technology tasks such as: speech recognition, machine translation, document
classification and routing, spelling correction, optical character recognition
and handwriting recognition [69].
Ponte and Croft were the first to use LMs for the ad hoc information
retrieval task [62]. Since then many new models have been proposed, among
which are the query likelihood model and the model comparison approach [38]. The Query
Likelihood Model is a widely used language model in IR [62, 56, 76, 24]. This
model estimates the probability that the model induced from document d
generates the terms in the query q (denoted as pd(q)), which is high when
the query terms appear many times in the document.
Model Comparison is a method that creates language models from both
the document and the query and compares these models in order to estimate
the difference between them [38].
6
2.2 Language Model Estimation
There are various methods for estimating pd(q). The unigram language model
is based on the assumption that terms are independent of each other; this
assumption is applied to the occurrence of the terms qi in the query q (qi ∈ q),
as well as to the terms in the document (a.k.a. the bag-of-terms representation).
Under these assumptions we estimate the query likelihood by:

    p_d(q) \stackrel{def}{=} \prod_{q_i} p_d(q_i)    (1)

where pd(qi) represents the probability of the term qi given document d.
Let tf(w ∈ x) denote the number of times the term w occurs in the text
(text collection) x. We use a maximum likelihood estimate (MLE) for a
multinomial model with the unigram assumption. Specifically, we estimate
the probability of generating the term w from a language model induced from
text x as:

    p^{MLE}_x(w) \stackrel{def}{=} \frac{tf(w \in x)}{\sum_{w'} tf(w' \in x)}; \quad \sum_{w'} tf(w' \in x) \text{ is the length of } x.    (2)
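As a concrete illustration, the following sketch (with illustrative helper names, not taken from the thesis) computes the MLE unigram model of Equation (2) and the query likelihood of Equation (1) from raw token counts.

    from collections import Counter

    def mle_unigram(tokens):
        """Maximum-likelihood unigram model: p_x^MLE(w) = tf(w in x) / |x|."""
        counts = Counter(tokens)
        length = sum(counts.values())
        return {w: tf / length for w, tf in counts.items()}

    def query_likelihood(query_tokens, doc_model):
        """p_d(q): product over the query terms of p_d(q_i), assuming term independence.
        An unseen term yields probability 0 (the zero-probability problem discussed next)."""
        p = 1.0
        for t in query_tokens:
            p *= doc_model.get(t, 0.0)
        return p

    doc_model = mle_unigram("the jaguar is a wild animal of south america".split())
    print(query_likelihood("wild jaguar".split(), doc_model))  # non-zero
    print(query_likelihood("wild cats".split(), doc_model))    # 0.0, which motivates smoothing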
2.3 Smoothing
One of the problems with using the MLE as defined above is that terms may appear
very sparsely in documents, or not at all, causing problems with the
estimation of pd(q) as defined in Equation (1); this is known as the zero
probability problem. Terms that are not part of the
text at all may still be connected with the information need underlying the
query, or may even have been used in the query itself. If we estimate these
terms’ probabilities as 0, then following Equation (2) we get a strict conjunctive
semantics: the document’s LM will assign a query a non-zero probability only
if all of the query terms appear in the document. To avoid this problem we
smooth the MLE. In general, the following smoothing methods decrease the
probabilities assigned by the LM to the words seen in the text and increase the
probabilities of the unseen words, using the corpus language model [16].
Jelinek-Mercer Smoothing: The Jelinek-Mercer based probability assigned
to a term [34], which will be denoted in this work as p^{JM[\lambda]}(·), uses a
linear interpolation of the maximum likelihood estimate induced from the
document and that induced from the corpus C. Specifically, we use a
free parameter λ to control the influence of these models [91]:

    p^{JM[\lambda]}_d(w) = (1 - \lambda)\, p^{MLE}_d(w) + \lambda\, p^{MLE}_C(w)

Bayesian Smoothing Using Dirichlet Priors: Following Zhai and Laf-
ferty [38] we can set \lambda = \frac{\mu}{\mu + \sum_{w'} tf(w' \in x)}, where µ is a free parameter, and get
the Bayesian smoothing assigned to a term and induced from document d,
using Dirichlet priors, which will be denoted in this work as p^{Dir[\mu]}_x(·). This
smoothing, unlike the Jelinek-Mercer method, depends on the text’s length.
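For concreteness, here is a minimal sketch of the two smoothing schemes, assuming the document and the corpus are represented as simple term-count dictionaries; the function names are illustrative and not part of the thesis.

    def p_jm(w, doc_counts, corpus_counts, lam):
        """Jelinek-Mercer: (1 - lambda) * p_d^MLE(w) + lambda * p_C^MLE(w)."""
        p_d = doc_counts.get(w, 0) / sum(doc_counts.values())
        p_c = corpus_counts.get(w, 0) / sum(corpus_counts.values())
        return (1 - lam) * p_d + lam * p_c

    def p_dir(w, doc_counts, corpus_counts, mu):
        """Dirichlet smoothing: Jelinek-Mercer with lambda = mu / (mu + |d|),
        so shorter documents are smoothed more heavily than longer ones."""
        doc_len = sum(doc_counts.values())
        return p_jm(w, doc_counts, corpus_counts, mu / (mu + doc_len))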
2.4 Measures
Kullback-Leibler (KL) divergence: KL divergence [35], which is named
after Kullback and Leibler, is a non-commutative measure of the difference
between two probability distributions A and B:

    KL(A \,\|\, B) = \sum_i A(i) \log \frac{A(i)}{B(i)}

where i is an event.
Using this measure to estimate the difference between the query model
and the document model can be stated as follows [38]:

    KL(p_q(\cdot) \,\|\, p_d(\cdot)) = \sum_w p_q(w) \log \frac{p_q(w)}{p_d(w)}
where w is a term in the vocabulary.
In practice we use MLE for pq(w). When using MLE, the query likelihood
model and the KL ranking model are rank equivalent [38].
Cross Entropy (CE): CE (like KL) is a non-commutative measure of
the difference between two probability distributions A and B:

    CE(A \,\|\, B) = -\sum_i A(i) \log B(i)

where i is an event with the probabilities A(i) and B(i). Using this measure
to estimate the difference between the query model and the document model
can be stated as follows:

    CE(p_q(\cdot) \,\|\, p_d(\cdot)) = -\sum_w p_q(w) \log p_d(w)

where w is a term in the vocabulary.
As can be seen, KL differs from CE only by the entropy of the first (left)
distribution, which does not depend on the second one. As shown by Lafferty
and Zhai [38], CE is rank equivalent to the KL ranking model and to the
query likelihood: KL(q \,\|\, d) \stackrel{rank}{=} CE(q \,\|\, d).
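The rank equivalence can be verified with a short sketch: for a fixed query model, KL and CE differ only by the query model's entropy, which does not depend on the document (the toy distributions below are made up).

    import math

    def cross_entropy(p_q, p_d):
        """CE(q || d) = -sum_w p_q(w) * log p_d(w); p_d is assumed smoothed (non-zero)."""
        return -sum(p * math.log(p_d[w]) for w, p in p_q.items() if p > 0)

    def kl_divergence(p_q, p_d):
        """KL(q || d) = CE(q || d) - H(q); the H(q) term is document-independent."""
        h_q = -sum(p * math.log(p) for p in p_q.values() if p > 0)
        return cross_entropy(p_q, p_d) - h_q

    p_q = {"wild": 0.5, "jaguar": 0.5}
    docs = {"d1": {"wild": 0.3, "jaguar": 0.2, "car": 0.5},
            "d2": {"wild": 0.05, "jaguar": 0.05, "car": 0.9}}
    by_ce = sorted(docs, key=lambda d: cross_entropy(p_q, docs[d]))
    by_kl = sorted(docs, key=lambda d: kl_divergence(p_q, docs[d]))
    assert by_ce == by_kl  # the two measures induce the same document ranking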
3 Query Expansion
Search engine users tend to use (very) short queries to describe their informa-
tion needs [77]. This reality poses significant challenges to retrieval systems.
One of the problems with short queries is that a single query can be the
derivative of different potential information needs (a.k.a. the polysemy prob-
lem). For example, the query “Jaguar” can be interpreted as “a nice car” or
as “a wild animal”.
Another challenge in handling short queries is the vocabulary mismatch
problem [13, 87]: some (or even all) query terms might not appear in relevant
documents. For example, the text span “nature pictures” is relevant to the
query “view photos”, although it does not contain query terms.
3.1 Pseudo-Feedback
One approach for addressing the vocabulary mismatch problem is asking the
user to provide “feedback” on documents returned from an initial search,
for example, by asking the user to mark documents that are in his opinion
relevant. This “feedback” is then used to improve the retrieval performance.
This can be done, for example, by using additional terms (from the relevant
documents) to expand the query with [70].
However, in most cases user feedback is unavailable. This is the reason
why researchers proposed taking a pseudo-feedback (a.k.a. blind feedback)
approach to query expansion, as is done in this work. In the pseudo-feedback
approach the documents most highly ranked by an initial search are usually
treated as relevant, and information induced from these documents is used
for modifying (expanding) the original query [13, 87].
We compute an initial retrieval score of document d in response to query
q by using cross entropy:

    Score_{init}(d) \stackrel{def}{=} \exp\left(-CE\left(p^{Dir[0]}_q(\cdot) \,\Big\|\, p^{Dir[\mu]}_d(\cdot)\right)\right)

We use Dinit to denote the initial list — the set of documents d with the
highest Scoreinit(d), and n as the size of Dinit.
We compute an expanded retrieval score of document d in response to
query q using a pseudo-feedback-based query expansion model (see Sec-
tion 3.2). We use PF(Dinit) to denote the expansion-based list — the set
of documents d with the highest Scorepf(d). Scoreinit(d) and Scorepf(d) are
set to 0 for documents that are not in the respective lists.
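A small sketch of how these two retrieval lists might be produced; the helper names and the way the smoothed models are represented are assumptions for illustration, not the thesis' implementation.

    import math

    def score_init(query_mle, doc_dir_model):
        """Score_init(d) = exp(-CE(p_q^Dir[0] || p_d^Dir[mu])): the query model is the
        unsmoothed MLE, while the document model is Dirichlet-smoothed and therefore
        assumed to assign a non-zero probability to every query term."""
        ce = -sum(p * math.log(doc_dir_model[w]) for w, p in query_mle.items() if p > 0)
        return math.exp(-ce)

    def top_n(scores, n):
        """Keep the n highest-scoring documents; any document outside the returned
        list is treated as having score 0 by the fusion methods of Chapter 4."""
        best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
        return dict(best)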
3.2 Query Expansion Models
The Relevance Model - RM1: The relevance model is a well-known query
expansion model [41, 40]. The relevance model paradigm assumes that there
exists a (language) model RM1 that generates terms both in the query and
in the relevant documents. The basic relevance model is defined by:

    p_{RM1}(w) \stackrel{def}{=} \sum_{d \in D_{init}} p_d(w)\, p(d|q)    (3)

This can be interpreted as a weighted average of pd(w) with weights given by
p(d|q) — the probability that document d is relevant to the query q, which is
estimated here as a normalized query likelihood (see Equation (1)) via Bayes’ rule.
Using the LM definitions from Section 2.1, we can estimate RM1. Let the qi
denote the query terms. The RM1 model is then defined by

    p_{RM1}(w; n, \alpha) \stackrel{def}{=} \sum_{d \in D_{init}} p^{JM[\alpha]}_d(w) \, \frac{\prod_i p^{JM[\alpha]}_d(q_i)}{\sum_{d_j \in D_{init}} \prod_i p^{JM[\alpha]}_{d_j}(q_i)}.

In practice, the relevance model is clipped by setting pRM1(w; n, α) to 0
for all but the β terms with the highest pRM1(w; n, α) to begin with. This is
done in order to improve retrieval performance and computation speed [18,
55, 23]; further normalization is then performed to yield a valid probability
distribution, which we denote by pRM1(·; n, α, β). We score document d with
respect to the relevance model pRM1(·; n, α, β) by

    Score_{RM1}(d) \stackrel{def}{=} \exp\left(-CE\left(p^{Dir[\mu]}_d(\cdot) \,\Big\|\, p_{RM1}(\cdot; n, \alpha, \beta)\right)\right).
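The construction can be sketched as follows, under the assumption that each feedback document is represented by a JM-smoothed term distribution; the names are illustrative, not the thesis' code.

    from collections import defaultdict

    def estimate_rm1(d_init_models, query_terms, beta):
        """RM1 sketch: a mixture of the feedback-document models, each weighted by
        its normalized query likelihood, clipped to the beta most probable terms
        and renormalized. `d_init_models` maps doc id -> smoothed term distribution."""
        weights = {}
        for d, model in d_init_models.items():
            likelihood = 1.0
            for q in query_terms:
                likelihood *= model.get(q, 0.0)
            weights[d] = likelihood
        total = sum(weights.values()) or 1.0

        rm1 = defaultdict(float)
        for d, model in d_init_models.items():
            for term, p in model.items():
                rm1[term] += (weights[d] / total) * p

        clipped = dict(sorted(rm1.items(), key=lambda kv: kv[1], reverse=True)[:beta])
        z = sum(clipped.values()) or 1.0
        return {t: p / z for t, p in clipped.items()}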
The Interpolated Relevance Model - RM3: The relevance model
RM1 as shown above may suffer from query drift — a shift of intention with
respect to the original query [57]. In order to further emphasize the original
query, Abdul-Jaleel et al. [1, 23] suggested query-anchoring at the model level
(i.e., pre-retrieval query-anchoring); specifically, using a linear interpolation
between the relevance model and the original query model:

    p_{RM3}(w) \stackrel{def}{=} \lambda\, p^{MLE}_q(w) + (1 - \lambda)\, p_{RM1}(w)    (4)

Using the LM definitions from Section 2.1 we get:

    p_{RM3}(w; n, \alpha, \beta, \lambda) \stackrel{def}{=} \lambda\, p^{Dir[0]}_q(w) + (1 - \lambda) \sum_{d \in D_{init}} p^{JM[\alpha]}_d(w) \, \frac{\prod_i p^{JM[\alpha]}_d(q_i)}{\sum_{d_j \in D_{init}} \prod_i p^{JM[\alpha]}_{d_j}(q_i)}.

We can then score document d by:

    Score_{RM3}(d) \stackrel{def}{=} \exp\left(-CE\left(p^{Dir[\mu]}_d(\cdot) \,\Big\|\, p_{RM3}(\cdot; n, \alpha, \beta, \lambda)\right)\right).
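In other words, RM3 simply anchors RM1 to the original query model; a short sketch with hypothetical names:

    def estimate_rm3(query_mle, rm1_model, lam):
        """RM3 sketch: lam * p_q^MLE(w) + (1 - lam) * p_RM1(w), i.e. pre-retrieval
        query-anchoring of the relevance model."""
        vocab = set(query_mle) | set(rm1_model)
        return {w: lam * query_mle.get(w, 0.0) + (1 - lam) * rm1_model.get(w, 0.0)
                for w in vocab}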
Rocchio: If we set p(d|q) to a uniform distribution we get the following
model, which is reminiscent of Rocchio’s relevance feedback model in the
vector space [68]:

    p_{Rocchio3}(w; n, \beta) \stackrel{def}{=} \lambda\, p^{MLE}_q(w) + (1 - \lambda) \cdot \frac{1}{n} \sum_{d \in D_{init}} p_d(w)    (5)

where n is the size of Dinit. Note that due to the uniform distribution
assumption, all documents in Dinit are equal contributors to the constructed
model. When λ = 0, Rocchio’s algorithm discards anchoring with the original
query model and relies only on the expansion-based component:

    p_{Rocchio1}(w; n, \beta) \stackrel{def}{=} \frac{1}{n} \sum_{d \in D_{init}} p_d(w)    (6)

We can then score document d by:

    Score_{Rocchio}(d) \stackrel{def}{=} \exp\left(-CE\left(p^{Dir[\mu]}_d(\cdot) \,\Big\|\, p_{Rocchio}(\cdot; n, \beta)\right)\right).
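A corresponding sketch for the Rocchio variants, which replace the p(d|q) weights with an unweighted centroid of the feedback documents (illustrative names; lam = 0 gives Rocchio1, lam > 0 gives Rocchio3):

    from collections import defaultdict

    def estimate_rocchio(d_init_models, query_mle, lam):
        """Rocchio sketch: an unweighted centroid of the feedback-document models,
        optionally anchored to the query model (lam = 0 reduces to Rocchio1)."""
        centroid = defaultdict(float)
        n = len(d_init_models)
        for model in d_init_models.values():
            for term, p in model.items():
                centroid[term] += p / n
        vocab = set(centroid) | set(query_mle)
        return {w: lam * query_mle.get(w, 0.0) + (1 - lam) * centroid.get(w, 0.0)
                for w in vocab}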
Comparison of Query Expansion Models
As can be seen in the table below, there are two important properties by
which the above-described query expansion models could be characterized:
• Does the model weigh documents with respect to p(d|q)?
• Does the model perform an interpolation with the original query model,
i.e., does it perform pre-retrieval query-anchoring?
Model      Weighs documents by p(d|q)   Interpolates with the original query   Definition
RM1        Yes                          No                                     \sum_{d \in D_{init}} p_d(w)\, p(d|q)
RM3        Yes                          Yes                                    \lambda p^{MLE}_q(w) + (1-\lambda) \sum_{d \in D_{init}} p_d(w)\, p(d|q)
Rocchio1   No                           No                                     \frac{1}{n} \sum_{d \in D_{init}} p_d(w)
Rocchio3   No                           Yes                                    \lambda p^{MLE}_q(w) + (1-\lambda) \cdot \frac{1}{n} \sum_{d \in D_{init}} p_d(w)
We will later show that these properties have a major impact on the
overall performance.
3.3 The Performance Robustness Problem
Naturally, there are several inherent problems in pseudo-feedback-based query
expansion. The first is that not all (and in many cases, only very few) of
the documents “treated” as relevant (i.e., those in Dinit) are indeed relevant.
This leads to the newly formed “query model” not necessarily reflecting the
information need underlying the original query.
Furthermore, the initially retrieved document list serving as the feedback
set Dinit, may not reveal all query-related aspects [9], and therefore, the new
query model might not fully represent the original information need. Indeed,
while on average, pseudo-feedback-based query expansion methods improve
retrieval effectiveness over that of retrieval using the original query, there are
numerous queries for which this is not true. For these queries the retrieval
based on the original query yields substantially better retrieval effective-
ness than that yielded by using an expanded query form [20, 17], as can be
seen in Figure 1. This is known as the performance robustness problem of
pseudo-feedback-based query expansion retrieval methods. The performance
robustness problem is mainly caused by query drift — the shift in “intention”
from the original query to its expanded form [57].
As can be seen in Figure 1, RM1 and RM3 exhibit different degrees of the
performance robustness problem. Not only does a larger number of queries
suffer from performance degradation under RM1 when using the expanded form,
but the degradation is also much more severe. This difference is mainly
due to the fact that RM3 performs query-anchoring at the model level (see
Section 3.2). This difference also influences the results in the
experimental section and will be further discussed there (see Chapter 6).
[Figure 1: two per-query plots, "RM1 query drift - ROBUST corpus queries 301-350" and "RM3 query drift - ROBUST corpus queries 301-350"; x-axis: queries, y-axis: AP difference.]
Figure 1: The difference between the AP — average precision, an evaluation
measurement of the general quality of ranking — per query of Dinit (initial ranking)
and that of the expansion-based models (RM1 or RM3) over the ROBUST corpus
(queries 301-350).
4 Query Drift Prevention
4.1 Improving Robustness Using Fusion
Retrieval performance can be significantly improved using data fusion, a
combination of retrieval methods, query representations or document repre-
sentations [53, 21, 27, 43, 80, 42].
The goal of this work is to address the performance robustness problem
of pseudo-feedback-based query expansion retrieval, which was described and
exemplified in Section 3.3, via the potential prevention (amelioration) of
query drift.
Using data fusion can potentially prevent query drift. Our motivation
for using fusion comes from a few key assumptions. We hypothesize that
documents that are highly ranked in both retrieved lists (the initial list and
the expansion-based list) are good candidates for being relevant since they
constitute a “good match” to both forms of the presumed information need.
A document that is ranked high by the initial retrieval (Dinit) can be
assumed to have a high surface level similarity to the original query, while
performing query expansion (PF(Dinit)) can add aspects that were not in the
original query, but may be relevant to the information need and may improve
the retrieval in certain cases (e.g., short queries, the vocabulary mismatch
problem etc.; see Chapter 3). Using this expansion however, may cause
a shift in the intension from the original query — query drift. A document
that is ranked high by both the initial retrieval and the expansion is assumed
(potentially) not to suffer from query drift.
Another key assumption is based on the statement that documents that
are retrieved using a variety of query representations have a high chance of
being relevant [7]. In our case the original query and its expanded
form are the different query representations.
A major advantage of using fusion for the query drift prevention task
is its efficiency. This efficiency is due to a minimal overhead, stemming
from the fact that no additional retrieval executions are needed besides
the initial retrieval and the expansion-based one, which are an
integral part of an expansion-based method.
Despite the above advantages that show promise in this direction, using
fusion is not without its disadvantages. Following the fusion concept, docu-
ments that appear only in the expansion-based list are downgraded, which can
contradict the essence of using query expansion to begin with.
4.2 Score-Based Fusion Methods
Similarity measures (e.g., cosine similarity) are widely used in the literature for
combining retrieval scores, which act as multiple pieces of evidence regarding the
relationships between the query and the document [67, 27, 30, 60].
4.2.1 Symmetric Fusion Methods
The following retrieval methods essentially operate on Dinit ∪ PF(Dinit).
These methods perform symmetric fusion, where a document’s appearance
in the final retrieved list is due to its appearance in either or both lists (i.e.,
the initial retrieved list and the expansion-based one).
CombMNZ: The CombMNZ method, which was introduced by Fox and
Shaw in the data fusion framework [27, 43], rewards documents that are
ranked high in both Dinit and PF(Dinit) (for a statement s, δ[s] = 1 if s is
true and 0 otherwise):

    Score_{CombMNZ}(d) \stackrel{def}{=} \left(\delta[d \in D_{init}] + \delta[d \in PF(D_{init})]\right) \cdot \left(\frac{Score_{init}(d)}{\sum_{d' \in D_{init}} Score_{init}(d')} + \frac{Score_{pf}(d)}{\sum_{d' \in PF(D_{init})} Score_{pf}(d')}\right).

Note that a document that belongs to only one of the two lists (Dinit and
PF(Dinit)) can still be among the highest ranked documents.
17
The Interpolation algorithm: The Interpolation algorithm, which
was used for preventing query drift in cluster-based retrieval (e.g., [36]),
differentially weights the initial score and the pseudo-feedback-based score
using an interpolation parameter λ:

    Score_{Interpolation}(d) \stackrel{def}{=} \lambda\, \delta[d \in D_{init}]\, \frac{Score_{init}(d)}{\sum_{d' \in D_{init}} Score_{init}(d')} + (1 - \lambda)\, \delta[d \in PF(D_{init})]\, \frac{Score_{pf}(d)}{\sum_{d' \in PF(D_{init})} Score_{pf}(d')}.
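Both symmetric fusion methods follow directly from their definitions; the sketch below assumes the two retrieved lists are given as dictionaries mapping document ids to retrieval scores, and the names are illustrative.

    def normalize(scores):
        """Sum-normalize a score dict so that scores from different lists are comparable."""
        total = sum(scores.values()) or 1.0
        return {d: s / total for d, s in scores.items()}

    def comb_mnz(score_init, score_pf):
        """CombMNZ: (number of lists containing d) * (sum of d's normalized scores)."""
        init_n, pf_n = normalize(score_init), normalize(score_pf)
        fused = {}
        for d in set(init_n) | set(pf_n):
            hits = (d in init_n) + (d in pf_n)
            fused[d] = hits * (init_n.get(d, 0.0) + pf_n.get(d, 0.0))
        return fused

    def interpolate(score_init, score_pf, lam):
        """Interpolation: lam * normalized initial score + (1 - lam) * normalized
        pseudo-feedback score; lam controls the degree of query-anchoring."""
        init_n, pf_n = normalize(score_init), normalize(score_pf)
        return {d: lam * init_n.get(d, 0.0) + (1 - lam) * pf_n.get(d, 0.0)
                for d in set(init_n) | set(pf_n)}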
4.2.2 Re-ordering Methods
Re-ordering methods, unlike the fusion methods described above, are asym-
metric. Three methods for re-ordering top documents will be used in this
work:
Rerank: The Rerank method (e.g., Kurland and Lee [36]) re-orders the
(top) pseudo-feedback-based retrieval results by the initial scores of docu-
ments, in order to prevent query drift that may be caused by the expansion.
The method’s asymmetry can be seen in the fact that only the (top) doc-
uments in the pseudo-feedback-based retrieval results will be re-ranked and
will appear in the method’s final result list. We score document d with
respect to the Rerank model by:
    Score_{Rerank}(d) \stackrel{def}{=} \delta[d \in PF(D_{init})]\, Score_{init}(d).
Rev_rerank: The Rev_rerank method re-orders the (top) initial re-
trieval results by the pseudo-feedback-based scores of documents, in order
to take into account the additional aspects that come with the expansion,
while minimizing query drift. Unlike other methods, this method is not a
drift-prevention method. It is a mirror method of the Rerank method.
    Score_{Rev\_rerank}(d) \stackrel{def}{=} \delta[d \in D_{init}]\, Score_{pf}(d).
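A minimal sketch of the two mirror methods, assuming each retrieved list is an ordered list of document ids and the scores are given as dictionaries (illustrative names):

    def rerank(pf_list, score_init, k=None):
        """Rerank: re-order the (top-k) pseudo-feedback results by the initial,
        query-based scores; documents outside PF(D_init) are discarded."""
        docs = list(pf_list) if k is None else list(pf_list)[:k]
        return sorted(docs, key=lambda d: score_init.get(d, 0.0), reverse=True)

    def rev_rerank(init_list, score_pf, k=None):
        """Rev_rerank: the mirror image, re-ordering the (top-k) initial results
        by the pseudo-feedback scores."""
        docs = list(init_list) if k is None else list(init_list)[:k]
        return sorted(docs, key=lambda d: score_pf.get(d, 0.0), reverse=True)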
Centroid_rerank: The Centroid_rerank method re-orders the (top)
pseudo-feedback-based retrieval results, created using RM1, by the centroid
it creates from the initial documents using the Rocchio1 model.
    Score_{Centroid\_rerank}(d) \stackrel{def}{=} \frac{1}{n} \sum_{d' \in D_{init}} p_{d'}(w)

where n is the size of Dinit. Both RM1 and Rocchio1 can be viewed as
methods that create a centroid, and the main difference between these cen-
troids is that RM1 uses a weighted centroid (weights are p(d|q), see Equa-
tion (3)), while Rocchio1 uses a non-weighted centroid. The Centroid_rerank
method utilizes the cluster hypothesis (see Section 5.1), which implies that
relevant documents tend to have a higher similarity to each other than to
non-relevant documents [33, 32]. The centroid is created from the initial
documents which are query-oriented and is assumed to represent a relevant
document. Hence, based on the hypothesis, we believe that re-ordering
of the documents using the centroid would potentially rank relevant documents
high, since they will have a higher similarity to the centroid than non-
relevant documents will.
4.3 Rank-Based Fusion Methods
Lee [43] claimed that using rank(ing) for data fusion can sometimes achieve
better results than using retrieval scores, since the latter has the effect of
weighting individual retrieved lists without considering their overall perfor-
mance. Lee found that using rank gives better retrieval effectiveness than
using similarity if the lists in the combination generate quite different curves
of rank over similarity. Following Lee we define:
    S_{rankSim}(rank(d)) \stackrel{def}{=} 1 - \frac{rank(d) - 1}{N_d}

where Nd is the number of retrieved documents, and rank(d) is the rank at
which d is positioned: rank(·) : D → {1, ..., Nd}.
In addition to the above definition of CombMNZ and Interpolation using
retrieval scores, we will test the effectiveness and robustness of those methods
using this ranking as well.
Borda Rank: The Borda count, first introduced in 1770 by Jean-Charles de
Borda [22], is originally a voting method in which each voter gives a ranking
of all possible alternatives; each alternative is awarded points according to its
position in each ranking, and the alternative with the highest total wins the election.
Based on this idea, scoring using the Borda rank is defined as:

    S_{bordaRank}(d) \stackrel{def}{=} \sum_i \left(N_d - rank_i(d)\right)

where rank_i(d) : D → {1, ..., Nd} is the rank of d in the i-th retrieved list.
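These rank-based scores can then be plugged into CombMNZ or Interpolation in place of the normalized retrieval scores; a short sketch with illustrative names, where each list is ordered from best to worst:

    def rank_sim(ranked_docs):
        """Lee-style rank score: S(d) = 1 - (rank(d) - 1) / N_d, so rank 1 gets 1.0."""
        n = len(ranked_docs)
        return {d: 1.0 - i / n for i, d in enumerate(ranked_docs)}

    def borda(ranked_lists):
        """Borda-style score: each list contributes N_d - rank_i(d) points; documents
        missing from a list contribute nothing in this sketch."""
        scores = {}
        for ranked_docs in ranked_lists:
            n = len(ranked_docs)
            for i, d in enumerate(ranked_docs, start=1):
                scores[d] = scores.get(d, 0.0) + (n - i)
        return scores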
5 Related Work
5.1 Fusion
Retrieval performance can be significantly improved using data fusion, i.e.,
combining retrieval methods and/or query representations and/or document
representations [25, 27, 43, 42, 80].
One common explanation for this improvement states that the search
process is complex, with many degrees of uncertainty, and therefore any in-
dividual result list (even a good one) covers only part of the potential
result space of the information need. In line with this assumption, studies show
that different retrieval models, which achieve similar performance, may re-
trieve different sets of documents in response to the same queries [30]. Using
a variety of methods (results) will utilize different aspects of the search space
and hence will return more relevant results [67, 60, 46]. Moreover, the more
sources of evidence are available regarding the relationships between a query
and a document, the more accurate the judgment of the probability of rele-
vance of a document to a query will be [67].
It was found that there is a larger overlap among relevant documents
than that among non-relevant documents [43, 5, 60]. Namely, relevant doc-
uments tend to be retrieved using different methods more than non-relevant
documents tend to, and hence the odds of a document being judged relevant
may be proportional to the number of times it appears in the retrieved sets
[73, 43]. The CombMNZ method (see Section 4.2.1) ranks high co-occurring
documents, corresponding to the observations above. However, it has been
shown that in some cases, the lists to be fused contain different relevant
documents, and hence in such cases fusion methods may not be as effective
[21, 28, 75, 6].
While the CombMNZ method rewards co-occurring documents, following
the observations above, our Centroid_rerank method focuses on similari-
ties between documents, adopting the cluster hypothesis (for details see Section 4.2.2).
The cluster hypothesis implies that relevant documents tend to
have a higher similarity to each other than to non-relevant documents [33, 32].
Many retrieval methods have adopted this hypothesis, such as cluster-based
retrieval model [33, 19, 36, 50], cluster-based re-ranking [46, 45] and cluster-
based pseudo-relevance-feedback [51, 37, 44].
Data fusion based on query representations is one line of research in the
data fusion field. The query represents the searcher’s information need, and
tends to be short (incomplete representation of this need). Hence, performing
data fusion based on query representation may capture more pieces of
evidence about the true (whole) information need, and improve retrieval
performance [7, 64] . The work presented in this thesis essentially uses query
representations fusion. Specifically, we use fusion based on the original query
and its expanded form (see Chapter 4).
Studies (e.g., [7, 64]) show that a combination of query representations
can significantly improve retrieval performance, and that the overall perfor-
mance depends on the individual queries’ performance. It was also found that
bad query representations (i.e., individually achieving low performance),
when combined with better query representations could hurt the overall re-
trieval performance [7].
5.2 Improving the Performance Robustness of Pseudo-
Feedback-Based Retrieval
Approaches for improving the robustness of pseudo-feedback-based meth-
ods (see Section 3.3) have mainly been based on selecting (and weighting)
documents from the initial search upon which pseudo-feedback is performed
[8, 58, 48, 71, 79, 17]; and on selecting and weighting terms from these docu-
ments [57, 63, 15, 54, 8, 14] for defining a new query model. Such approaches
can potentially help to improve the performance of our methods that utilize
information from a document list retrieved in response to an expanded form
of the query.
It has been shown [29, 58, 84] that given a specific retrieval environment
(e.g., retrieval method, query representation, etc.), there exists an optimal
number of documents to be used for expansion (i.e., adding documents will
hurt performance as will subtracting documents). However, this number
varies with the retrieval environment. No explicit relationship was found
between query features (e.g., the query length) or corpus features (e.g., the
number of relevant documents) and the optimal number of documents for
feedback.
In contrast to the line of research which relies on the top (consecutive)
documents from the initial list as the basis for performing expansion (e.g.,
[13, 87]), other methods use only a (non consecutive) part of the (top) initial
list. In this approach some documents in the initially ranked list may be
skipped. This approach is based on one’s willingness to potentially detect
the relevant documents and skip the non-relevant ones, and on the phe-
nomenon that some truly relevant documents may hurt performance when
used for relevance feedback (a.k.a. poison pills) [29, 58, 84].
One suggested approach for choosing only part of the initial list is based
on clustering. Examples for this approach are removing singleton document
clusters from the initial ranked lists [51], and using the clusters that best
match the query for the expansion [11]. Cluster-based methods for pseudo-
relevance-feedback were also proposed for re-sampling documents [44] (which
we will compare our methods to; see Section 6.2.1.4), for term expansion
[12, 89, 52, 10] and for iterative pseudo-query processing using cluster-based
language models induced from both documents and clusters [37].
The typical source of terms for query expansion is the set of all/selected
terms from the documents that appear in the initially retrieved list. Dif-
ferent methods aim to select candidate terms for query expansion by using
additional information, such as the results of previous similar queries [26],
information induced from passages [31, 87, 88] or from document summaries
[39], the term distribution in the feedback documents [78, 57, 74] or the comparison
between the term distribution in the feedback documents and that
in the whole document collection [65, 66, 15]. In contrast to the approaches
proposed above, a recent study [14] has shown that helpful expansion terms
cannot be distinguished from harmful ones merely based on those terms’
distributions.
Another line of research focuses on predicting whether a given expanded
form of the query will be more effective for retrieval than the original query
[3, 20], or on which expansion form will perform best from a set of candidates
[85]. One such prediction method uses the overlap between the documents
in the initial list and the expansion-based list [20]. This method and some of
our methods use the same assumption that documents appearing in both the
initial list and the expansion-based list have a strong potential to be relevant
to the query, unlike documents that appear only in the expansion-based list,
which may result from query drift — the shift in “intention” from the original
query to its expanded form [57] (for more details see Section 3.3). Cronen-
Townsend et al. [20] and Winaver et al. [85] tackle the query-drift problem
from two related angles. Specifically, minimizing query drift serves as a prin-
ciple for setting a procedure by which one decides whether to employ query
expansion [20], or which expansion form to choose [85]. In contrast to these
approaches that explicitly quantify query drift by measuring the distance
between retrieved lists, our approach — which is also based on minimizing
query drift — uses a fusion-based approach for implicitly ameliorating the
drift. Furthermore, the task we tackle here is improving the performance of
a specific expansion form via drift-minimization rather than selecting such a
form from a set of candidates [85], or deciding whether to employ this form
in the first place [20]. Our approach is also supported by a recent study [59]
which states that fusion achieves better performance than selecting a single
expansion-based form.
Another approach that aims to tackle the query-drift problem suggests
integrating multiple expansion models [59]. This is done by using estimates
of their faithfulness to the presumed information need (similar to the ap-
proach used for selecting a single relevance model [85]). Resembling this
approach, we address the uncertainty with respect to the information need
by using multiple query representations. In contrast to this approach, which
integrates multiple expansion models using estimates of their faithfulness,
we integrate in our fusion methods only one expansion model and the initial
list using simple/no weights, which keeps the methods’ efficiency high by
maintaining a minimal overhead.
Another attempt to improve the robustness of pseudo-feedback-based
query expansion [61] proposes re-ranking documents from the expansion-
based list using the initial list, similar to our Rerank method (see Section
4.2.2), with the exception of re-ranking being based on similarity between
documents in the lists. This follows the cluster hypothesis and is conceptu-
ally similar to our Centroid_rerank method (see Section 4.2.2). Another
method that proposes re-ranking based on both similarity and co-occurrence
of documents in the lists [61] shows the merits of using co-occurrence in ad-
dition to similarity, and supports one of the main ideas that were proposed
in our work, which is that the co-occurrence of documents in the initial and
the expansion-based list is important for the success of the robustness im-
provement of pseudo-feedback-based query expansion.
While we present in this work a post-retrieval query-anchoring approach,
which performs query-anchoring via fusion of the lists retrieved using the orig-
inal query and its expanded form, some previous work suggests to anchor the
creation of the new query model to the original query (pre-anchoring) using
two main methods, interpolation and differential weighting. Zhai and Laf-
ferty [90] and Abdul-Jaleel et al. [1] use query-anchoring at the model level.
Specifically, the idea is to perform a linear interpolation of the expansion
model with that of the original query, in order to emphasize the information
in the original query. Another approach suggests bringing back the original
query into the relevance model by treating it as a short, special document,
in addition to a number of the (top) initial ranked documents [47].
Most of the methods introduced in this chapter can be viewed as comple-
mentary to ours, since they aim to improve the quality of the list retrieved
by an expanded form of the query; such a list is an important “ingredient”
in our methods, and improving its “quality” may contribute to the overall
effectiveness of our approach.
6 Experiments
6.1 Experimental Setup
6.1.1 Corpora
We use the TREC corpora from Table 1 for experiments. Topics’ titles serve
as queries; only queries with at least one relevant document are considered.
Table 1: TREC corpora used for experiments
Corpus Queries Disks
TREC1-3 51-200 1-3
ROBUST 301-450, 601-700 4,5
WSJ 151-200 1-2
SJMN 51-150 3
AP 51-150 1-3
We tokenize the data, apply Porter stemming, and remove INQUERY
stopwords [2], via the Lemur toolkit (www.lemurproject.org), which is also
used for retrieval.
6.1.2 Evaluation Metrics
To evaluate retrieval performance, we use two widely accepted metrics:
• Mean average precision at 1000 (MAP), which evaluates the general
quality of ranking methods [82].
• Precision of the top 5 documents (P@5), which measures the ability of
retrieval methods to position relevant documents at the highest ranks
of the retrieved results.
We determine statistically significant differences in performance using Wilcoxon’s
two-tailed test at a confidence level of 95%.
In addition, we present for each retrieval method the percentage of queries
(in each benchmark) for which the performance (MAP or P@5) is worse
than that of the initial ranking from which Dinit is derived. (We denote this
percentage by “Init”.) The lower “Init” is, the more robust we consider
the method at hand.
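As an illustration, the "Init" measure can be computed from per-query evaluation scores as sketched below; the per-query AP values here are made up.

    def init_robustness(per_query_init, per_query_method):
        """'Init': the percentage of queries for which the method's per-query score
        (AP or P@5) is worse than that of the initial ranking."""
        worse = sum(1 for q in per_query_init
                    if per_query_method[q] < per_query_init[q])
        return 100.0 * worse / len(per_query_init)

    ap_init = {"301": 0.21, "302": 0.35, "303": 0.10}
    ap_rm3 = {"301": 0.30, "302": 0.28, "303": 0.12}
    print(init_robustness(ap_init, ap_rm3))  # 33.3...: one of the three queries got worse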
6.1.3 Parameter Tuning
We set Dinit to the 1000 documents in the corpus that yield the highest
initial ranking score, Scoreinit(d). Here and in what follows, we set µ — the docu-
ment language model smoothing parameter (refer back to Section 2.3) — to
1000, following previous recommendations [91]. To create a set PF(Dinit)
of 1000 documents, the values of the expansion models’ free parameters
(refer back to Section 3) were selected from the following sets [83]: n ∈
{25, 50, 75, 100, 500, 1000} and β ∈ {25, 50, 75, 100, 250, 500, 1000}. The α
parameter used in RM1 and RM3 is chosen from {0, 0.1, 0.2, 0.3}. The pa-
rameter λ that controls query-anchoring in the RM3 algorithms is chosen
from {0.1, . . . , 0.9}.
As the performance of pseudo-feedback-based query expansion methods
is considerably affected by the values of free parameters, and as there is
a performance-robustness trade-off, we have taken the following design de-
cision: we apply our methods to four different optimized settings of these
parameters, so as to study the effectiveness of our approach:
• Best_MAP: the values of the free parameters of the expansion-based
models are selected so as to optimize MAP performance of PF(Dinit).
• Best_P@5: the values of the free parameters of the expansion-based
models are selected so as to optimize P@5 performance of PF(Dinit).
• Robust_MAP: the values of the free parameters of the expansion-based
models are selected so as to optimize MAP robustness of PF(Dinit) (i.e.,
minimize Init of MAP).
• Robust_P@5: the values of the free parameters of the expansion-based
models are selected so as to optimize P@5 robustness of PF(Dinit) (i.e.,
minimize Init of P@5).
The only free parameter used in our methods is a parameter which con-
trols query-anchoring in the Interpolation model and which was chosen from
{0.1, . . . , 0.9}, to optimize each of the 4 settings (i.e., when applied on Best_MAP
and Best_P@5 it is chosen to optimize the MAP/P@5 performance respec-
tively, while when applied on Robust_MAP and Robust_P@5 it is chosen
to optimize the MAP/P@5 robustness, respectively).
6.2 Experimental Results
6.2.1 Score-Based Fusion Methods
6.2.1.1 Best_MAP and Best_P@5 Settings
Relevance models Table 2 presents the performance numbers of score-
based fusion methods applied on the Best_MAP and the Best_P@5 settings
of the relevance models RM1 and RM3. Analyzing the MAP results leads
to a number of key observations. One observation is that all fusion-based
methods yield MAP performance that is always better — to a statistically
significant degree — than that of the initial ranking that utilizes only the
original query. As can be seen, the Interpolation algorithm yields the best
MAP performance among the fusion-based methods, but it incorporates a
free parameter while the other two methods (CombMNZ and Rerank) do
not.
Exploring the trade-off between MAP and robustness, one can see that
in general the better the robustness, the lower the MAP. As can also be
seen in Table 2 and in Figure 2, all three fusion-based methods are more
robust (refer to the Init measure) than the model they incorporate (RM1
or RM3). In addition, the methods based on RM3 are generally more robust
than their corresponding methods that are based on RM1, but have lower
MAP. This calls for further research as these methods are employed upon
the RM3 model that has better MAP than RM1. One explanation for this
observation can be that the methods based on RM3 perform much stronger
query-anchoring than the methods based on RM1 and hence they have better
robustness and lower MAP than the corresponding methods based on RM1.
In terms of MAP, Rerank [RM3] is the most robust among the tested methods
in Table 2.
Another observation that we make based on Table 2 is that CombMNZ
[RM1], which performs query-anchoring via fusion of retrieved results, and
Rerank [RM1] are more robust than RM3 that performs query-anchoring at
the relevance model level. RM3 however, is more effective in terms of MAP,
to a statistically significant degree.
In terms of P@5, as can be seen in Table 2, the phenomenon observed
for Best_MAP, where all three fusion-based methods are more robust than
the model they incorporate (RM1 or RM3), holds for Best_P@5 as well.
For P@5, the very same trade-off between performance (in this case - P@5)
and robustness is also observed. It can be seen that in general the better
the robustness, the lower the P@5. As can also be seen, the Interpolation
algorithm performed on RM1 achieves a good trade-off, where a relatively
large improvement of robustness (compared to RM1) comes at a relatively
small price with respect to P@5 performance. Another interesting result is
that for both RM1 and RM3, the use of Rerank yields the most robust score
(i.e., has the lowest Init) out of all three fusion-based methods, and the
incorporated model itself (RM1 or RM3).
Table 2: Score-based fusion methods applied for Best_MAP and Best_P@5 on RM1 and RM3
(each corpus block lists MAP, Init, P@5, Init, in that order)

Method               | TREC1-3                | ROBUST                 | WSJ                    | SJMN                   | AP
Init. rank           | 14.9   -    37.9  -    | 25.0   -    47.8  -    | 27.8   -    51.2  -    | 18.9   -    33.0  -    | 22.2   -    45.5  -
RM1                  | 19.2i  38.7 44.4i 25.3 | 27.5i  45.4 49.2  23.7 | 33.2i  34.0 56.4  22.0 | 24.1i  37.0 39.6i 15.0 | 28.5i  38.4 52.1i 21.2
Interpolation [RM1]  | 19.5i  31.3 41.1  15.3 | 29.3ir 34.9 49.5  12.4 | 34.0i  26.0 54.4i 12.0 | 23.6i  27.0 36.6ir 11.0| 28.6i  31.3 48.1  13.1
CombMNZ [RM1]        | 18.2i  24.0 38.9r 4.7  | 28.0i  28.5 48.5  6.0  | 31.1i  14.0 51.2  4.0  | 21.6i  20.0 35.0ir 3.0 | 26.9i  21.2 45.5r 3.0
Rerank [RM1]         | 17.5ir 27.3 38.9r 1.3  | 26.3i  30.9 48.2  2.8  | 29.8ir 22.0 51.2  4.0  | 20.4ir 16.0 33.6r 1.0  | 25.9ir 20.2 45.3r 1.0
RM3                  | 20.0i  28.7 44.8i 24.0 | 30.0i  28.1 50.6  14.5 | 34.8i  20.0 58i   14.0 | 24.6i  29.0 39.6i 15.0 | 29.1i  28.3 52.1i 17.2
Interpolation [RM3]  | 19.6ir 22.7 42.7i 20.7 | 29.3ir 27.7 49.2  12.4 | 33.8ir 22.0 56.4i 14.0 | 23.9ir 24.0 38.8i 13.0 | 28.7i  27.3 48.9r 15.2
CombMNZ [RM3]        | 17.9ir 16.7 39.7ir 6.0 | 27.1ir 19.3 47.8r 7.2  | 30.7ir 18.0 54.4  12.0 | 21.6ir 23.0 34.6  12.0 | 26.5ir 16.2 46.3r 4.0
Rerank [RM3]         | 16.9ir 22.7 38.4r 0.0  | 25.5ir 15.3 47.8  0.0  | 28.4ir 14.0 51.2r 0.0  | 19.9ir 11.0 33.6  1.0  | 25.1ir 12.1 45.5r 0.0

Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP and Best_P@5
settings of the relevance models RM1 and RM3, and the score-based fusion methods; i and r indicate statistically
significant MAP or P@5 differences with the initial ranking and the incorporated model (RM1 or RM3), respectively.
Table 3: Score-based fusion methods applied for Best_MAP and Best_P@5 on Rocchio1 and Rocchio3
(each corpus block lists MAP, Init, P@5, Init, in that order)

Method                   | TREC1-3                | ROBUST                 | WSJ                    | SJMN                   | AP
Init. rank               | 14.9   -    37.9  -    | 25.0   -    47.8  -    | 27.8   -    51.2  -    | 18.9   -    33.0  -    | 22.2   -    45.5  -
Rocchio1                 | 19.2i  40.7 43.1i 21.3 | 26.9   51.8 45.7  34.9 | 31.7i  42.0 55.6  22.0 | 23.2i  45.0 39.2  25.0 | 29.4i  32.3 52.3i 21.2
Interpolation [Rocchio1] | 19.4i  34.0 42.1  20.7 | 29.1ir 38.6 48.2  6.8  | 33.5ir 30.0 55.6  18.0 | 23.0i  32.0 38.0i 12.0 | 29.4i  21.2 51.1i 15.2
CombMNZ [Rocchio1]       | 18.2i  24.0 40.3  15.3 | 27.7i  26.5 48.2r 8.0  | 32.4i  22.0 52.0  14.0 | 21.6i  25.0 35.8i 1.0  | 27.6ir 18.2 48.1i 3.0
Rerank [Rocchio1]        | 17.4ir 30.0 38.5r 0.7  | 26.3i  32.1 48.0  2.4  | 29.7i  20.0 50.8  8.0  | 20.1i  27.0 34.0  1.0  | 26.2ir 17.2 45.9r 0.0
Rocchio3                 | 19.9i  33.3 43.1i 24.7 | 29.2i  36.5 49.3  10.0 | 34.0i  26.0 59.6  16.0 | 24.3i  31.0 40.4i 17.0 | 29.8i  23.2 52.9i 20.2
Interpolation [Rocchio3] | 19.5i  30.0 42.4  22.7 | 28.5ir 31.3 48.4  8.8  | 33.2i  20.0 56.0r 16.0 | 23.8i  24.0 38.4i 15.0 | 29.0ir 18.2 50.1i 11.1
CombMNZ [Rocchio3]       | 17.9ir 22.0 39.3  8.7  | 26.7ir 23.3 48.0  4.4  | 30.3ir 18.0 52.8r 14.0 | 21.6ir 18.0 35.8ir 9.0 | 26.7ir 14.1 45.7r 4.0
Rerank [Rocchio3]        | 16.9ir 22.7 38.3  0.0  | 25.6ir 21.7 47.8  0.0  | 28.3ir 12.0 51.2  2.0  | 20.0ir 7.0  33.0r 0.0  | 25.2ir 10.1 45.7r 0.0

Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP and Best_P@5
settings of the models Rocchio1 and Rocchio3, and the score-based fusion methods; i and r indicate statistically
significant MAP or P@5 differences with the initial ranking and the incorporated model (Rocchio1 or Rocchio3), respectively.
Figure 2: Score-based fusion methods applied for Best_MAP on RM1 and RM3
- a graph representation of Table 2.
[Four panels: "Score-based fusion methods applied for RM1 - MAP", "Score-based fusion methods applied for RM1 - Robustness", "Score-based fusion methods applied for RM3 - MAP", "Score-based fusion methods applied for RM3 - Robustness"; x-axis: corpus (TREC1-3, ROBUST, WSJ, SJMN, AP), y-axis: MAP or Init; series: RM1/RM3, Interpolation, CombMNZ, Rerank.]
Figure 3: Score-based fusion methods applied for Best_MAP on Rocchio1 and
Rocchio3 - a graph representation of Table 3
[Four panels: "Score-based fusion methods applied for Rocchio1 - MAP", "Score-based fusion methods applied for Rocchio1 - Robustness", "Score-based fusion methods applied for Rocchio3 - MAP", "Score-based fusion methods applied for Rocchio3 - Robustness"; x-axis: corpus (TREC1-3, ROBUST, WSJ, SJMN, AP), y-axis: MAP or Init; series: Rocchio1/Rocchio3, Interpolation, CombMNZ, Rerank.]
32
Rocchio’s models. So far we have focused on RM1 and RM3; we now turn to analyze the performance of our methods when applied on Rocchio1 and Rocchio3. As previously discussed in Chapter 3, the former pair (RM1 and RM3) weighs documents with respect to p(d|q), while the latter pair does not. We see in Table 3 performance patterns similar to those presented in Table 2. One such pattern is that the MAP performance of the initial ranking, which utilizes only the original query, is always lower to a statistically significant degree than that of all score-based fusion methods. Another shared pattern is that all three fusion-based methods are more MAP-robust than the model they incorporate (in this case, Rocchio1 or Rocchio3). The trade-off between MAP and robustness is also maintained.
We can also see in Table 3 that the Interpolation algorithm yields the best MAP performance among the score-based fusion methods for both Rocchio1 and Rocchio3. Moreover, Interpolation [Rocchio1] yields better MAP performance than Rocchio1 itself for most corpora, and it is always more robust.
In terms of MAP, we can see that CombMNZ [Rocchio1] and Rerank [Rocchio1], which perform query-anchoring via fusion of the retrieved results, are more robust than Rocchio3, which performs query-anchoring at the model level, but they yield lower MAP scores to a statistically significant degree. We can also note that the methods based on Rocchio3 are always more robust than the corresponding methods based on Rocchio1, a model that does not perform interpolation with the original query model. In terms of robustness, Rerank [Rocchio3] achieves the best results for most corpora.
Most observations that were made for P@5 in Table 2 apply to Table 3 as well. Among these are the trade-off between performance and robustness, the observation that all three fusion-based methods are more robust than the model they incorporate, and the fact that Rerank is the most robust among the three fusion-based methods and the incorporated model itself.
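To make the score-based fusion operations discussed in this subsection concrete, the following Python sketch illustrates the two symmetric fusion methods. It is a simplified illustration rather than the exact implementation used in our experiments: it assumes per-list min-max score normalization, treats a document that is absent from a list as contributing a zero score, and uses illustrative function names and an arbitrary interpolation weight.

    def minmax_normalize(scores):
        # Min-max normalize a {doc_id: score} map into [0, 1].
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    def interpolation(init_scores, exp_scores, lam=0.5):
        # Linear interpolation of the initial (query-only) and expansion-based
        # retrieval scores; lam controls the degree of query-anchoring.
        a, b = minmax_normalize(init_scores), minmax_normalize(exp_scores)
        return {d: lam * a.get(d, 0.0) + (1.0 - lam) * b.get(d, 0.0)
                for d in set(a) | set(b)}

    def combmnz(init_scores, exp_scores):
        # CombMNZ: sum of normalized scores multiplied by the number of lists
        # (1 or 2 here) in which the document appears.
        a, b = minmax_normalize(init_scores), minmax_normalize(exp_scores)
        fused = {}
        for d in set(a) | set(b):
            overlap = (d in a) + (d in b)
            fused[d] = overlap * (a.get(d, 0.0) + b.get(d, 0.0))
        return fused

Sorting the fused score map in descending order of score yields the final result list.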
6.2.1.2 Analyzing Robustness Improvement due to Score-based
Fusion Methods
So far we have seen that score-based fusion methods improve robustness when applied on RM1, RM3, Rocchio1 and Rocchio3. In this section we investigate the robustness improvement of these models with respect to their properties; specifically, whether they weigh documents with respect to p(d|q) (RM1 and RM3), and whether they perform query-anchoring at the model level, i.e., pre-retrieval query-anchoring (RM3 and Rocchio3); see Section 3.2.
Analyzing Figures 2 and 3, we see that RM3, which both weighs documents with respect to p(d|q) and performs query-anchoring at the model level, is the most robust expansion-based model, followed by Rocchio3, which only performs query-anchoring, and by RM1, which only weighs documents with respect to p(d|q). Rocchio1, which neither weighs documents with respect to p(d|q) nor performs query-anchoring, is the least robust among the expansion-based models. Another observation based on Figures 2 and 3 is that applying CombMNZ and Interpolation on the expansion-based models improves robustness across all corpora, and the above robustness order is preserved.
Figure 4 and Figure 5 compare the robustness improvement of the different expansion-based models when CombMNZ and Interpolation, respectively, are applied on them. As can be seen, CombMNZ yields a greater improvement over the models than Interpolation does. Another observation is that, for CombMNZ and Interpolation alike, the robustness improvement for models that do not perform query-anchoring at the model level exceeds that for models that do, as could be expected.
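For clarity, the quantities underlying this analysis can be computed along the lines of the following sketch. It assumes that the robustness measure reported in the <Init columns is the percentage of queries whose average precision falls below that of the initial ranking, and that the improvement plotted in Figures 4 and 5 is the relative reduction of that percentage; both the definitions and the names below are stated as assumptions for illustration.

    def pct_hurt(init_ap, method_ap):
        # Percentage of queries for which the method's average precision (AP)
        # is lower than that of the initial, query-only ranking (lower = more robust).
        hurt = sum(1 for q in init_ap if method_ap[q] < init_ap[q])
        return 100.0 * hurt / len(init_ap)

    def robustness_improvement(model_pct, fused_pct):
        # Relative reduction of the hurt-query percentage when a fusion method
        # is applied on top of an expansion-based model.
        return (model_pct - fused_pct) / model_pct if model_pct else 0.0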
Figure 4: Robustness improvement posted by CombMNZ
[Figure: y-axis "% of robustness improvement"; x-axis: Corpus (TREC1-3, ROBUST, WSJ, SJMN, AP, AVERAGE); series: RM1, RM3, Rocchio1, Rocchio3.]
Figure 5: Robustness improvement posted by Interpolation
[Figure: y-axis "% of robustness improvement"; x-axis: Corpus (TREC1-3, ROBUST, WSJ, SJMN, AP, AVERAGE); series: RM1, RM3, Rocchio1, Rocchio3.]
6.2.1.3 The Effect of the Query-Anchoring Parameter Used in
RM3 on the Performance of RM3 and CombMNZ [RM3]
In this section we investigate the effect of the query-anchoring parameter used in RM3 on the performance of RM3 and CombMNZ [RM3]. Figure 6 presents the performance of RM3 and CombMNZ [RM3] where λ, the parameter that controls query-anchoring in the RM3 model (see Section 3.2), is set to a value in {0, 0.1, ..., 1}; increasing λ increases the query-anchoring. As can be seen, up to a specific point, the MAP performance of RM3 usually improves as λ decreases. This point marks the maximum MAP performance, which is usually observed at λ = 0.2 or λ = 0.1. The MAP performance of CombMNZ [RM3] usually improves as λ decreases, and the best MAP performance is usually observed at λ = 0 (i.e., no query-anchoring). For both low and high values of λ, the MAP performance of RM3 and that of CombMNZ [RM3] are similar. In terms of robustness, it can be seen that for most values of λ, CombMNZ [RM3] is more robust than RM3 (but has lower MAP), and the gap between them increases as λ decreases. This growing gap can be attributed to the fact that RM3 suffers a much greater decline in robustness as λ grows smaller. In conclusion, Figure 6 again exhibits the performance-robustness trade-off: robustness improves while MAP performance decreases as the query-anchoring in the RM3 model increases.
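As a reminder of how λ enters the model, the query-anchoring in RM3 is a linear interpolation at the query-model level; in standard notation (a sketch that may differ slightly from the exact estimates defined in Section 3.2):

    p_{\mathrm{RM3}}(w) \;=\; \lambda\, p_{\mathrm{MLE}}(w \mid q) \;+\; (1-\lambda)\, p_{\mathrm{RM1}}(w)

so that λ = 1 falls back to the original query model, and λ = 0 yields RM1 with no query-anchoring, matching the behavior observed in Figure 6.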
Figure 6: RM3 and CombMNZ [RM3] MAP and robustness performance when varying RM3's query-anchoring parameter
[Figure: one panel per corpus (TREC1-3, ROBUST, WSJ, SJMN, AP); x-axis: λ (0 to 1); y-axes: MAP and <Init; series: RM3-MAP, RM3-robustness, RM3-CombMNZ-MAP, RM3-CombMNZ-robustness.]
6.2.1.4 Comparison with a Cluster-Based Method
Table 4 presents a comparison of our methods with a cluster-based resampling method for pseudo-relevance feedback (Clusters) [44]. This method constructs a relevance model using cluster-based document resampling from Dinit, rewarding documents that appear in many overlapping clusters. The values of the cluster-based model's free parameters were selected to optimize MAP performance: the number of closest documents used to create a cluster was selected from the set {5, 10}, and the number of clusters was chosen from the set {10, 20, ..., num_of_docs}, where num_of_docs is the number of documents that yielded optimal MAP performance for RM3. As can be seen in the table, for the majority of corpora (WSJ, SJMN and AP) the cluster-based method achieves the highest MAP, while for TREC1-3 and ROBUST, RM3 has the highest MAP. Our methods are generally more robust than Clusters and RM3 but, following the performance-robustness trade-off, have lower MAP; the differences are statistically significant for CombMNZ and Rerank. Among all tested methods, Rerank [RM3] is the most robust for most corpora but has the lowest MAP.
Table 4: Score-based methods vs. cluster-based method [44] applied for Best_MAP RM3
(Corpus order: TREC1-3 | ROBUST | WSJ | SJMN | AP; sub-columns per corpus: MAP, <Init.)
Init Rank            14.9 -        | 25.0 -        | 27.8 -        | 18.9 -        | 22.2 -
RM3                  20.0i 28.7    | 30.0i 28.1    | 34.8i 20.0    | 24.6i 29.0    | 29.1i 28.3
Interpolation [RM3]  19.6ir 22.7   | 29.3ir 27.7   | 33.8ir 22.0   | 23.9ir 24.0   | 28.7i 27.3
CombMNZ [RM3]        17.9irc 16.7  | 27.1irc 19.3  | 30.7irc 18.0  | 21.6irc 23.0  | 26.5irc 16.2
Rerank [RM3]         16.9irc 22.7  | 25.5irc 15.3  | 28.4irc 14.0  | 19.9irc 11.0  | 25.1irc 12.1
Clusters             19.8i 31.3    | 29.9i 32.9    | 35.0i 26.0    | 25.0i 31.0    | 29.4i 28.3
Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP setting of RM3, the score-based methods applied on RM3 and the cluster-based method presented by Lee et al. (Clusters) [44]. Boldface: best result per sub-column; i, r and c indicate statistically significant MAP differences with the initial ranking, RM3 and Clusters, respectively.
6.2.1.5 Robust_MAP and Robust_P@5 Settings
So far we have investigated the improvement in robustness when our methods were applied on the best-performing expansion-based settings. One might suspect that the robustness improvements observed so far could be attributed to the fact that we worked with performance-optimized settings; thus, we now turn to explore the performance of our methods when applied on robustness-optimized settings. Tables 5 and 6 present the performance of the score-based fusion methods when applied on the most robust expansion-based settings, Robust_MAP and Robust_P@5. As can be seen, both in terms of MAP and P@5, the score-based fusion methods improve robustness not only when applied on the best-performing expansion-based settings (see Figure 2), but also when applied on the most robust expansion-based settings (with the exception of CombMNZ [RM3]). Moreover, for all score-based fusion methods the MAP performance is better than that of the initial list.
We see that in most cases the fusion-based methods applied on a model that performs interpolation with the original query model (RM3 and Rocchio3) are more robust than those applied on a model that does not (RM1 and Rocchio1, respectively), but they also tend to have lower MAP and lower P@5. This finding is in line with the performance-robustness trade-off we demonstrated in previous sections (e.g., Section 6.2.1.1).
We can also see in Table 5 that among the fusion-based methods applied on RM1, Interpolation is the most robust, while Rerank is the most robust among the methods applied on RM3, and indeed in the entire table, as was also the case for the best-performing expansion-based settings (see Figure 2).
Table 5: Score-based fusion methods applied for Robust_MAP and Robust_P@5 on RM1 and RM3
(Corpus order: TREC1-3 | ROBUST | WSJ | SJMN | AP; sub-columns per corpus: MAP, <Init, P@5, <Init.)
Init Rank             14.9 -       37.9 -    | 25.0 -       47.8 -    | 27.8 -       51.2 -    | 18.9 -       33.0 -     | 22.2 -       45.5 -
RM1                   19.1i 37.3   42.0 22.0 | 27.5i 45.4   48.4 23.3 | 33.1i 32.0   56.0 20.0 | 23.6i 35.0   39.6i 15.0 | 28.4i 34.3   50.9 18.2
Interpolation [RM1]   15.5ir 13.3  38.1 0.7  | 25.6i 13.7   47.7 2.0  | 28.9ir 16.0  52.0 0.0  | 19.4ir 16.0  33.4r 0.0  | 23.2ir 9.1   45.5 0.0
CombMNZ [RM1]         18.1i 23.3   38.4 3.3  | 28.0i 28.5   48.3 5.6  | 30.8ir 20.0  51.6 2.0  | 21.5i 22.0   35.0ir 3.0 | 26.8i 21.2   45.1r 4.0
Rerank [RM1]          17.4ir 28.7  38.4 0.7  | 26.3i 30.9   48.2 2.4  | 29.8ir 22.0  51.6 2.0  | 20.4ir 16.0  33.6r 1.0  | 25.8ir 19.2  45.3 1.0
RM3                   15.3i 9.3    38.3 1.3  | 27.5i 20.1   48.0 2.4  | 31.8i 14.0   53.2 2.0  | 19.3i 14.0   34.4i 1.0  | 22.8i 9.1    45.9 1.0
Interpolation [RM3]   14.9ir 7.3   38.0 0.7  | 25.2ir 17.7  48.0 0.4  | 28.4ir 12.0  52.0 0.0  | 19.0ir 9.0   33.4 0.0   | 22.4ir 6.1   45.5 0.0
CombMNZ [RM3]         15.1ir 10.0  37.9 1.3  | 26.0ir 21.7  47.8 1.6  | 29.6ir 18.0  51.6 2.0  | 19.1ir 16.0  33.8 1.0   | 22.5ir 8.1   45.5 1.0
Rerank [RM3]          15.0ir 6.7   37.9 0.0  | 25.1ir 11.6  47.8 0.0  | 28.0ir 8.0   51.2 0.0  | 19.0ir 2.0   33.0r 0.0  | 22.3ir 4.0   45.5 0.0
Performance numbers of the initial ranking that is based on using only the original query, the Robust_MAP and Robust_P@5 settings of the relevance models RM1 and RM3, and the score-based fusion methods. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the incorporated model (RM1 or RM3), respectively.
Table 6: Score-based fusion methods applied for Robust_MAP and Robust_P@5 on Rocchio1 and Rocchio3
(Corpus order: TREC1-3 | ROBUST | WSJ | SJMN | AP; sub-columns per corpus: MAP, <Init, P@5, <Init.)
Init Rank                  14.9 -       37.9 -      | 25.0 -       47.8 -      | 27.8 -       51.2 -    | 18.9 -       33.0 -     | 22.2 -       45.5 -
Rocchio1                   18.6i 36.7   43.1i 21.3  | 26.9 50.6    44.2i 34.1  | 30.4 40.0    55.6 22.0 | 22.3i 36.0   37.6 22.0  | 29.3i 30.3   51.7i 19.2
Interpolation [Rocchio1]   15.7ir 17.3  15.7r 17.3  | 25.5i 17.3   25.5r 17.3  | 30.2i 20.0   30.2 20.0 | 19.6ir 20.0  19.6 20.0  | 23.3ir 10.1  23.3 10.1
CombMNZ [Rocchio1]         18.3i 28.0   18.3 28.0   | 27.6i 28.9   27.6r 28.9  | 30.8i 24.0   30.8 24.0 | 22.1i 24.0   22.1i 24.0 | 27.5ir 18.2  27.5i 18.2
Rerank [Rocchio1]          17.1ir 28.7  17.1r 28.7  | 26.3i 34.9   26.3r 34.9  | 28.6i 34.0   28.6 34.0 | 20.1i 16.0   20.1 16.0  | 26.2ir 17.2  26.2r 17.2
Rocchio3                   15.5i 10.0   38.1 1.3    | 25.2i 22.1   48.0 1.6    | 32.2i 14.0   52.0 0.0  | 19.7i 14.0   34.0i 0.0  | 23.2i 9.1    45.7 0.0
Interpolation [Rocchio3]   15.5ir 10.0  37.9 0.0    | 25.0ir 11.2  48.0 0.4    | 31.2ir 16.0  51.6 0.0  | 19.0i 13.0   34.0i 0.0  | 22.4ir 7.1   45.7 0.0
CombMNZ [Rocchio3]         15.3ir 10.7  37.9 0.7    | 25.1ir 17.7  48.0 0.4    | 29.2ir 16.0  51.6 0.0  | 19.3i 16.0   33.6 0.0   | 22.8ir 12.1  45.5 0.0
Rerank [Rocchio3]          15.0ir 4.0   37.9 0.0    | 25.0ir 1.6   47.8 0.0    | 28.0ir 16.0  51.2 0.0  | 19.0i 1.0    33.0r 0.0  | 22.4ir 5.1   45.5 0.0
Performance numbers of the initial ranking that is based on using only the original query, the Robust_MAP and Robust_P@5 settings of the models Rocchio1 and Rocchio3, and the score-based fusion methods. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the incorporated model (Rocchio1 and Rocchio3), respectively.
6.2.1.6 Re-ranking Methods
In this section we analyze the performance of the three re-ranking methods: Rerank, Rev_rerank and Centroid_rerank (see Section 4.2.2). While CombMNZ and Interpolation, which were examined in previous sections, rely on the co-occurrence of documents in the retrieved lists, the re-ranking methods re-order the (top) retrieved results of one retrieval method by the scores assigned by another. Rerank and Rev_rerank use the initial list and the expansion-based list. Specifically, Rerank re-orders the (top) pseudo-feedback-based retrieval results by the documents' initial scores, while Rev_rerank re-orders the (top) initial retrieval results by the documents' pseudo-feedback-based scores. The main idea behind the former is to anchor the documents in the expansion-based list to the initial query by using their initial scores, while the main idea behind the latter is to prevent query drift by considering only documents that appear in the initial list and using the query expansion method to re-order them. As can be seen in Table 7, the Rerank method is more robust than the Rev_rerank method, and both are more robust than the model they incorporate. However, their MAP performance suffers a penalty, as can be expected from the performance-robustness trade-off.
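A minimal sketch of the two re-ordering directions just described is given below; the function names, the cutoff parameter k, and the treatment of documents that lack a score in the other list are illustrative assumptions rather than the exact choices used in the experiments.

    def rerank(exp_ranked, init_scores, k=50):
        # Rerank: re-order the top-k pseudo-feedback-based results by the
        # documents' scores in the initial, query-only retrieval.
        top = exp_ranked[:k]
        return sorted(top, key=lambda d: init_scores.get(d, float("-inf")), reverse=True)

    def rev_rerank(init_ranked, exp_scores, k=50):
        # Rev_rerank: re-order the top-k initial results by the documents'
        # pseudo-feedback-based (expanded-query) scores.
        top = init_ranked[:k]
        return sorted(top, key=lambda d: exp_scores.get(d, float("-inf")), reverse=True)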
In terms of P@5, a very small difference was noted in both performance and robustness between Rev_rerank and the model it incorporates, while a much bigger difference was observed between Rerank and the models it incorporates.
When examining the effect of applying Rerank and Rev_rerank on the expansion-based models, we come to an interesting observation. Applying Rev_rerank keeps the MAP-based order of the models intact, while for Rerank the opposite is true. Namely, the MAP score of RM3 and Rocchio3 is better than that of RM1 and Rocchio1, respectively, and this is also the case when Rev_rerank is applied on the models. However, when Rerank is applied, the opposite behavior is observed, although in this case the differences are relatively small; i.e., the Rerank method is relatively indifferent (in terms of MAP) to the model it incorporates. This behavior calls for further research and examination. One hypothesis is that the content of the expansion-based lists is relatively similar, and that the major difference between the lists comes from their rankings, which are derived from the different scores that the expansion-based models assign to the documents. Hence, Rerank, which re-orders the pseudo-feedback-based retrieval results by the initial retrieval scores, is relatively indifferent to the model it incorporates. Rev_rerank, however, re-orders the initial retrieval results by the documents' pseudo-feedback-based scores and is therefore more affected by the incorporated model.
Unlike Rerank and Rev_rerank, Centroid_rerank uses two expansion-based models: it re-orders the (top) pseudo-feedback-based retrieval results, created using RM1, by the centroid it creates from the initial documents using the Rocchio1 model. RM1 can be viewed as a model that creates a weighted centroid, while Rocchio1 can be viewed as a model that creates a non-weighted centroid (see Section 4.2.2). As can be seen in Table 8, Centroid_rerank usually has lower performance and is less robust than the model it incorporates (RM1). Analyzing these results leads us to believe that re-ordering a list produced by a weighted-centroid method using a non-weighted centroid may not succeed in preventing the drift. The other re-ordering direction is worth testing in future work.
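A rough sketch of Centroid_rerank under simplifying assumptions follows: documents are represented here as sparse term vectors and scored against the non-weighted centroid with cosine similarity, whereas the thesis scores documents against a Rocchio1-style centroid language model; the names and the cutoff k are illustrative only.

    import math

    def centroid(vectors):
        # Non-weighted centroid of sparse {term: weight} vectors (Rocchio1-style).
        c = {}
        for v in vectors:
            for term, w in v.items():
                c[term] = c.get(term, 0.0) + w / len(vectors)
        return c

    def cosine(u, v):
        # Cosine similarity between two sparse term vectors.
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def centroid_rerank(exp_ranked, doc_vectors, init_doc_ids, k=50):
        # Re-order the top-k expansion-based (RM1) results by their similarity
        # to the centroid of the initially retrieved documents.
        c = centroid([doc_vectors[d] for d in init_doc_ids])
        return sorted(exp_ranked[:k], key=lambda d: cosine(doc_vectors[d], c), reverse=True)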
Table 7: Rerank vs. Rev_rerank applied for Best_MAP and Best_P@5
(Corpus order: TREC1-3 | ROBUST | WSJ | SJMN | AP; sub-columns per corpus: MAP, <Init, P@5, <Init.)
Init Rank                 14.9 -       37.9 -     | 25.0 -       47.8 -    | 27.8 -       51.2 -     | 18.9 -       33.0 -     | 22.2 -       45.5 -
RM1                       19.2i 38.7   44.4i 25.3 | 27.5i 45.4   49.2 23.7 | 33.2i 34.0   56.4 22.0  | 24.1i 37.0   39.6i 15.0 | 28.5i 38.4   52.1i 21.2
Rerank [RM1]              17.5ir 27.3  38.9r 1.3  | 26.3i 30.9   48.2 2.8  | 29.8ir 22.0  51.2 4.0   | 20.4ir 16.0  33.6r 1.0  | 25.9ir 20.2  45.3r 1.0
Rev_rerank [RM1]          17.9ir 33.3  44.8 24.7  | 27.7ir 41.0  49.8 22.1 | 32.8i 32.0   57.6 20.0  | 23.0i 37.0   39.4i 16.0 | 24.4ir 23.2  47.9i 11.1
RM3                       20.0i 28.7   44.8i 24.0 | 30.0i 28.1   50.6 14.5 | 34.8i 20.0   58.0i 14.0 | 24.6i 29.0   39.6i 15.0 | 29.1i 28.3   52.1i 17.2
Rerank [RM3]              16.9ir 22.7  38.4r 0.0  | 25.5ir 15.3  47.8 0.0  | 28.4ir 14.0  51.2r 0.0  | 19.9ir 11.0  33.6 1.0   | 25.1ir 12.1  45.5r 0.0
Rev_rerank [RM3]          18.1ir 26.0  44.7i 23.3 | 29.1ir 31.3  50.6 14.5 | 33.9ir 20.0  58.0i 14.0 | 23.5ir 31.0  39.4i 16.0 | 26.5ir 31.3  52.3i 17.2
Rocchio1                  19.2i 40.7   43.1i 21.3 | 26.9 51.8    45.7 34.9 | 31.7i 42.0   55.6 22.0  | 23.2i 45.0   39.2 25.0  | 29.4i 32.3   52.3i 21.2
Rerank [Rocchio1]         17.4ir 30.0  38.5r 0.7  | 26.3i 32.1   48.0 2.4  | 29.7i 20.0   50.8 8.0   | 20.1i 27.0   34.0 1.0   | 26.2ir 17.2  45.9r 0.0
Rev_rerank [Rocchio1]     17.9ir 34.7  43.2i 21.3 | 26.3r 48.2   45.6 35.3 | 31.2i 40.0   55.6 22.0  | 22.2i 40.0   39.2 25.0  | 26.6ir 30.3  52.3i 21.2
Rocchio3                  19.9i 33.3   43.1i 24.7 | 29.2i 36.5   49.3 10.0 | 34.0i 26.0   59.6 16.0  | 24.3i 31.0   40.4i 17.0 | 29.8i 23.2   52.9i 20.2
Rerank [Rocchio3]         16.9ir 22.7  38.3 0.0   | 25.6ir 21.7  47.8 0.0  | 28.3ir 12.0  51.2 2.0   | 20.0ir 7.0   33.0r 0.0  | 25.2ir 10.1  45.7r 0.0
Rev_rerank [Rocchio3]     18.1ir 30.7  43.1 24.7  | 28.0ir 35.3  49.3 10.0 | 33.3ir 24.0  59.6 16.0  | 23.0ir 31.0  40.2i 17.0 | 26.9ir 25.3  52.9i 20.2
Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP and Best_P@5 settings of the models RM1, RM3, Rocchio1 and Rocchio3, and the Rerank and Rev_rerank methods based on them. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the incorporated model, respectively.
Table 8: Re-ranking methods applied for Best_MAP RM1
(Corpus order: TREC1-3 | ROBUST | WSJ | SJMN | AP; sub-columns per corpus: MAP, <Init, P@5, <Init.)
Init Rank           14.9 -       37.9 -      | 25.0 -       47.8 -      | 27.8 -       51.2 -     | 18.9 -       33.0 -     | 22.2 -       45.5 -
RM1                 19.2i 38.7   44.4i 25.3  | 27.5i 45.4   49.2 23.7   | 33.2i 34.0   56.4 22.0  | 24.1i 37.0   39.6i 15.0 | 28.5i 38.4   52.1i 21.2
Rerank [RM1]        17.5ir 27.3  38.9r 1.3   | 26.3i 30.9   48.2 2.8    | 29.8ir 22.0  51.2 4.0   | 20.4ir 16.0  33.6r 1.0  | 25.9ir 20.2  45.3r 1.0
Rev_rerank [RM1]    17.9ir 33.3  44.8 24.7   | 27.7ir 41.0  49.8 22.1   | 32.8i 32.0   57.6 20.0  | 23.0i 37.0   39.4i 16.0 | 24.4ir 23.2  47.9i 11.1
Centroid_rerank     18.1ir 43.3  36.7r 34.0  | 20.5ir 70.7  38.5ir 46.2 | 23.2ir 66.0  46.8r 42.0 | 21.8r 44.0   37.0 27.0  | 26.7ir 38.4  48.3 27.3
Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP setting of RM1, and the re-ordering methods. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the incorporated model, respectively.
6.2.2 Rank-Based Fusion Methods
In the previous section we analyzed the performance of score-based fusion
methods, while in this section we examine the impact on performance of
the rank-based fusion methods. As can be seen in Table 9, all rank-based
fusion methods attain better performance (both MAP and P@5) than that
of the initial ranking, as was the case for score-based fusion methods (see
Section 6.2.1.1).
There is one similarity between the query expansion models that perform an interpolation with the original query (RM3 and Rocchio3), and another between the models that do not (RM1 and Rocchio1). Looking at the results of RM3 and Rocchio3 and those of the rank-based methods applied on them, we note that the performance-robustness trade-off usually holds: they have the best MAP and P@5 performance, but all the methods incorporating them are more robust. Looking at the RM1-based methods, we note that Interpolation [RM1] overcomes the performance-robustness trade-off in terms of MAP, having both better MAP and better robustness than RM1. This is also the case for Interpolation [Rocchio1] with respect to Rocchio1, but it is not the case for RM3 or Rocchio3, for which the Interpolation method acts according to the performance-robustness trade-off.
Comparing the rank-based methods applied on RM1 with the corresponding methods applied on RM3, we see that CombMNZ and bordaRank are the most robust and usually have a similar robustness score. We also note that CombMNZ and bordaRank follow the performance-robustness trade-off: when applied on RM1 they have better (MAP and P@5) performance, while when applied on RM3 they are more robust. The Interpolation method overcomes this trade-off: when applied on RM3 it usually attains both better performance and better robustness. Comparing the rank-based methods applied on Rocchio1 with the corresponding methods applied on Rocchio3, we note that CombMNZ and bordaRank follow the performance-robustness trade-off, similarly to when they are applied on RM1 and RM3.
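For reference, a minimal sketch of Borda-count rank fusion of the initial and expansion-based lists is given below; the convention that a document absent from a list receives zero points from it is an assumption, and the names are illustrative.

    def borda_fuse(ranked_a, ranked_b):
        # Each list awards a document (list_length - rank) points; points from
        # both lists are summed and documents are sorted by their totals.
        points = {}
        for ranked in (ranked_a, ranked_b):
            n = len(ranked)
            for rank, doc in enumerate(ranked):
                points[doc] = points.get(doc, 0) + (n - rank)
        return sorted(points, key=points.get, reverse=True)

For example, borda_fuse(init_ranked, exp_ranked) returns the documents ordered by their total Borda points.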
Table 9: Rank-based fusion methods applied for Best_MAP and Best_P@5
(Corpus order: TREC1-3 | ROBUST | WSJ | SJMN | AP; sub-columns per corpus: MAP, <Init, P@5, <Init.)
Init Rank                         14.9 -       37.9 -     | 25.0 -       47.8 -     | 27.8 -       51.2 -     | 18.9 -       33.0 -     | 22.2 -       45.5 -
RM1                               19.2i 38.7   44.4i 25.3 | 27.5i 45.4   49.2 23.7  | 33.2i 34.0   56.4 22.0  | 24.1i 37.0   39.6i 15.0 | 28.5i 38.4   52.1i 21.2
Interpolation rankSim [RM1]       19.5ir 30.7  43.9 21.3  | 28.8ir 35.7  50.1 15.7  | 34.0ir 30.0  59.6r 14.0 | 24.3ir 33.0  38.4i 17.0 | 28.8ir 33.3  50.5i 16.2
CombMNZ rankSim [RM1]             18.9i 21.3   43.1 15.3  | 28.6ir 31.7  49.2 15.7  | 33.4i 22.0   56.4 18.0  | 23.3i 21.0   37.6i 13.0 | 27.8i 28.3   49.9i 19.2
bordaRank [RM1]                   19.0i 22.0   43.1 15.3  | 28.7ir 30.9  49.0 15.7  | 33.5i 22.0   56.4 18.0  | 23.5i 21.0   37.6i 13.0 | 27.9i 28.3   50.1i 19.2
RM3                               20.0i 28.7   44.8i 24.0 | 30.0i 28.1   50.6 14.5  | 34.8i 20.0   58.0i 14.0 | 24.6i 29.0   39.6i 15.0 | 29.1i 28.3   52.1i 17.2
Interpolation rankSim [RM3]       19.8i 24.0   43.9i 19.3 | 29.5ir 24.9  50.4 13.7  | 34.2ir 20.0  58.0i 12.0 | 24.2i 26.0   38.4i 17.0 | 28.8i 28.3   51.9i 16.2
CombMNZ rankSim [RM3]             18.3ir 17.3  42.1i 15.3 | 27.7ir 22.5  49.2 10.8  | 31.6ir 16.0  56.4 10.0  | 22.4ir 21.0  37.6i 13.0 | 26.9ir 23.2  48.5r 14.1
bordaRank [RM3]                   18.3ir 18.0  42.1i 15.3 | 27.8ir 22.9  49.0 12.0  | 31.7ir 16.0  56.4 10.0  | 22.6ir 20.0  37.6i 13.0 | 27.1ir 24.2  48.7r 14.1
Rocchio1                          19.2i 40.7   43.1i 21.3 | 26.9 51.8    45.7 34.9  | 31.7i 42.0   55.6 22.0  | 23.2i 45.0   39.2 25.0  | 29.4i 32.3   52.3i 21.2
Interpolation rankSim [Rocchio1]  19.5ir 36.0  41.9 22.7  | 28.1ir 37.8  48.8r 12.0 | 33.0i 26.0   57.2i 14.0 | 23.3ir 42.0  38.6r 25.0 | 29.7ir 26.3  51.5i 20.2
CombMNZ rankSim [Rocchio1]        18.8i 27.3   41.1 21.3  | 27.9ir 36.9  47.6 26.9  | 32.7i 24.0   53.2 26.0  | 22.4i 32.0   36.8 18.0  | 28.2i 23.2   49.1 20.2
bordaRank [Rocchio1]              18.9i 29.3   41.2 21.3  | 28.0ir 35.7  47.9 26.9  | 32.7i 24.0   52.8 26.0  | 22.5i 31.0   36.6 19.0  | 28.3i 22.2   49.1 20.2
Rocchio3                          19.9i 33.3   38.1i 1.3  | 29.2i 36.5   48.0 1.6   | 34.0i 26.0   52.0 0.0   | 24.3i 31.0   34.0i 0.0  | 29.8i 23.2   45.7i 0.0
Interpolation rankSim [Rocchio3]  19.7i 31.3   38.0i 0.0  | 28.8i 34.9   47.8 0.0   | 33.5i 22.0   52.0r 0.0  | 23.8i 28.0   34.0ir 0.0 | 29.5i 18.2   45.7i 0.0
CombMNZ rankSim [Rocchio3]        18.3ir 23.3  38.0i 0.0  | 27.5ir 26.9  47.9 0.8   | 31.6ir 20.0  51.6 0.0   | 22.2ir 22.0  33.6ir 0.0 | 27.2ir 16.2  45.7i 0.0
bordaRank [Rocchio3]              18.4ir 24.7  38.0i 0.0  | 27.7ir 26.5  47.9 0.8   | 31.7ir 20.0  51.6 0.0   | 22.4ir 21.0  33.6r 0.0  | 27.4ir 17.2  45.7i 0.0
Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP and Best_P@5 settings of the models RM1, RM3, Rocchio1 and Rocchio3, and the rank-based fusion methods. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the model they incorporate (RM1, RM3, Rocchio1 or Rocchio3), respectively.
6.2.3 Score-Based and Rank-Based Fusion Methods Comparison
In previous sections we presented two types of fusion methods, score-based and rank-based. Figure 7 and Figure 8 compare the performance of score-based and rank-based CombMNZ and Interpolation, respectively, on the Best_MAP setting of each expansion-based model.
We can observe that the score-based methods (both CombMNZ and Interpolation) are usually more robust than the corresponding rank-based methods, but have a lower MAP score. Another observation derived from Figure 7 is that both score-based and rank-based CombMNZ are more robust than the model they incorporate, but usually have a lower MAP. Figure 8 shows that, like CombMNZ, both score-based and rank-based Interpolation are more robust than the model they incorporate; unlike CombMNZ, however, when applied on RM1 and Rocchio1, both score-based and rank-based Interpolation manage to overcome the inherent trade-off between MAP and robustness by improving them both.
Another observation from Figures 7 and 8 is that the score-based methods obtain a greater robustness improvement than the rank-based methods do, and that both types of methods have the highest impact when applied on RM1 and Rocchio1.
Figure 7: Score-based vs. rank-based CombMNZ
[Figure: two panels, "CombMNZ applied on RM1, RM3, Rocchio1 and Rocchio3 - MAP" and "CombMNZ applied on RM1, RM3, Rocchio1 and Rocchio3 - Robustness"; x-axis: corpus (TREC1-3, ROBUST, WSJ, SJMN, AP) crossed with expansion model (RM1, RM3, Rocchio1, Rocchio3); series: query expansion model, CombMNZ, CombMNZ rankSim.]
Figure 8: Score-based vs. rank-based Interpolation
[Figure: two panels, "Interpolation applied on RM1, RM3, Rocchio1 and Rocchio3 - MAP" and "Interpolation applied on RM1, RM3, Rocchio1 and Rocchio3 - Robustness"; x-axis: corpus (TREC1-3, ROBUST, WSJ, SJMN, AP) crossed with expansion model (RM1, RM3, Rocchio1, Rocchio3); series: query expansion model, Interpolation, Interpolation rankSim.]
7 Conclusions and Future Work
We addressed the performance robustness problem of pseudo-feedback-based
query expansion models. That is, the fact that for some queries using the
original query alone results in much better performance than that attained
by using the expanded form. The performance robustness problem is often
attributed to query drift — the change in intention between the original query
and the expanded form. Thus, we posed as a goal to potentially ameliorate
query drift; specifically, by using information induced from document-query
surface level similarities, as manifested, for example, in the ranking induced
by these similarities — i.e., the ranking of the corpus in response to the
original query.
One approach for ameliorating query drift that we have presented relies
on fusing the lists retrieved in response to the query and to its expanded form
so as to perform query-anchoring of the latter. The second approach is based
on re-ranking documents retrieved by pseudo-feedback-based retrieval using
their query similarity. Both approaches are based on the premise that docu-
ments retrieved in response to the expanded query, and which exhibit high
query similarity, are less prone to exhibit query drift. Indeed, we showed
empirically that such methods help to improve the robustness of pseudo-
feedback-based retrieval while slightly hurting the overall average perfor-
mance, albeit not to a statistically significant degree in most cases. This
average-performance/robustness trade-off arose in most of the methods that we have presented.
We have also shown that our approaches can improve the performance robustness of expansion models that perform query-anchoring at the model level, i.e., when constructing the expanded form. Thus, pre-retrieval query-anchoring performed at the model level and post-retrieval query-anchoring performed by our methods can be viewed as complementary. Furthermore, our methods can improve the robustness of query expansion models that are tuned to optimize robustness. In addition, we showed that our methods are more effective in improving robustness than some previously proposed approaches (e.g., a cluster-based method), but sometimes at the cost of somewhat hurting overall average performance.
Additional exploration showed that our fusion-based approach is effective
in improving performance robustness whether used upon retrieval scores or
upon ranks of documents. We hasten to point out that since pseudo-feedback-
based retrieval calls for two retrievals (one using the original query and one
using the expanded form), the overall computational overhead incurred by our fusion-based approach is quite minimal.
The overall performance and robustness of our approach demonstrate the potential of this line of research and hence call for future examination. For example, one can explore more sophisticated fusion methods [81, 4, 49, 86]. Testing our methods on additional query-expansion models is another avenue worth exploring.
References
[1] Nasreen Abdul-Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Mark D. Smucker, and Courtney Wade. UMass at TREC 2004: Novelty and HARD. In Proceedings of the Thirteenth Text Retrieval Conference (TREC-13), 2004.
[2] James Allan, Margaret E. Connell, W. Bruce Croft, Fang-Fang Feng,
David Fisher, and Xiaoyan Li. INQUERY and TREC-9. In Proceedings
of the Ninth Text Retrieval Conference (TREC-9), pages 551–562, 2000.
NIST Special Publication 500-249.
[3] Giambattista Amati, Claudio Carpineto, and Giovanni Romano. Query
difficulty, robustness, and selective application of query expansion. In
Proceedings of ECIR, pages 127–137, 2004.
[4] Javed A. Aslam and Mark Montague. Bayes optimal metasearch: a
probabilistic model for combining the results of multiple retrieval sys-
tems (poster session). In SIGIR ’00: Proceedings of the 23rd annual
international ACM SIGIR conference on Research and development in
information retrieval, pages 379–381, 2000.
[5] Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew. Auto-
matic combination of multiple ranked retrieval systems. In SIGIR ’94:
Proceedings of the 17th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 173–181, 1994.
[6] Steven M. Beitzel, Ophir Frieder, Eric C. Jensen, David Grossman, Ab-
dur Chowdhury, and Nazli Goharian. Disproving the fusion hypothesis:
an analysis of data fusion via effective information retrieval strategies.
In SAC ’03: Proceedings of the 2003 ACM symposium on Applied com-
puting, pages 823–827, 2003.
[7] Nicholas J. Belkin, C. Cool, W. Bruce Croft, and James P. Callan. The effect of multiple query representations on information retrieval system performance. In SIGIR ’93: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, pages 339–346, 1993.
[8] Bodo Billerbeck and Justin Zobel. When query expansion fails. In
Proceedings of SIGIR, pages 387–388, 2003.
[9] Chris Buckley. Why current IR engines fail. In Proceedings of SIGIR,
pages 584–585, 2004. Poster.
[10] Chris Buckley and Donna Harman. Reliable information access final
workshop report. Technical report, 2004.
[11] Chris Buckley and Mandar Mitra. Using clustering and superconcepts within SMART: TREC 6. pages 107–124, 1998.
[12] Chris Buckley, Mandar Mitra, Janet Walz, and Claire Cardie. Using clustering and superconcepts within SMART: TREC 6. Information Processing and Management, 36(1):109–131, 2000.
[13] Chris Buckley, Gerard Salton, James Allan, and Amit Singhal. Automatic query expansion using SMART: TREC 3. In Proceedings of the Third Text Retrieval Conference (TREC-3), pages 69–80, 1994.
[14] Guihong Cao, Jian-Yun Nie, Jianfeng Gao, and Stephen Robertson. Se-
lecting good expansion terms for pseudo-relevance feedback. In SIGIR
’08: Proceedings of the 31st annual international ACM SIGIR confer-
ence on Research and development in information retrieval, pages 243–
250, 2008.
[15] Claudio Carpineto, Renato de Mori, Giovanni Romano, and Brigitte
Bigi. An information-theoretic approach to automatic query expansion.
ACM Transactions on Information Systems, 19(1):1–27, 2001.
[16] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998. Cited in Chengxiang Zhai and John D. Lafferty, A study of smoothing methods for language models applied to ad hoc information retrieval.
[17] Kevyn Collins-Thompson and Jamie Callan. Estimation and use of un-
certainty in pseudo-relevance feedback. In Proceedings of SIGIR, pages
303–310, 2007.
[18] Margaret Connell, Ao Feng, Giridhar Kumaran, Hema Raghavan, Chi-
rag Shah, and James Allan. UMass at TDT 2004. TDT2004 System
Description, 2004.
[19] W. Bruce Croft. A model of cluster searching based on classification.
Information Systems, 5:189–195, 1980.
[20] Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. A language
modeling framework for selective query expansion. Technical Report
IR-338, Center for Intelligent Information Retrieval, University of Mas-
sachusetts, 2004.
[21] Padima Das-Gupta and Jeffrey Katzer. A study of the overlap among
document representations. In SIGIR ’83: Proceedings of the 6th annual
international ACM SIGIR conference on Research and development in
information retrieval, pages 106–114, 1983.
[22] Jean-Charles de Borda. Mémoire sur les élections au scrutin. Histoire de l’Académie Royale des Sciences, Paris, 1781.
[23] Fernando Diaz and Donald Metzler. Improving the estimation of rel-
evance models using large external corpora. In Proceedings of SIGIR,
pages 154–161, 2006.
[24] Djoerd Hiemstra and Wessel Kraaij. Twenty-One at TREC-7: Ad-hoc and cross-language track. In Proceedings of the Seventh Text Retrieval Conference (TREC-7), pages 227–238, 1999.
[25] H.L. Fisher and D.R. Elchesen. Effectiveness of combining title words
and index terms in machine retrieval searches.
[26] Larry Fitzpatrick and Mei Dent. Automatic feedback using past queries:
social searching? In SIGIR ’97: Proceedings of the 20th annual inter-
national ACM SIGIR conference on Research and development in infor-
mation retrieval, pages 306–313, 1997.
[27] Edward A. Fox and Joseph A. Shaw. Combination of multiple searches.
In Proceedings of TREC-2, 1994.
[28] Alan Griffiths, H. Claire Luckhurst, and Peter Willett. Using interdoc-
ument similarity information in document retrieval systems. Journal
of the American Society for Information Science (JASIS), 37(1):3–11,
1986. Reprinted in Karen Sparck Jones and Peter Willett, eds., Readings
in Information Retrieval, Morgan Kaufmann, pp. 365–373, 1997.
[29] Donna Harman and Chris Buckley. The NRRC reliable information
access (RIA) workshop. In Proceedings of SIGIR, pages 528–529, 2004.
Poster.
[30] Donna K. Harman, editor. Overview of the first TREC conference, 1993.
[31] D. Hawking, P. Thistlewaite, and N. Craswell. ANU/ACSys TREC-6 experiments. In Proceedings of the Sixth Text Retrieval Conference (TREC-6), NIST Special Publication 500-240, pages 275–290, 1998.
[32] Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In SIGIR ’96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 76–84. ACM, 1996.
[33] N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in
information retrieval. Information Storage and Retrieval, 7(5):217–240,
1971.
[34] Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, Eds., pages 381–402, 1980.
[35] S. Kullback and R. A. Leibler. On information and sufficiency. The
Annals of Mathematical Statistics, 22(1):79–86, 1951.
[36] Oren Kurland and Lillian Lee. Corpus structure, language models, and
ad hoc information retrieval. In Proceedings of SIGIR, pages 194–201,
2004.
[37] Oren Kurland, Lillian Lee, and Carmel Domshlak. Better than the real
thing? Iterative pseudo-query processing using cluster-based language
models. In Proceedings of SIGIR, pages 19–26, 2005.
[38] John D. Lafferty and Chengxiang Zhai. Document language models,
query models, and risk minimization for information retrieval. In Pro-
ceedings of SIGIR, pages 111–119, 2001.
[39] Adenike M. Lam-Adesina and Gareth J. F. Jones. Applying summariza-
tion techniques for term selection in relevance feedback. In SIGIR ’01:
Proceedings of the 24th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 1–9, 2001.
[40] Victor Lavrenko and W. Bruce Croft. Relevance models in information retrieval. In W. Bruce Croft and John Lafferty, editors, Language Modeling for Information Retrieval, pages 11–56. Kluwer Academic Publishers, 2003.
[41] Victor Lavrenko and W. Bruce Croft. Relevance-based language models.
In Proceedings of SIGIR, pages 120–127, 2001.
[42] J.H. Lee. Combining multiple evidence from different properties of
weighting schemes. In Proceedings of SIGIR, pages 180–188, 1995.
[43] J.H. Lee. Analyses of multiple evidence combination. In Proceedings of
SIGIR, pages 267–276, 1997.
[44] Kyung Soon Lee, W. Bruce Croft, and James Allan. A cluster-based
resampling method for pseudo-relevance feedback. In SIGIR ’08: Pro-
ceedings of the 31st annual international ACM SIGIR conference on Re-
search and development in information retrieval, pages 235–242. ACM,
2008.
[45] Kyung-Soon Lee, Kyo Kageura, and Key-Sun Choi. Implicit ambiguity resolution using incremental clustering in cross-language information retrieval. Information Processing and Management, 40(1):145–159, 2004.
[46] Kyung-Soon Lee, Young-Chan Park, and Key-Sun Choi. Re-ranking
model based on document clusters. Information Processing and Man-
agement, 37(1):1–14, 2001.
[47] Xiaoyan Li. A new robust relevance model in the language model frame-
work. Inf. Process. Manage., 44(3):991–1007, 2008.
[48] Xiaoyan Li and W. Bruce Croft. Improving the robustness of relevance-
based language models. Technical Report IR-401, Center for Intelligent
Information Retrieval, University of Massachusetts, 2005.
[49] David Lillis, Fergus Toolan, Rem Collier, and John Dunnion. Probfuse:
a probabilistic approach to data fusion. In SIGIR ’06: Proceedings of
the 29th annual international ACM SIGIR conference on Research and
development in information retrieval, pages 139–146, 2006.
[50] Xiaoyong Liu and W. Bruce Croft. Cluster-based retrieval using lan-
guage models. In Proceedings of SIGIR, pages 186–193, 2004.
[51] M. Lu, A. Ayoub, and J. Dong. Ad hoc experiments using Eureka. In NIST Special Publication 500-238: The Fifth Text Retrieval Conference (TREC-5), 1997.
[52] Thomas R. Lynam, Chris Buckley, Charles L. A. Clarke, and Gordon V.
Cormack. A multi-system analysis of document and term selection for
blind feedback. In CIKM ’04: Proceedings of the thirteenth ACM inter-
national conference on Information and knowledge management, pages
261–269, 2004.
[53] M. McGill, M. Koll, and T. Noreault. An evaluation of factors affecting document ranking by information retrieval systems.
[54] Hiriko Mano and Yasushi Ogawa. Selecting expansion terms in auto-
matic query expansion. In Proceedings of SIGIR, pages 390–391, 2001.
Poster.
[55] Donald Metzler, Fernando Diaz, Trevor Strohman, and W. Bruce Croft.
Using mixtures of relevance models for query expansion. In Proceedings
of the Fourteenth Text Retrieval Conference (TREC), 2005.
[56] David R. H. Miller, Tim Leek, and Richard M. Schwartz. A hidden
Markov model information retrieval system. In Proceedings of SIGIR,
pages 214–221, 1999.
[57] Mandar Mitra, Amit Singhal, and Chris Buckley. Improving automatic
query expansion. In Proceedings of SIGIR, pages 206–214, 1998.
[58] Jesse Montgomery, Luo Si, Jamie Callan, and David A. Evans. Effect
of varying number of documents in blind feedback, analysis of the 2003
NRRC RIA workshop bf_numdocs experiment suite. In Proceedings
of SIGIR, pages 476–477, 2004. Poster.
[59] Natali Soskin, Oren Kurland, and Carmel Domshlak. Navigating in the dark: Modeling uncertainty in ad hoc retrieval using multiple relevance
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansion
Query-drift prevention for robust query expansion

More Related Content

What's hot

IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET Journal
 
An effective adaptive approach for joining data in data
An effective adaptive approach for joining data in dataAn effective adaptive approach for joining data in data
An effective adaptive approach for joining data in dataeSAT Publishing House
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Partitioning of Query Processing in Distributed Database System to Improve Th...
Partitioning of Query Processing in Distributed Database System to Improve Th...Partitioning of Query Processing in Distributed Database System to Improve Th...
Partitioning of Query Processing in Distributed Database System to Improve Th...IRJET Journal
 
Modeling Text Independent Speaker Identification with Vector Quantization
Modeling Text Independent Speaker Identification with Vector QuantizationModeling Text Independent Speaker Identification with Vector Quantization
Modeling Text Independent Speaker Identification with Vector QuantizationTELKOMNIKA JOURNAL
 
Classification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithmClassification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithmeSAT Publishing House
 
Performance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information RetrievalPerformance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information Retrievalidescitation
 
Programmer information needs after memory failure
Programmer information needs after memory failureProgrammer information needs after memory failure
Programmer information needs after memory failureBhagyashree Deokar
 
Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...IRJET Journal
 
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEA CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEIJDKP
 
A New Algorithm for Inferring User Search Goals with Feedback Sessions
A New Algorithm for Inferring User Search Goals with Feedback SessionsA New Algorithm for Inferring User Search Goals with Feedback Sessions
A New Algorithm for Inferring User Search Goals with Feedback SessionsIJERA Editor
 
On the benefit of logic-based machine learning to learn pairwise comparisons
On the benefit of logic-based machine learning to learn pairwise comparisonsOn the benefit of logic-based machine learning to learn pairwise comparisons
On the benefit of logic-based machine learning to learn pairwise comparisonsjournalBEEI
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection methodIJSRD
 
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...ertekg
 
Microsoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMicrosoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMashfiq Shahriar
 

What's hot (19)

IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
 
An effective adaptive approach for joining data in data
An effective adaptive approach for joining data in dataAn effective adaptive approach for joining data in data
An effective adaptive approach for joining data in data
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Partitioning of Query Processing in Distributed Database System to Improve Th...
Partitioning of Query Processing in Distributed Database System to Improve Th...Partitioning of Query Processing in Distributed Database System to Improve Th...
Partitioning of Query Processing in Distributed Database System to Improve Th...
 
Modeling Text Independent Speaker Identification with Vector Quantization
Modeling Text Independent Speaker Identification with Vector QuantizationModeling Text Independent Speaker Identification with Vector Quantization
Modeling Text Independent Speaker Identification with Vector Quantization
 
Classification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithmClassification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithm
 
Performance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information RetrievalPerformance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information Retrieval
 
Programmer information needs after memory failure
Programmer information needs after memory failureProgrammer information needs after memory failure
Programmer information needs after memory failure
 
A4 elanjceziyan
A4 elanjceziyanA4 elanjceziyan
A4 elanjceziyan
 
Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...
 
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEA CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
 
Ijetcas14 347
Ijetcas14 347Ijetcas14 347
Ijetcas14 347
 
A New Algorithm for Inferring User Search Goals with Feedback Sessions
A New Algorithm for Inferring User Search Goals with Feedback SessionsA New Algorithm for Inferring User Search Goals with Feedback Sessions
A New Algorithm for Inferring User Search Goals with Feedback Sessions
 
On the benefit of logic-based machine learning to learn pairwise comparisons
On the benefit of logic-based machine learning to learn pairwise comparisonsOn the benefit of logic-based machine learning to learn pairwise comparisons
On the benefit of logic-based machine learning to learn pairwise comparisons
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection method
 
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
 
I6 mala3 sowmya
I6 mala3 sowmyaI6 mala3 sowmya
I6 mala3 sowmya
 
Sub1583
Sub1583Sub1583
Sub1583
 
Microsoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMicrosoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data Science
 

Similar to Query-drift prevention for robust query expansion

Similar to Query-drift prevention for robust query expansion (20)

Aregay_Msc_EEMCS
Aregay_Msc_EEMCSAregay_Msc_EEMCS
Aregay_Msc_EEMCS
 
Final_Thesis
Final_ThesisFinal_Thesis
Final_Thesis
 
Research: Developing an Interactive Web Information Retrieval and Visualizati...
Research: Developing an Interactive Web Information Retrieval and Visualizati...Research: Developing an Interactive Web Information Retrieval and Visualizati...
Research: Developing an Interactive Web Information Retrieval and Visualizati...
 
Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010
 
master-thesis
master-thesismaster-thesis
master-thesis
 
Thesis_Nazarova_Final(1)
Thesis_Nazarova_Final(1)Thesis_Nazarova_Final(1)
Thesis_Nazarova_Final(1)
 
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
 
Z suzanne van_den_bosch
Z suzanne van_den_boschZ suzanne van_den_bosch
Z suzanne van_den_bosch
 
Zomato Crawler & Recommender
Zomato Crawler & RecommenderZomato Crawler & Recommender
Zomato Crawler & Recommender
 
A.R.C. Usability Evaluation
A.R.C. Usability EvaluationA.R.C. Usability Evaluation
A.R.C. Usability Evaluation
 
gusdazjo_thesis
gusdazjo_thesisgusdazjo_thesis
gusdazjo_thesis
 
Degreeproject
DegreeprojectDegreeproject
Degreeproject
 
Thesis
ThesisThesis
Thesis
 
ilp
ilpilp
ilp
 
SearsonGP_PLS_PhD_thesis
SearsonGP_PLS_PhD_thesisSearsonGP_PLS_PhD_thesis
SearsonGP_PLS_PhD_thesis
 
Upstill_Thesis_Revised_17Aug05
Upstill_Thesis_Revised_17Aug05Upstill_Thesis_Revised_17Aug05
Upstill_Thesis_Revised_17Aug05
 
Query-Based Retrieval of Annotated Document
Query-Based Retrieval of Annotated DocumentQuery-Based Retrieval of Annotated Document
Query-Based Retrieval of Annotated Document
 
2014 USA
2014 USA2014 USA
2014 USA
 
final
finalfinal
final
 
BA_FCaballero
BA_FCaballeroBA_FCaballero
BA_FCaballero
 

Query-drift prevention for robust query expansion

  • 1. Robust Query Expansion Based on Query-Drift Prevention Research Thesis In Partial Fulfillment of The Requirements for the Degree of Master of Science in Information Management Engineering Liron Zighelnic Submitted to the Senate of the Technion - Israel Institute of Technology Sivan, 5770 Haifa May 2010
  • 2. The Research Thesis Was Done Under The Supervision of Dr. Oren Kurland in the Faculty of Industrial Engineering and Management. The Helpful Comments Of The Reviewers Of SIGIR 2008 Are Gratefully Acknowledged The Generous Financial Help Of The Technion - Israel Institute of Technology Is Gratefully Acknowledged This paper is based upon work supported in part by Google’s and IBM’s faculty research awards. Any opinions, findings and conclusions or recommendations expressed are those of the authors and do not necessarily reflect those of the sponsors.
  • 3. ACKNOWLEDGMENTS I would like to thank my advisor DR. Oren Kurland for his guidance I would like to thank my colleges and specially to: Anna Shtok , Erez Karpas, Nataly Soskin, Inna Gelfer Kalmanovich and Lior Meister I would like to express a deep gratitude to my parents, Zahava and Michael Zighelnic, for their love and support I would like to thank my dear husband Ilan for his help, love and support and for always standing by me
  • 4. Contents 1 Introduction 2 2 Background 6 2.1 LM - Introduction and Models . . . . . . . . . . . . . . . . . . 6 2.2 Language Model Estimation . . . . . . . . . . . . . . . . . . . 7 2.3 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.4 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3 Query Expansion 10 3.1 Pseudo-Feedback . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.2 Query Expansion Models . . . . . . . . . . . . . . . . . . . . . 11 3.3 The Performance Robustness Problem . . . . . . . . . . . . . 14 4 Query Drift Prevention 16 4.1 Improving Robustness Using Fusion . . . . . . . . . . . . . . . 16 4.2 Score-Based Fusion Methods . . . . . . . . . . . . . . . . . . . 17 4.2.1 Symmetric Fusion Methods . . . . . . . . . . . . . . . 17 4.2.2 Re-ordering Methods . . . . . . . . . . . . . . . . . . . 18 4.3 Rank-Based Fusion Methods . . . . . . . . . . . . . . . . . . . 19 5 Related Work 21 5.1 Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 5.2 Improving the Performance Robustness of Pseudo-Feedback- Based Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 22 6 Experiments 27 6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 27 6.1.1 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . 27 6.1.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . 27 6.1.3 Parameter Tuning . . . . . . . . . . . . . . . . . . . . . 28
  • 5. 6.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 29 6.2.1 Score-Based Fusion Methods . . . . . . . . . . . . . . . 29 6.2.1.1 Best_MAP and Best_P@5 Settings . . . . . 29 6.2.1.2 Analyzing Robustness Improvement due to Score-based Fusion Methods . . . . . . . . . . 34 6.2.1.3 The Effect of the Query-Anchoring Parame- ter Used in RM3 on the Performance of RM3 and CombMNZ [RM3] . . . . . . . . . . . . . 36 6.2.1.4 Comparison with a Cluster-Based Method . . 38 6.2.1.5 Robust_MAP and Robust_P@5 Settings . . 39 6.2.1.6 Re-ranking Methods . . . . . . . . . . . . . . 41 6.2.2 Rank-Based Fusion Methods . . . . . . . . . . . . . . . 44 6.2.3 Score-Based and Rank-Based Fusion Methods Com- parison . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 7 Conclusions and Future Work 49
List of Tables

1. TREC corpora used for experiments
2. Score-based fusion methods applied for Best_MAP and Best_P@5 on RM1 and RM3
3. Score-based fusion methods applied for Best_MAP and Best_P@5 on Rocchio1 and Rocchio3
4. Score-based methods vs. cluster-based method [44] applied for Best_MAP RM3
5. Score-based fusion methods applied for Robust_MAP and Robust_P@5 on RM1 and RM3
6. Score-based fusion methods applied for Robust_MAP and Robust_P@5 on Rocchio1 and Rocchio3
7. Rerank vs. Rev_rerank applied for Best_MAP and Best_P@5
8. Re-ranking methods applied for Best_MAP RM1
9. Rank-based fusion methods applied for Best_MAP and Best_P@5

List of Figures

1. The difference between the AP — average precision, an evaluation measurement of the general quality of ranking — per query of Dinit (initial ranking) and that of the expansion-based models (RM1 or RM3) over the ROBUST corpus (queries 301-350)
2. Score-based fusion methods applied for Best_MAP on RM1 and RM3 - a graph representation of Table 2
3. Score-based fusion methods applied for Best_MAP on Rocchio1 and Rocchio3 - a graph representation of Table 3
4. Robustness improvement posted by CombMNZ
5. Robustness improvement posted by Interpolation
6. RM3 and CombMNZ [RM3] MAP and robustness performance when varying RM3's query-anchoring parameter
7. Score-based vs. rank-based CombMNZ
8. Score-based vs. rank-based Interpolation
Abstract

Search engines (e.g., Google) have become a crucial means for finding information in large corpora (repositories). The ad hoc retrieval task is the core challenge that search engines have to cope with: finding documents that pertain to an information need underlying a query. Pseudo-feedback-based (query expansion) retrieval is the process by which the documents most highly ranked by a search performed in response to a query are used for forming an expanded query, which is then used for (re-)ranking the entire corpus. The underlying idea is to automatically "enrich" the original query with (assumed) related terms so as to bridge the "vocabulary gap" between short queries and relevant documents. While this process improves retrieval effectiveness on average, there are many queries for which the resultant effectiveness is inferior to that obtained by using the original query.

In this work we address this performance robustness problem by tackling the query drift problem — the shift in "intention" from the original query to its expanded form. One approach for ameliorating query drift that we present relies on fusing the lists retrieved in response to the query and to its expanded form so as to perform query-anchoring of the latter. While the fusion-based approach relies on co-occurrence of documents in retrieved lists, our second approach re-ranks the (top) retrieval results of one retrieval method by the scores assigned by another. Both approaches are based on the assumption that documents retrieved in response to the expanded query, and which exhibit high query similarity, are less likely to exhibit query drift.

We show empirically that our approach posts significantly better performance than that of retrieval based only on the original query. The performance is also more robust than that of retrieval using the expanded query, albeit somewhat lower on average. This performance-robustness trade-off characterizes most of the methods that we have examined.
1 Introduction

Search engines (e.g., Yahoo!, Google) have become a crucial tool for finding information in large corpora (repositories) of digital information (e.g., the Web). One of the most important tasks that a search engine has to perform is to find the most relevant documents to an information need underlying a user query; this task is also known as the ad hoc retrieval task [72].

It is known that users tend to use (very) short queries to describe their information needs [77]. This reality poses significant challenges to a retrieval system. A case in point: short queries might be interpreted in different ways, hence suggesting different potential information needs (a.k.a. the polysemy problem). Another challenge in handling short queries is the vocabulary mismatch problem [13, 87]: some (or even all) query terms might not appear in relevant documents.

One approach for addressing the vocabulary mismatch problem is to ask the user to provide "feedback" on documents returned by an initial search (e.g., via marks indicating which documents are relevant) so as to help the retrieval algorithm to better "focus" the search; this can be done, for example, by using additional terms (from the relevant documents) to expand the query with [70]. However, in the vast majority of cases, no such user feedback is available. Therefore, researchers proposed to take a pseudo-feedback (a.k.a. blind feedback) approach to query expansion, which treats the documents most highly ranked by an initial search as relevant; then, information induced from these documents is used for modifying (expanding) the original query [13, 87].

Naturally, there are several inherent problems in pseudo-feedback-based query expansion. The first is that not all (and in many cases, only very few) of the documents "treated" as relevant are indeed relevant; thus, the newly formed "query model" does not necessarily reflect the information need underlying the original query. Furthermore, the initially retrieved document list that serves as the feedback set may not manifest all query-related aspects [9], and therefore, the new query model might not fully represent the original information need. Indeed, it is known that while on average, pseudo-feedback-based query expansion methods improve retrieval effectiveness over that of retrieval using the original query alone, there are numerous queries for which this is not true; that is, retrieval based on the original query alone yields substantially better retrieval effectiveness than that of using an expanded query form [20, 17]. This is known as the robustness problem of pseudo-feedback-based query expansion retrieval methods.

The goal of the work presented here is to improve the robustness of pseudo-feedback-based query expansion retrieval via the potential prevention (amelioration) of query drift — the shift in "intention" from the original query to its expanded form [57]. There is a strong trade-off between retrieval effectiveness (performance) and its robustness. This trade-off is due to the fact that effectiveness is often measured as an average over queries, and as such, a large improvement for a few queries can significantly affect the average result. Performance robustness (or lack thereof), on the other hand, is determined by the percentage of queries for which the retrieval effectiveness of using the expanded form is inferior to that of retrieval based only on the original query. In this work we aim to improve the robustness of pseudo-feedback-based query expansion retrieval, while keeping its average effectiveness high.

Our first approach for preventing query drift is based on fusion; specifically, we use fusion of document lists [27] that are retrieved by using the original query and its expanded form. We "reward" documents that are highly ranked both in response to the expanded form and to the original query. We hypothesize that documents that are highly ranked in both retrieved lists are good candidates for being relevant since they constitute a "good match" to both forms of the presumed information need. A document ranked high by the initial retrieval can be assumed to have a high surface-level similarity to the original query (since the initial retrieval is often based on similarity between the query and the document), while query expansion
can add aspects that were not in the original query but may be relevant to the information need and may improve the retrieval. Hence, a document that is ranked high by both the initial retrieval and the retrieval that is based on the expanded form is assumed to be less likely to suffer from query drift. Another advantage of using fusion for the drift prevention task is its performance efficiency due to the minimal overhead it has, resulting from the fact that no additional retrieval is required (in addition to the initial and the expansion-based retrievals).

We experimented with a score-based fusion approach and a rank-based fusion approach. The former uses the retrieval scores assigned to the document by the retrieval method, while the latter uses the positioning of the document in the retrieved lists. Empirical evaluation of some fusion methods shows the promise of this direction. Specifically, through an array of experiments conducted over various TREC corpora [83], which are standard information retrieval benchmarks, we show that such a fusion-based approach can improve the robustness of pseudo-feedback-based query expansion methods, without substantially degrading their average performance. We also show that a score-based fusion approach yields better performance for this task than a rank-based fusion approach. We demonstrate the merits of our methods with respect to a state-of-the-art pseudo-feedback-based method that was designed to improve performance robustness.

The second approach that we examine is based on using re-ranking methods to improve the robustness of pseudo-feedback-based query expansion. While fusion-based approaches [27] rely on co-occurrence of documents in retrieved lists, the re-ranking methods re-order the (top) retrieval results of one retrieval method by the scores assigned by another. We show that the re-ranking approach can improve the robustness of pseudo-feedback-based query expansion methods, with minimal degradation of their average performance.

Finally, we note that most previous approaches for improving the robustness of pseudo-feedback-based retrieval use pre-retrieval query-anchoring — anchoring at the model level, which "emphasizes" the query terms when constructing the expanded form [90, 1]. In contrast, our approach is a post-retrieval query-anchoring paradigm, which performs query-anchoring via fusion or re-ranking of retrieval results. We show that most of our methods are effective when applied on expansion-based models, whether these perform pre-retrieval query-anchoring or not, where the robustness improvement for the latter is naturally larger than for the former. Hence, pre-retrieval and post-retrieval query-anchoring can be viewed as complementary.
2 Background

The retrieval models we use in this thesis utilize statistical language models (LM). This chapter presents the concept of LM and lays down the definitions and notations that will be used throughout this thesis.

2.1 LM - Introduction and Models

In order to rank documents we define p(d|q), the probability that document d is relevant to the query q. Using Bayes' rule we get that

p(d|q) = \frac{p(q|d)\,p(d)}{p(q)};

p(q) is document-independent. We assume that every d has the same prior probability of relevance; i.e., p(d) is assumed to be uniformly distributed over the corpus. Under these assumptions we score d by

p(d|q) \stackrel{rank}{=} p(q|d).

We use p_d(q) as an estimate for p(q|d) [62, 56, 76, 24, 91].

In this thesis we use the term statistical language model to refer to a probability distribution that models the generation of strings in a given language. Various language models were developed for a variety of language technology tasks, such as speech recognition, machine translation, document classification and routing, spelling correction, optical character recognition and handwriting recognition [69]. Ponte and Croft were the first to use LMs for the ad hoc information retrieval task [62]. Since then many new models have been proposed, among which are the query likelihood model and the model comparison approach [38].

The Query Likelihood Model is a widely used language model in IR [62, 56, 76, 24]. This model estimates the probability that the model induced from document d generates the terms in the query q (denoted p_d(q)), which is high when the query terms appear many times in the document.

Model Comparison is a method that creates language models from both the document and the query and compares these models in order to estimate the difference between them [38].
2.2 Language Model Estimation

There are various methods for estimating p_d(q). The unigram language model is based on the assumption that terms are independent of each other; this assumption holds for the occurrence of the terms q_i in the query q (q_i ∈ q), as well as for the terms in the document (a.k.a. the bag-of-terms representation). Under these assumptions we estimate the query likelihood by:

p_d(q) \stackrel{def}{=} \prod_{q_i \in q} p_d(q_i),    (1)

where p_d(q_i) represents the probability of the term q_i given document d.

Let tf(w \in x) denote the number of times the term w occurs in the text (or text collection) x. We use a maximum likelihood estimate (MLE) for a multinomial model with the unigram assumption. Specifically, we estimate the probability of generating the term w from a language model induced from text x as:

p^{MLE}_x(w) \stackrel{def}{=} \frac{tf(w \in x)}{\sum_{w'} tf(w' \in x)},    (2)

where \sum_{w'} tf(w' \in x) is the length of x.

2.3 Smoothing

One of the problems with using the MLE as defined above is that terms may appear very sparsely in documents or not appear at all, causing problems with the estimation of p_d(q) as defined in Equation (1); this is known as the zero probability problem. It is possible that some terms that are not part of the text at all may still be connected with the information need underlying the query, or may even have been used in the query itself. If we estimate these terms' probabilities as 0, following Equation (2) we get a strict conjunctive semantics: the document's LM will assign a query non-zero probability only if all of the query terms appear in the document. To avoid this problem we
smooth the MLE. In general, the following smoothing methods decrease the probabilities assigned by the LM to the words seen in the text and increase the probability of the unseen words, using the corpus language model [16].

Jelinek-Mercer Smoothing: The Jelinek-Mercer based probability assigned to a term [34], which will be denoted in this work as p^{JM[λ]}(·), uses a linear interpolation of the maximum likelihood estimate induced from the document and that induced from the corpus C. Specifically, we use a free parameter λ to control the influence of these models [91]:

p^{JM[\lambda]}_d(w) = (1 - \lambda)\, p^{MLE}_d(w) + \lambda\, p^{MLE}_C(w).

Bayesian Smoothing using Dirichlet Priors: Following Zhai and Lafferty [38] we can set

\lambda = \frac{\mu}{\mu + \sum_{w'} tf(w' \in x)},

where µ is a free parameter, and get the Bayesian smoothing assigned to a term and induced from document d, using Dirichlet priors, which will be denoted in this work as p^{Dir[\mu]}_x(·). This smoothing, unlike the Jelinek-Mercer method, depends on the text's length.

2.4 Measures

Kullback-Leibler (KL) divergence: KL divergence [35], which is named after Kullback and Leibler, is a non-commutative measure of the difference between two probability distributions A and B:

KL(A \,\|\, B) = \sum_i A(i) \log \frac{A(i)}{B(i)},

where i is an event. Using this measure to estimate the difference between the query model and the document model can be stated as follows [38]:
KL(p_q(\cdot) \,\|\, p_d(\cdot)) = \sum_w p_q(w) \log \frac{p_q(w)}{p_d(w)},

where w is a term in the vocabulary. In practice we use the MLE for p_q(w). When using the MLE, the query likelihood model and the KL ranking model are rank equivalent [38].

Cross Entropy (CE): CE (like KL) is a non-commutative measure of the difference between two probability distributions A and B:

CE(A \,\|\, B) = -\sum_i A(i) \log B(i),

where i is an event with the probabilities A(i) and B(i). Using this measure to estimate the difference between the query model and the document model can be stated as follows:

CE(p_q(\cdot) \,\|\, p_d(\cdot)) = -\sum_w p_q(w) \log p_d(w),

where w is a term in the vocabulary. KL differs from CE only by the entropy of the first (left) distribution, which does not depend on the document. Hence, as shown by Lafferty and Zhai [38], CE is rank equivalent to the KL ranking model and to query likelihood:

KL(q \,\|\, d) \stackrel{rank}{=} CE(q \,\|\, d).
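To make these estimates concrete, the following minimal Python sketch (not part of the thesis; all function and variable names are illustrative, and the tiny floor used for terms unseen in the corpus is an assumption) builds MLE and Dirichlet-smoothed unigram models and scores a document by cross entropy; this is the cross-entropy retrieval score that is later denoted Score_init(d).

```python
import math
from collections import Counter

def mle(tokens):
    # p^MLE_x(w) = tf(w in x) / |x|
    counts = Counter(tokens)
    length = sum(counts.values())
    return {w: c / length for w, c in counts.items()}

def dirichlet_prob(w, doc_counts, doc_len, corpus_mle, mu=1000):
    # Dirichlet-smoothed p^{Dir[mu]}_d(w); unseen terms fall back to the corpus model.
    return (doc_counts.get(w, 0) + mu * corpus_mle.get(w, 1e-12)) / (doc_len + mu)

def init_score(query_tokens, doc_tokens, corpus_mle, mu=1000):
    # exp(-CE(p^{Dir[0]}_q || p^{Dir[mu]}_d)); the query model is simply its MLE.
    q_model = mle(query_tokens)
    d_counts = Counter(doc_tokens)
    d_len = sum(d_counts.values())
    cross_entropy = -sum(p * math.log(dirichlet_prob(w, d_counts, d_len, corpus_mle, mu))
                         for w, p in q_model.items())
    return math.exp(-cross_entropy)
```

Under these definitions, ranking documents by init_score is equivalent to ranking by query likelihood, since the query-entropy term that KL adds does not depend on the document.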
3 Query Expansion

Search engine users tend to use (very) short queries to describe their information needs [77]. This reality poses significant challenges to retrieval systems. One of the problems with short queries is that a single query can reflect different potential information needs (a.k.a. the polysemy problem). For example, the query "Jaguar" can be interpreted as "a nice car" or as "a wild animal". Another challenge in handling short queries is the vocabulary mismatch problem [13, 87]: some (or even all) query terms might not appear in relevant documents. For example, the text span "nature pictures" is relevant to the query "view photos", although it does not contain query terms.

3.1 Pseudo-Feedback

One approach for addressing the vocabulary mismatch problem is asking the user to provide "feedback" on documents returned from an initial search, for example, by asking the user to mark the documents that he or she considers relevant. This "feedback" is then used to improve the retrieval performance. This can be done, for example, by using additional terms (from the relevant documents) to expand the query with [70]. However, in most cases user feedback is unavailable. This is the reason why researchers proposed taking a pseudo-feedback (a.k.a. blind feedback) approach to query expansion, as is done in this work. In the pseudo-feedback approach the documents most highly ranked by an initial search are usually treated as relevant, and information induced from these documents is used for modifying (expanding) the original query [13, 87].

We compute an initial retrieval score of document d in response to query q by using cross entropy:

Score_{init}(d) \stackrel{def}{=} \exp\left(-CE\left(p^{Dir[0]}_q(\cdot) \,\big\|\, p^{Dir[\mu]}_d(\cdot)\right)\right).
We use Dinit to denote the initial list — the set of documents d with the highest Score_init(d) — and n to denote the size of Dinit. We compute an expanded retrieval score of document d in response to query q using a pseudo-feedback-based query expansion model (see Section 3.2). We use PF(Dinit) to denote the expansion-based list — the set of documents d with the highest Score_pf(d). Score_init(d) and Score_pf(d) are set to 0 if d is not in the respective list.

3.2 Query Expansion Models

The Relevance Model - RM1: The relevance model is a well-known query expansion model [41, 40]. The relevance model paradigm assumes that there exists a (language) model RM1 that generates terms both in the query and in the relevant documents. The basic relevance model is defined by:

p_{RM1}(w) \stackrel{def}{=} \sum_{d \in D_{init}} p_d(w)\, p(d|q).    (3)

This can be interpreted as a weighted average of p_d(w) with weights given by p(d|q) — the probability that document d is relevant to the query q — which is interpreted here as a normalized likelihood obtained by using the query likelihood (see Equation 1) and Bayes' rule.

Using the LM definitions from Section 2.1, we can estimate RM1. Let {q_i} be the set of query terms. The RM1 model is then defined by

p_{RM1}(w; n, \alpha) \stackrel{def}{=} \sum_{d \in D_{init}} p^{JM[\alpha]}_d(w)\, \frac{\prod_i p^{JM[\alpha]}_d(q_i)}{\sum_{d_j \in D_{init}} \prod_i p^{JM[\alpha]}_{d_j}(q_i)}.

In practice, the relevance model is clipped by setting p_{RM1}(w; n, α) to 0 for all but the β terms with the highest p_{RM1}(w; n, α) to begin with. This is done in order to improve retrieval performance and computation speed [18, 55, 23]; further normalization is then performed to yield a valid probability
distribution, which we denote by p_{RM1}(·; n, α, β). We score document d with respect to the relevance model p_{RM1}(·; n, α, β) by

Score_{RM1}(d) \stackrel{def}{=} \exp\left(-CE\left(p^{Dir[\mu]}_d(\cdot) \,\big\|\, p_{RM1}(\cdot; n, \alpha, \beta)\right)\right).

The Interpolated Relevance Model - RM3: The relevance model RM1 as presented above may suffer from query drift — a shift of intention with respect to the original query [57]. In order to further emphasize the original query, Abdul-Jaleel et al. [1, 23] suggested query-anchoring at the model level (i.e., pre-retrieval query-anchoring). Specifically, they use a linear interpolation between the relevance model and the original query model:

p_{RM3}(w) \stackrel{def}{=} \lambda\, p^{MLE}_q(w) + (1 - \lambda)\, p_{RM1}(w).    (4)

Using the LM definitions from Section 2.1 we get:

p_{RM3}(w; n, \alpha, \beta, \lambda) \stackrel{def}{=} \lambda\, p^{Dir[0]}_q(w) + (1 - \lambda) \sum_{d \in D_{init}} p^{JM[\alpha]}_d(w)\, \frac{\prod_i p^{JM[\alpha]}_d(q_i)}{\sum_{d_j \in D_{init}} \prod_i p^{JM[\alpha]}_{d_j}(q_i)}.

We can then score document d by:

Score_{RM3}(d) \stackrel{def}{=} \exp\left(-CE\left(p^{Dir[\mu]}_d(\cdot) \,\big\|\, p_{RM3}(\cdot; n, \alpha, \beta, \lambda)\right)\right).

Rocchio: If we set p(d|q) to a uniform distribution we get the following model, which is reminiscent of Rocchio's relevance feedback model in the vector space [68]:

p_{Rocchio3}(w; n, \beta) \stackrel{def}{=} \lambda\, p^{MLE}_q(w) + (1 - \lambda) \cdot \frac{1}{n} \sum_{d \in D_{init}} p_d(w),    (5)
where n is the size of Dinit. Note that due to the uniform distribution assumption, all documents in Dinit are equal contributors to the constructed model. When λ = 0, Rocchio's algorithm discards the anchoring with the original query model and relies only on the expansion-based component:

p_{Rocchio1}(w; n, \beta) \stackrel{def}{=} \frac{1}{n} \sum_{d \in D_{init}} p_d(w).    (6)

We can then score document d by:

Score_{Rocchio}(d) \stackrel{def}{=} \exp\left(-CE\left(p^{Dir[\mu]}_d(\cdot) \,\big\|\, p_{Rocchio}(\cdot; n, \beta)\right)\right).

Comparison of Query Expansion Models: As can be seen in the table below, there are two important properties by which the above-described query expansion models can be characterized:

• Does the model weigh documents with respect to p(d|q)?
• Does the model perform an interpolation with the original query model, i.e., does the model perform pre-retrieval query-anchoring?

Model    | Definition                                       | Weighs by p(d|q) | Interpolation with the original query
RM1      | Σ_{d∈Dinit} p_d(w) p(d|q)                        | Yes              | No
RM3      | λ p^MLE_q(w) + (1−λ) Σ_{d∈Dinit} p_d(w) p(d|q)   | Yes              | Yes
Rocchio1 | (1/n) Σ_{d∈Dinit} p_d(w)                         | No               | No
Rocchio3 | λ p^MLE_q(w) + (1−λ) · (1/n) Σ_{d∈Dinit} p_d(w)  | No               | Yes
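The differences summarized in the table can also be sketched in code. The following Python snippet is illustrative only (it is not from the thesis; the helper names, dictionary-based language models, and the omission of clipping to the β highest-probability terms are all assumptions). Uniform document weights yield the Rocchio variants, the normalized query likelihoods p(d|q) yield RM1/RM3, and a positive anchoring weight adds the interpolation with the original query model.

```python
def expansion_model(d_init, doc_weights, doc_lms, query_mle, anchor_lambda=0.0):
    """Build RM1/RM3/Rocchio1/Rocchio3 as a {term: probability} dictionary.

    doc_weights: {doc_id: p(d|q)} for RM1/RM3, or {doc_id: 1/n} for Rocchio1/Rocchio3
    doc_lms:     {doc_id: {term: p_d(term)}} (smoothed document language models)
    query_mle:   {term: p^MLE_q(term)}
    """
    centroid = {}
    for d in d_init:
        for w, p in doc_lms[d].items():
            centroid[w] = centroid.get(w, 0.0) + doc_weights[d] * p
    if anchor_lambda == 0.0:            # RM1 / Rocchio1: no query-anchoring
        return centroid
    anchored = {w: (1.0 - anchor_lambda) * p for w, p in centroid.items()}
    for w, p in query_mle.items():      # RM3 / Rocchio3: interpolate with the query model
        anchored[w] = anchored.get(w, 0.0) + anchor_lambda * p
    return anchored

# Illustrative usage:
# rocchio1 = expansion_model(d_init, {d: 1 / len(d_init) for d in d_init}, doc_lms, query_mle)
# rm3      = expansion_model(d_init, posterior_p_d_given_q, doc_lms, query_mle, anchor_lambda=0.3)
```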
We will later show that these properties have a major impact on the overall performance.

3.3 The Performance Robustness Problem

Naturally, there are several inherent problems in pseudo-feedback-based query expansion. The first is that not all (and in many cases, only very few) of the documents "treated" as relevant (i.e., those in Dinit) are indeed relevant. This leads to the newly formed "query model" not necessarily reflecting the information need underlying the original query. Furthermore, the initially retrieved document list serving as the feedback set Dinit may not reveal all query-related aspects [9], and therefore, the new query model might not fully represent the original information need. Indeed, while on average pseudo-feedback-based query expansion methods improve retrieval effectiveness over that of retrieval using the original query, there are numerous queries for which this is not true. For these queries, retrieval based on the original query yields substantially better retrieval effectiveness than that yielded by using an expanded query form [20, 17], as can be seen in Figure 1. This is known as the performance robustness problem of pseudo-feedback-based query expansion retrieval methods.

The performance robustness problem is mainly caused by query drift — the shift in "intention" from the original query to its expanded form [57]. As can be seen in Figure 1, RM1 and RM3 exhibit the performance robustness problem to different degrees. Not only does a larger number of queries suffer from performance degradation under RM1 when the expanded form is used, but the damage to these queries is also much more severe. This difference is mainly due to the fact that RM3 uses query-anchoring at the model level (see Section 3.2). This difference influences the results in the experimental section and will be further discussed there (see Chapter 6).
[Figure 1: The difference between the AP — average precision, an evaluation measurement of the general quality of ranking — per query of Dinit (initial ranking) and that of the expansion-based models (RM1 or RM3) over the ROBUST corpus (queries 301-350). Two panels plot the per-query AP difference, one for RM1 ("RM1 query drift") and one for RM3 ("RM3 query drift").]
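The per-query diagnostic plotted in Figure 1 can be computed directly from two runs and the relevance judgments. The short Python sketch below is illustrative only (the binary-relevance average_precision helper and the dictionary-of-runs layout are assumptions); it reports, for each query, AP(expanded) minus AP(initial), where negative values flag queries hurt by expansion.

```python
def average_precision(ranked_docs, relevant):
    # Uninterpolated AP of a ranked list, given the set of relevant document ids.
    hits, precision_sum = 0, 0.0
    for i, d in enumerate(ranked_docs, start=1):
        if d in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

def per_query_ap_difference(initial_runs, expanded_runs, qrels):
    # {query_id: AP(expanded) - AP(initial)}; negative entries indicate query-drift damage.
    return {q: average_precision(expanded_runs[q], qrels[q])
               - average_precision(initial_runs[q], qrels[q])
            for q in qrels}
```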
4 Query Drift Prevention

4.1 Improving Robustness Using Fusion

Retrieval performance can be significantly improved using data fusion, a combination of retrieval methods, query representations or document representations [53, 21, 27, 43, 80, 42].

The goal of this work is to alleviate the performance robustness problem of pseudo-feedback-based query expansion retrieval, which was described and exemplified in Section 3.3, via the potential prevention (amelioration) of query drift. Using data fusion can potentially prevent query drift.

Our motivation for using fusion comes from a few key assumptions. We hypothesize that documents that are highly ranked in both retrieved lists (the initial list and the expansion-based list) are good candidates for being relevant since they constitute a "good match" to both forms of the presumed information need. A document that is ranked high by the initial retrieval (Dinit) can be assumed to have a high surface-level similarity to the original query, while performing query expansion (PF(Dinit)) can add aspects that were not in the original query but may be relevant to the information need and may improve the retrieval in certain cases (e.g., short queries, the vocabulary mismatch problem, etc.; see Chapter 3). Using this expansion, however, may cause a shift in intention away from the original query — query drift. A document that is ranked high by both the initial retrieval and the expansion is assumed (potentially) not to suffer from query drift.

Another key assumption is based on the observation that documents that are retrieved using a variety of query representations have a high chance of being relevant [7]. In our case the original query and the expansion-based query are the different query representations.

A major advantage of using fusion for the query drift prevention task is its performance efficiency. This performance efficiency is due to the minimal
overhead it entails: no additional retrieval executions are needed beyond the initial retrieval and the expansion-based one, which are an integral part of any expansion-based method.

Despite the above advantages, which show promise for this direction, using fusion is not without its disadvantages. Following the fusion concept, documents retrieved only in the expansion-based list are downgraded, which can contradict the essence of using query expansion to begin with.

4.2 Score-Based Fusion Methods

Similarity measures (e.g., cosine similarity) are widely used in the literature for combining retrieval scores, which act as multiple sources of evidence regarding the relationships between the query and the document [67, 27, 30, 60].

4.2.1 Symmetric Fusion Methods

The following retrieval methods essentially operate on Dinit ∪ PF(Dinit). These methods perform symmetric fusion, where a document's appearance in the final retrieved list is due to its appearance in either or both lists (i.e., the initial retrieved list and the expansion-based one).

CombMNZ: The CombMNZ method, which was introduced by Fox and Shaw in the data fusion framework [27, 43], rewards documents that are ranked high in both Dinit and PF(Dinit):

Score_{CombMNZ}(d) \stackrel{def}{=} \left(\delta[d \in D_{init}] + \delta[d \in PF(D_{init})]\right) \cdot \left(\frac{Score_{init}(d)}{\sum_{d' \in D_{init}} Score_{init}(d')} + \frac{Score_{pf}(d)}{\sum_{d' \in PF(D_{init})} Score_{pf}(d')}\right).

Here, for a statement s, δ[s] = 1 if s is true and 0 otherwise. Note that a document that belongs to only one of the two lists (Dinit and PF(Dinit)) can still be among the highest ranked documents.
The Interpolation algorithm: The Interpolation algorithm, which was used for preventing query drift in cluster-based retrieval (e.g., [36]), differentially weights the initial score and the pseudo-feedback-based score using an interpolation parameter λ:

Score_{Interpolation}(d) \stackrel{def}{=} \lambda\, \delta[d \in D_{init}]\, \frac{Score_{init}(d)}{\sum_{d' \in D_{init}} Score_{init}(d')} + (1 - \lambda)\, \delta[d \in PF(D_{init})]\, \frac{Score_{pf}(d)}{\sum_{d' \in PF(D_{init})} Score_{pf}(d')}.

4.2.2 Re-ordering Methods

Re-ordering methods, unlike the fusion methods described above, are asymmetric. Three methods for re-ordering top documents will be used in this work.

Rerank: The Rerank method (e.g., Kurland and Lee [36]) re-orders the (top) pseudo-feedback-based retrieval results by the initial scores of documents, in order to prevent query drift that may be caused by the expansion. The method's asymmetry can be seen in the fact that only the (top) documents in the pseudo-feedback-based retrieval results will be re-ranked and will appear in the method's final result list. We score document d with respect to the Rerank model by:

Score_{Rerank}(d) \stackrel{def}{=} \delta[d \in PF(D_{init})]\, Score_{init}(d).

Rev_rerank: The Rev_rerank method re-orders the (top) initial retrieval results by the pseudo-feedback-based scores of documents, in order to take into account the additional aspects that come with the expansion, while minimizing query drift. Unlike the other methods, this method is not a drift-prevention method; it is a mirror method of the Rerank method.

Score_{Rev\_rerank}(d) \stackrel{def}{=} \delta[d \in D_{init}]\, Score_{pf}(d).
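A compact Python sketch of the score-based methods defined so far follows (illustrative only; init_scores and pf_scores are assumed to be dictionaries mapping document ids to non-negative Score_init and Score_pf values, with documents outside a list simply absent).

```python
def _normalize(scores):
    # Sum-normalize a {doc_id: score} dictionary so that the two lists are comparable.
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()} if total > 0 else dict(scores)

def comb_mnz(init_scores, pf_scores):
    # CombMNZ: sum of normalized scores, multiplied by the number of lists containing d.
    init_n, pf_n = _normalize(init_scores), _normalize(pf_scores)
    return {d: ((d in init_n) + (d in pf_n)) * (init_n.get(d, 0.0) + pf_n.get(d, 0.0))
            for d in set(init_n) | set(pf_n)}

def interpolation(init_scores, pf_scores, lam):
    # Interpolation: lam * normalized initial score + (1 - lam) * normalized PF score.
    init_n, pf_n = _normalize(init_scores), _normalize(pf_scores)
    return {d: lam * init_n.get(d, 0.0) + (1.0 - lam) * pf_n.get(d, 0.0)
            for d in set(init_n) | set(pf_n)}

def rerank(pf_scores, init_scores):
    # Rerank: keep only the PF-based results and re-order them by the initial scores.
    return {d: init_scores.get(d, 0.0) for d in pf_scores}
```

Rev_rerank is the mirror call, rerank(init_scores, pf_scores), with the roles of the two score dictionaries exchanged.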
Centroid_rerank: The Centroid_rerank method re-orders the (top) pseudo-feedback-based retrieval results, created using RM1, by the centroid it creates from the initial documents using the Rocchio1 model:

Score_{Centroid\_rerank}(d) \stackrel{def}{=} \delta[d \in PF(D_{init})]\, Score_{Rocchio1}(d),

where Score_{Rocchio1}(d) is computed with the non-weighted centroid p_{Rocchio1}(w; n, β) = \frac{1}{n} \sum_{d' \in D_{init}} p_{d'}(w) and n is the size of Dinit. Both RM1 and Rocchio1 can be viewed as methods that create a centroid; the main difference between these centroids is that RM1 uses a weighted centroid (the weights are p(d|q), see Equation (3)), while Rocchio1 uses a non-weighted centroid. The Centroid_rerank method utilizes the cluster hypothesis (see Section 5.1), which implies that relevant documents tend to have a higher similarity to each other than to non-relevant documents [33, 32]. The centroid is created from the initial documents, which are query-oriented, and is assumed to represent a relevant document. Hence, based on the hypothesis, we believe that re-ordering the documents using the centroid would potentially rank relevant documents high, since they will have a higher similarity to the centroid than non-relevant documents will.

4.3 Rank-Based Fusion Methods

Lee [43] claimed that using rank(ing) for data fusion can sometimes achieve better results than using retrieval scores, since the latter has the effect of weighting individual retrieved lists without considering their overall performance. Lee found that using rank gives better retrieval effectiveness than using similarity if the lists in the combination generate quite different curves of rank over similarity. Following Lee we define:

S_{rankSim}(rank(d)) \stackrel{def}{=} 1 - \frac{rank(d) - 1}{N_d},
where N_d is the number of retrieved documents, and rank(d) is the position of d in the retrieved list, i.e., rank : D → {1, ..., N_d}. In addition to the above definitions of CombMNZ and Interpolation using retrieval scores, we will test the effectiveness and robustness of these methods using this ranking-based score as well.

Borda Rank: The Borda count, first introduced in 1770 by Jean-Charles de Borda [22], is originally a voting method in which each voter gives a ranking of all possible alternatives; for each alternative, all the votes are added up, and the alternative with the highest number of votes wins the election. Based on this idea, scoring using the Borda rank is defined as:

S_{bordaRank}(d) \stackrel{def}{=} \sum_i \left(N_d - rank_i(d)\right),

where rank_i(d) ∈ {1, ..., N_d} is the rank of d in the i-th retrieved list.
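The rank-based variants can be sketched in the same style (illustrative only; each ranked list is assumed to be a Python list of document ids ordered from best to worst, with N_d taken as the list length).

```python
def rank_sim_scores(ranked_list):
    # S_rankSim: 1 - (rank(d) - 1) / N_d, so the top-ranked document gets a score of 1.
    n = len(ranked_list)
    return {d: 1.0 - i / n for i, d in enumerate(ranked_list)}

def borda_scores(ranked_lists):
    # Borda count: each list awards N_d - rank_i(d) points to document d.
    scores = {}
    for ranked in ranked_lists:
        n = len(ranked)
        for i, d in enumerate(ranked, start=1):
            scores[d] = scores.get(d, 0.0) + (n - i)
    return scores
```

Feeding the rank_sim_scores of the initial and expansion-based lists into the comb_mnz and interpolation sketches above gives the rank-based counterparts examined in Section 6.2.2.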
5 Related Work

5.1 Fusion

Retrieval performance can be significantly improved using data fusion, i.e., combining retrieval methods and/or query representations and/or document representations [25, 27, 43, 42, 80].

One common explanation for this improvement states that the search process is complex, with many degrees of uncertainty, and therefore any individual search result (even a good one) covers only part of the potential result space of the information need. Following this assumption, studies show that different retrieval models, which achieve similar performance, may retrieve different sets of documents in response to the same queries [30]. Using a variety of methods (results) will utilize different aspects of the search space and hence will return more relevant results [67, 60, 46]. Moreover, the more sources of evidence are available regarding the relationships between a query and a document, the more accurate the judgment of the probability of relevance of a document to a query will be [67].

It was found that there is a larger overlap among relevant documents than among non-relevant documents [43, 5, 60]. Namely, relevant documents tend to be retrieved by different methods more than non-relevant documents do, and hence the odds of a document being judged relevant may be proportional to the number of times it appears in the retrieved sets [73, 43]. The CombMNZ method (see Section 4.2.1) ranks co-occurring documents high, in line with the observations above. However, it has been shown that in some cases the lists to be fused contain different relevant documents, and hence in such cases fusion methods may not be as effective [21, 28, 75, 6].

While the CombMNZ method rewards co-occurring documents, following the observations above, our Centroid_rerank method focuses on similarities between documents, adopting the cluster hypothesis (for details see Section 4.2.2). The cluster hypothesis implies that relevant documents tend to have a higher similarity to each other than to non-relevant documents [33, 32]. Many retrieval methods have adopted this hypothesis, such as cluster-based retrieval models [33, 19, 36, 50], cluster-based re-ranking [46, 45] and cluster-based pseudo-relevance-feedback [51, 37, 44].

Data fusion based on query representations is one line of research in the data fusion field. The query represents the searcher's information need and tends to be short (an incomplete representation of this need). Hence, performing data fusion based on query representations may capture more pieces of evidence about the true (whole) information need, and improve retrieval performance [7, 64]. The work presented in this thesis essentially uses fusion of query representations. Specifically, we use fusion based on the original query and its expanded form (see Chapter 4).

Studies (e.g., [7, 64]) show that a combination of query representations can significantly improve retrieval performance, and that the overall performance depends on the individual queries' performance. It was also found that bad query representations (i.e., ones individually achieving low performance), when combined with better query representations, could hurt the overall retrieval performance [7].

5.2 Improving the Performance Robustness of Pseudo-Feedback-Based Retrieval

Approaches for improving the robustness of pseudo-feedback-based methods (see Section 3.3) have mainly been based on selecting (and weighting) documents from the initial search upon which pseudo-feedback is performed [8, 58, 48, 71, 79, 17], and on selecting and weighting terms from these documents [57, 63, 15, 54, 8, 14] for defining a new query model. Such approaches can potentially help to improve the performance of our methods, which utilize information from a document list retrieved in response to an expanded form of the query.
It has been shown [29, 58, 84] that given a specific retrieval environment (e.g., retrieval method, query representation, etc.), there exists an optimal number of documents to be used for expansion (i.e., adding documents will hurt performance, as will subtracting documents). However, this number varies with the retrieval environment. No explicit relationship was found between query features (e.g., the query length) or corpus features (e.g., the number of relevant documents) and the optimal number of documents for feedback.

In contrast to the line of research that relies on the top (consecutive) documents from the initial list as the basis for performing expansion (e.g., [13, 87]), other methods use only a (non-consecutive) part of the (top) initial list. In this approach some documents in the initially ranked list may be skipped. This approach is based on the aim of detecting the relevant documents and skipping the non-relevant ones, and on the phenomenon that some truly relevant documents may hurt performance when used for relevance feedback (a.k.a. "poison pills") [29, 58, 84]. One suggested approach for choosing only part of the initial list is based on clustering. Examples of this approach are removing singleton document clusters from the initially ranked lists [51], and using the clusters that best match the query for the expansion [11]. Cluster-based methods for pseudo-relevance-feedback were also proposed for re-sampling documents [44] (which we will compare our methods to; see Section 6.2.1.4), for term expansion [12, 89, 52, 10] and for iterative pseudo-query processing using cluster-based language models induced from both documents and clusters [37].

The typical source of terms for query expansion is the set of all/selected terms from the documents that appear in the initially retrieved list. Different methods aim to select candidate terms for query expansion by using additional information, such as the results of previous similar queries [26], information induced from passages [31, 87, 88] or from document summaries [39], the term distribution in the feedback documents [78, 57, 74] or the comparison between the term distribution in the feedback documents and that in the whole document collection [65, 66, 15]. In contrast to the approaches proposed above, a recent study [14] has shown that helpful expansion terms cannot be distinguished from harmful ones merely based on those terms' distributions.

Another line of research focuses on predicting whether a given expanded form of the query will be more effective for retrieval than the original query [3, 20], or on which expansion form will perform best out of a set of candidates [85]. One such prediction method uses the overlap between the documents in the initial list and the expansion-based list [20]. This method and some of our methods use the same assumption: documents appearing in both the initial list and the expansion-based list have a strong potential to be relevant to the query, unlike documents that appear only in the expansion-based list, which may result from query drift — the shift in "intention" from the original query to its expanded form [57] (for more details see Section 3.3). Cronen-Townsend et al. [20] and Winaver et al. [85] tackle the query-drift problem from two related angles. Specifically, minimizing query drift serves as a principle for setting a procedure by which one decides whether to employ query expansion [20], or which expansion form to choose [85]. In contrast to these approaches, which explicitly quantify query drift by measuring the distance between retrieved lists, our approach — which is also based on minimizing query drift — uses fusion for implicitly ameliorating the drift. Furthermore, the task we tackle here is improving the performance of a specific expansion form via drift minimization, rather than selecting such a form from a set of candidates [85] or deciding whether to employ this form in the first place [20]. Our approach is also supported by a recent study [59] which states that fusion achieves better performance than selecting a single expansion-based form.

Another approach that aims to tackle the query-drift problem suggests integrating multiple expansion models [59]. This is done by using estimates
of their faithfulness to the presumed information need (similar to the approach used for selecting a single relevance model [85]). Resembling this approach, we address the uncertainty with respect to the information need by using multiple query representations. In contrast to this approach, which integrates multiple expansion models using estimates of their faithfulness, we integrate in our fusion methods only one expansion model and the initial list using simple (or no) weights, which keeps the methods' efficiency high by maintaining a minimal overhead.

Another attempt to improve the robustness of pseudo-feedback-based query expansion [61] proposes re-ranking documents from the expansion-based list using the initial list, similar to our Rerank method (see Section 4.2.2), with the exception that the re-ranking is based on similarity between documents in the lists. This follows the cluster hypothesis and is conceptually similar to our Centroid_rerank method (see Section 4.2.2). Another method that proposes re-ranking based on both similarity and co-occurrence of documents in the lists [61] shows the merits of using co-occurrence in addition to similarity, and supports one of the main ideas proposed in our work: that the co-occurrence of documents in the initial and the expansion-based lists is important for the success of the robustness improvement of pseudo-feedback-based query expansion.

While we present in this work a post-retrieval query-anchoring approach, which performs query-anchoring via fusion of the lists retrieved using the original query and its expanded form, some previous work suggests anchoring the creation of the new query model to the original query (pre-anchoring) using two main methods: interpolation and differential weighting. Zhai and Lafferty [90] and Abdul-Jaleel et al. [1] use query-anchoring at the model level. Specifically, the idea is to perform a linear interpolation of the expansion model with that of the original query, in order to emphasize the information in the original query. Another approach suggests bringing the original query back into the relevance model by treating it as a short, special document, in addition to a number of the (top) initially ranked documents [47].

Most of the methods introduced in this chapter can be viewed as complementary to ours, since they aim to improve the quality of the list retrieved by an expanded form of the query; such a list is an important "ingredient" in our methods, and improving its "quality" may contribute to the overall effectiveness of our approach.
6 Experiments

6.1 Experimental Setup

6.1.1 Corpora

We use the TREC corpora from Table 1 for experiments. Topics' titles serve as queries; only queries with at least one relevant document are considered.

Table 1: TREC corpora used for experiments

Corpus   | Queries          | Disks
TREC1-3  | 51-200           | 1-3
ROBUST   | 301-450, 601-700 | 4, 5
WSJ      | 151-200          | 1-2
SJMN     | 51-150           | 3
AP       | 51-150           | 1-3

We tokenize the data, apply Porter stemming, and remove INQUERY stopwords [2], via the Lemur toolkit (www.lemurproject.org), which is also used for retrieval.

6.1.2 Evaluation Metrics

To evaluate retrieval performance, we use two widely accepted metrics:

• Mean average precision at 1000 (MAP), which evaluates the general quality of ranking methods [82].
• Precision of the top 5 documents (P@5), which measures the ability of retrieval methods to position relevant documents at the highest ranks of the retrieved results.
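Both metrics can be computed directly from a run and the TREC relevance judgments. A minimal, illustrative Python sketch (reusing the average_precision helper sketched after Figure 1; the cutoff of 1000 matches the MAP definition above) is:

```python
def precision_at_k(ranked_docs, relevant, k=5):
    # P@k: fraction of the top-k retrieved documents that are relevant.
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

def mean_average_precision(runs, qrels, cutoff=1000):
    # MAP at the given cutoff, averaged over all queries with relevance judgments.
    aps = [average_precision(runs[q][:cutoff], qrels[q]) for q in qrels]
    return sum(aps) / len(aps)
```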
We determine statistically significant differences in performance using Wilcoxon's two-tailed test at a confidence level of 95%.

In addition, we present for each retrieval method the percentage of queries (in each benchmark) for which the performance (MAP or P@5) is worse than that of the initial ranking from which Dinit is derived. (We denote this percentage by "Init".) The lower "Init" is, the more robust we consider the method at hand.

6.1.3 Parameter Tuning

We set Dinit to the 1000 documents in the corpus that yield the highest initial ranking score, Score_init(d). Hereafter, we set µ — the document language model smoothing parameter (refer back to Section 2.3) — to 1000, following previous recommendations [91]. To create a set PF(Dinit) of 1000 documents, the values of the expansion models' free parameters (refer back to Section 3) were selected from the following sets [83]: n ∈ {25, 50, 75, 100, 500, 1000} and β ∈ {25, 50, 75, 100, 250, 500, 1000}. The α parameter used in RM1 and RM3 is chosen from {0, 0.1, 0.2, 0.3}. The parameter λ that controls query-anchoring in the RM3 algorithm is chosen from {0.1, . . . , 0.9}.

As the performance of pseudo-feedback-based query expansion methods is considerably affected by the values of free parameters, and as there is a performance-robustness trade-off, we have taken the following design decision: we apply our methods upon 4 different optimized settings of these parameters, so as to study the effectiveness of our approach:

• Best_MAP: the values of the free parameters of the expansion-based models are selected so as to optimize the MAP performance of PF(Dinit).
• Best_P@5: the values of the free parameters of the expansion-based models are selected so as to optimize the P@5 performance of PF(Dinit).
• Robust_MAP: the values of the free parameters of the expansion-based models are selected so as to optimize the MAP robustness of PF(Dinit) (i.e., minimize Init of MAP).
• Robust_P@5: the values of the free parameters of the expansion-based models are selected so as to optimize the P@5 robustness of PF(Dinit) (i.e., minimize Init of P@5).

The only free parameter used in our methods is the parameter that controls query-anchoring in the Interpolation model; it is chosen from {0.1, . . . , 0.9} to optimize each of the 4 settings (i.e., when applied on Best_MAP and Best_P@5 it is chosen to optimize the MAP/P@5 performance, respectively, while when applied on Robust_MAP and Robust_P@5 it is chosen to optimize the MAP/P@5 robustness, respectively).

6.2 Experimental Results

6.2.1 Score-Based Fusion Methods

6.2.1.1 Best_MAP and Best_P@5 Settings

Relevance models: Table 2 presents the performance numbers of score-based fusion methods applied on the Best_MAP and the Best_P@5 settings of the relevance models RM1 and RM3.

Analyzing the MAP results leads to a number of key observations. One observation is that all fusion-based methods yield MAP performance that is always better — to a statistically significant degree — than that of the initial ranking that utilizes only the original query. As can be seen, the Interpolation algorithm yields the best MAP performance among the fusion-based methods, but it incorporates a free parameter while the other two methods (CombMNZ and Rerank) do not.

Exploring the trade-off between MAP and robustness, one can see that in general the better the robustness, the lower the MAP. As can also be
seen in Table 2 and in Figure 2, all three fusion-based methods are more robust (refer to the Init measure) than the model they incorporate (RM1 or RM3). In addition, the methods based on RM3 are generally more robust than their corresponding methods based on RM1, but have lower MAP. This calls for further research, as these methods are employed upon the RM3 model, which has better MAP than RM1. One explanation for this observation can be that the methods based on RM3 perform much stronger query-anchoring than the methods based on RM1, and hence they have better robustness and lower MAP than the corresponding methods based on RM1. In terms of MAP, Rerank [RM3] is the most robust among the tested methods in Table 2.

Another observation that we make based on Table 2 is that CombMNZ [RM1], which performs query-anchoring via fusion of retrieved results, and Rerank [RM1] are more robust than RM3, which performs query-anchoring at the relevance model level. RM3, however, is more effective in terms of MAP, to a statistically significant degree.

In terms of P@5, as can be seen in Table 2, the phenomenon observed for Best_MAP, namely that all three fusion-based methods are more robust than the model they incorporate (RM1 or RM3), holds for Best_P@5 as well. For P@5, the very same trade-off between performance (in this case, P@5) and robustness is also observed: in general, the better the robustness, the lower the P@5. As can also be seen, the Interpolation algorithm applied on RM1 achieves a good trade-off, where a relatively large improvement in robustness (compared to RM1) comes at a relatively small price with respect to P@5 performance. Another interesting result is that for both RM1 and RM3, the use of Rerank yields the most robust score (i.e., has the lowest Init) out of all three fusion-based methods and the incorporated model itself (RM1 or RM3).
Table 2: Score-based fusion methods applied for Best_MAP and Best_P@5 on RM1 and RM3
(For each corpus, the sub-columns are MAP, Init, P@5, Init.)

Method               | TREC1-3                 | ROBUST                  | WSJ                     | SJMN                    | AP
Init. Rank           | 14.9   -     37.9   -   | 25.0   -     47.8   -   | 27.8   -     51.2   -   | 18.9   -     33.0   -   | 22.2   -     45.5   -
RM1                  | 19.2i  38.7  44.4i  25.3 | 27.5i  45.4  49.2   23.7 | 33.2i  34.0  56.4   22.0 | 24.1i  37.0  39.6i  15.0 | 28.5i  38.4  52.1i  21.2
Interpolation [RM1]  | 19.5i  31.3  41.1   15.3 | 29.3ir 34.9  49.5   12.4 | 34.0i  26.0  54.4i  12.0 | 23.6i  27.0  36.6ir 11.0 | 28.6i  31.3  48.1   13.1
CombMNZ [RM1]        | 18.2i  24.0  38.9r  4.7  | 28.0i  28.5  48.5   6.0  | 31.1i  14.0  51.2   4.0  | 21.6i  20.0  35.0ir 3.0  | 26.9i  21.2  45.5r  3.0
Rerank [RM1]         | 17.5ir 27.3  38.9r  1.3  | 26.3i  30.9  48.2   2.8  | 29.8ir 22.0  51.2   4.0  | 20.4ir 16.0  33.6r  1.0  | 25.9ir 20.2  45.3r  1.0
RM3                  | 20.0i  28.7  44.8i  24.0 | 30.0i  28.1  50.6   14.5 | 34.8i  20.0  58i    14.0 | 24.6i  29.0  39.6i  15.0 | 29.1i  28.3  52.1i  17.2
Interpolation [RM3]  | 19.6ir 22.7  42.7i  20.7 | 29.3ir 27.7  49.2   12.4 | 33.8ir 22.0  56.4i  14.0 | 23.9ir 24.0  38.8i  13.0 | 28.7i  27.3  48.9r  15.2
CombMNZ [RM3]        | 17.9ir 16.7  39.7ir 6.0  | 27.1ir 19.3  47.8r  7.2  | 30.7ir 18.0  54.4   12.0 | 21.6ir 23.0  34.6   12.0 | 26.5ir 16.2  46.3r  4.0
Rerank [RM3]         | 16.9ir 22.7  38.4r  0.0  | 25.5ir 15.3  47.8   0.0  | 28.4ir 14.0  51.2r  0.0  | 19.9ir 11.0  33.6   1.0  | 25.1ir 12.1  45.5r  0.0

Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP and Best_P@5 settings of the relevance models RM1 and RM3, and the score-based fusion methods. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the incorporated model (RM1 or RM3), respectively.

Table 3: Score-based fusion methods applied for Best_MAP and Best_P@5 on Rocchio1 and Rocchio3
(For each corpus, the sub-columns are MAP, Init, P@5, Init.)

Method                   | TREC1-3                 | ROBUST                  | WSJ                     | SJMN                    | AP
Init. Rank               | 14.9   -     37.9   -   | 25.0   -     47.8   -   | 27.8   -     51.2   -   | 18.9   -     33.0   -   | 22.2   -     45.5   -
Rocchio1                 | 19.2i  40.7  43.1i  21.3 | 26.9   51.8  45.7   34.9 | 31.7i  42.0  55.6   22.0 | 23.2i  45.0  39.2   25.0 | 29.4i  32.3  52.3i  21.2
Interpolation [Rocchio1] | 19.4i  34.0  42.1   20.7 | 29.1ir 38.6  48.2   6.8  | 33.5ir 30.0  55.6   18.0 | 23.0i  32.0  38.0i  12.0 | 29.4i  21.2  51.1i  15.2
CombMNZ [Rocchio1]       | 18.2i  24.0  40.3   15.3 | 27.7i  26.5  48.2r  8.0  | 32.4i  22.0  52.0   14.0 | 21.6i  25.0  35.8i  1.0  | 27.6ir 18.2  48.1i  3.0
Rerank [Rocchio1]        | 17.4ir 30.0  38.5r  0.7  | 26.3i  32.1  48.0   2.4  | 29.7i  20.0  50.8   8.0  | 20.1i  27.0  34.0   1.0  | 26.2ir 17.2  45.9r  0.0
Rocchio3                 | 19.9i  33.3  43.1i  24.7 | 29.2i  36.5  49.3   10.0 | 34.0i  26.0  59.6   16.0 | 24.3i  31.0  40.4i  17.0 | 29.8i  23.2  52.9i  20.2
Interpolation [Rocchio3] | 19.5i  30.0  42.4   22.7 | 28.5ir 31.3  48.4   8.8  | 33.2i  20.0  56.0r  16.0 | 23.8i  24.0  38.4i  15.0 | 29.0ir 18.2  50.1i  11.1
CombMNZ [Rocchio3]       | 17.9ir 22.0  39.3   8.7  | 26.7ir 23.3  48.0   4.4  | 30.3ir 18.0  52.8r  14.0 | 21.6ir 18.0  35.8ir 9.0  | 26.7ir 14.1  45.7r  4.0
Rerank [Rocchio3]        | 16.9ir 22.7  38.3   0.0  | 25.6ir 21.7  47.8   0.0  | 28.3ir 12.0  51.2   2.0  | 20.0ir 7.0   33.0r  0.0  | 25.2ir 10.1  45.7r  0.0

Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP and Best_P@5 settings of the models Rocchio1 and Rocchio3, and the score-based fusion methods. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the incorporated model (Rocchio1 or Rocchio3), respectively.
[Figure 2: Score-based fusion methods applied for Best_MAP on RM1 and RM3 - a graph representation of Table 2. Four panels plot MAP and robustness (Init) per corpus (TREC1-3, ROBUST, WSJ, SJMN, AP) for RM1 and RM3 together with Interpolation, CombMNZ and Rerank.]

[Figure 3: Score-based fusion methods applied for Best_MAP on Rocchio1 and Rocchio3 - a graph representation of Table 3. Four panels plot MAP and robustness (Init) per corpus for Rocchio1 and Rocchio3 together with Interpolation, CombMNZ and Rerank.]
Rocchio's models: While so far we have focused on RM1 and RM3, we now turn to analyze the performance of our methods when applied on Rocchio1 and Rocchio3, where (as previously discussed in Chapter 3) RM1 and RM3 weigh documents with respect to p(d|q), while Rocchio1 and Rocchio3 do not.

We see in Table 3 performance patterns similar to those presented in Table 2. One such pattern is that the MAP performance of the initial ranking, which utilizes only the original query, is always lower, to a statistically significant degree, than that of all score-based fusion methods. Another similar pattern is that all three fusion-based methods are more MAP-robust than the model they incorporate (in this case, Rocchio1 or Rocchio3). The trade-off between MAP and robustness is also maintained.

We can also see in Table 3 that the Interpolation algorithm yields the best MAP performance among the score-based fusion methods for both Rocchio1 and Rocchio3. Moreover, Interpolation [Rocchio1] yields better MAP performance than that of Rocchio1 itself for most corpora, and it is always more robust.

In terms of MAP, we can see that CombMNZ [Rocchio1], which performs query-anchoring via fusion of retrieved results, and Rerank [Rocchio1] are more robust than Rocchio3, which performs query-anchoring at the model level, but yield lower MAP scores to a statistically significant degree. We can also note that the methods based on Rocchio3 are always more robust than their corresponding methods based on Rocchio1 — a model which does not perform an interpolation with the original query model. In terms of robustness, Rerank [Rocchio3] achieves the highest performance over most corpora.

Most observations that were made for P@5 in Table 2 apply to Table 3 as well. Among these we can mention the trade-off between performance and robustness, the observation that all three fusion-based methods are more robust than the model they incorporate, and that Rerank is the most robust among all three fusion-based methods and the incorporated model itself.
6.2.1.2 Analyzing Robustness Improvement due to Score-based Fusion Methods

So far we have seen that score-based fusion methods improve robustness when applied on RM1, RM3, Rocchio1 and Rocchio3. In this section we investigate the robustness improvement of these models with respect to their properties: specifically, whether they weigh documents with respect to p(d|q) (RM1 and RM3), and whether they perform query-anchoring at the model level, i.e., pre-retrieval query-anchoring (RM3 and Rocchio3); see Section 3.2.

Analyzing Figures 2 and 3, we see that RM3, which weighs documents with respect to p(d|q) as well as performs query-anchoring at the model level, is the most robust expansion-based model, followed by Rocchio3, which only performs query-anchoring, and by RM1, which only weighs documents with respect to p(d|q). Rocchio1, which neither weighs documents with respect to p(d|q) nor performs query-anchoring, is the least robust among the expansion-based models. Another observation based on Figures 2 and 3 is that when applying CombMNZ and Interpolation on the expansion-based models, the robustness is improved throughout all corpora and the above robustness order is kept.

Figures 4 and 5 compare the robustness improvement of the different expansion-based models when CombMNZ and Interpolation are applied on them, respectively. As can be seen, CombMNZ yields a greater improvement over the models than Interpolation does. Another observation is that for CombMNZ and Interpolation alike, the robustness improvement of models that do not perform query-anchoring at the model level exceeds that of models that do, as could be expected.
[Figure 4: Robustness improvement posted by CombMNZ. The percentage of robustness improvement is plotted per corpus (TREC1-3, ROBUST, WSJ, SJMN, AP) and on average, for RM1, RM3, Rocchio1 and Rocchio3.]

[Figure 5: Robustness improvement posted by Interpolation. The percentage of robustness improvement is plotted per corpus (TREC1-3, ROBUST, WSJ, SJMN, AP) and on average, for RM1, RM3, Rocchio1 and Rocchio3.]
6.2.1.3 The Effect of the Query-Anchoring Parameter Used in RM3 on the Performance of RM3 and CombMNZ [RM3]

In this section we investigate the effect of the query-anchoring parameter used in RM3 on the performance of RM3 and CombMNZ [RM3]. Figure 6 presents the performance of RM3 and CombMNZ [RM3] where λ — the parameter that controls query-anchoring in the RM3 model (see Section 3.2) — is set to a value in {0, 0.1, . . . , 1}; increasing λ increases the anchoring.

As can be seen, up to a specific point the MAP performance of RM3 usually improves as λ decreases. This point marks the maximum MAP performance, which is usually observed when λ is 0.2 or 0.1. The MAP performance of CombMNZ [RM3] usually improves as λ decreases, and the best MAP performance is usually observed when λ = 0 (i.e., no query-anchoring). For both low and high values of λ, the MAP performance of RM3 and that of CombMNZ [RM3] are similar.

In terms of robustness, it can be seen that for most values of λ, CombMNZ [RM3] is more robust than RM3 (but has lower MAP), and the gap between them increases as the value of λ decreases. This growing gap can be attributed to the fact that RM3 suffers from a much greater decline in robustness as the λ value grows smaller. In conclusion, we see in Figure 6 the performance-robustness trade-off, where the performance robustness improves while the MAP performance decreases (as the query-anchoring in the RM3 model increases).
Figure 6: RM3 and CombMNZ [RM3] MAP and robustness performance when varying RM3's query-anchoring parameter; one panel per corpus (TREC1-3, ROBUST, WSJ, SJMN, AP), with λ on the x-axis, the initial-ranking baseline marked "init", and curves for RM3 MAP, RM3 robustness, CombMNZ [RM3] MAP and CombMNZ [RM3] robustness.
6.2.1.4 Comparison with a Cluster-Based Method

Table 4 presents a comparison of our methods with a cluster-based re-sampling method for pseudo-relevance feedback (Clusters) [44]. This method constructs a relevance model using cluster-based document re-sampling from Dinit, rewarding documents that appear in many overlapping clusters. The values of the cluster-based model's free parameters were selected to optimize MAP performance: the number of closest documents used to create a cluster was selected from {5, 10}, and the number of clusters was chosen from {10, 20, ..., num_of_docs}, where num_of_docs is the number of documents that yielded optimized MAP performance for RM3.

As can be seen in the table, for the majority of corpora (WSJ, SJMN and AP) the cluster-based method achieves the highest MAP, while for TREC1-3 and ROBUST, RM3 has the highest MAP. Our methods are generally more robust than Clusters and RM3, but, following the performance-robustness trade-off, have lower MAP; the difference is statistically significant for CombMNZ and Rerank. Among all tested methods, Rerank [RM3] is the most robust for most corpora but has the lowest MAP.

Table 4: Score-based methods vs. cluster-based method [44] applied for Best_MAP RM3
(within each corpus column: MAP, Init)

Method              | TREC1-3      | ROBUST       | WSJ          | SJMN         | AP
Init Rank           | 14.9 -       | 25.0 -       | 27.8 -       | 18.9 -       | 22.2 -
RM3                 | 20.0i 28.7   | 30.0i 28.1   | 34.8i 20.0   | 24.6i 29.0   | 29.1i 28.3
Interpolation [RM3] | 19.6ir 22.7  | 29.3ir 27.7  | 33.8ir 22.0  | 23.9ir 24.0  | 28.7i 27.3
CombMNZ [RM3]       | 17.9irc 16.7 | 27.1irc 19.3 | 30.7irc 18.0 | 21.6irc 23.0 | 26.5irc 16.2
Rerank [RM3]        | 16.9irc 22.7 | 25.5irc 15.3 | 28.4irc 14.0 | 19.9irc 11.0 | 25.1irc 12.1
Clusters            | 19.8i 31.3   | 29.9i 32.9   | 35.0i 26.0   | 25.0i 31.0   | 29.4i 28.3

Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP setting of RM3, the score-based methods applied on RM3, and the cluster-based method presented by Lee et al. (Clusters) [44]. Boldface: best result per sub-column; i, r and c indicate statistically significant MAP differences with the initial ranking, RM3 and Clusters, respectively.
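As a rough illustration of the idea behind the cluster-based resampling baseline, and not of the exact method of [44], the following Python sketch counts, for each document in Dinit, the number of overlapping nearest-neighbor clusters it belongs to and uses the normalized counts in place of p(d|q) when estimating a relevance model. All names here are illustrative assumptions.

```python
from collections import Counter, defaultdict

def cluster_membership_weights(clusters):
    """clusters: an iterable of overlapping document clusters (lists of doc ids).
    Documents that appear in many clusters receive larger weights (simplified view of [44])."""
    counts = Counter(doc for cluster in clusters for doc in cluster)
    total = sum(counts.values())
    return {doc: count / total for doc, count in counts.items()}

def resampled_relevance_model(clusters, doc_term_probs):
    """Estimate a relevance model in which the document weights come from cluster
    membership counts rather than from p(d|q); doc_term_probs maps doc -> {term: p(w|d)}."""
    weights = cluster_membership_weights(clusters)
    rm = defaultdict(float)
    for doc, weight in weights.items():
        for term, p_w_d in doc_term_probs.get(doc, {}).items():
            rm[term] += weight * p_w_d
    return rm
```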
6.2.1.5 Robust_MAP and Robust_P@5 Settings

Thus far we have investigated the improvement in robustness when our methods were applied on the best-performing expansion-based settings. One might suspect that the robustness improvements observed so far could be attributed to the fact that we worked with performance-optimized settings; thus, we now turn to explore the performance of our methods when applied on robustness-optimized settings.

Tables 5 and 6 present the performance of the score-based fusion methods when applied on the most robust expansion-based settings, Robust_MAP and Robust_P@5. As can be seen, both in terms of MAP and of P@5, the score-based fusion methods improve robustness not only when applied on the best-performing expansion-based settings (see Figure 2), but also when applied on the most robust expansion-based settings (with the exception of CombMNZ [RM3]). Moreover, for all score-based fusion methods the MAP performance is better than that of the initial list.

We see that in most cases the fusion-based methods applied on a model that performs interpolation with the original query model (RM3 and Rocchio3) are more robust than those applied on a model that does not (RM1 and Rocchio1, respectively), but they also tend to have lower MAP and lower P@5. This finding is in line with the performance-robustness trade-off demonstrated in previous sections (e.g., Section 6.2.1.1). We can also see in Table 5 that among the fusion-based methods applied on RM1, Interpolation is the most robust, while Rerank is the most robust among the methods applied on RM3, and in the table as a whole, as was the case for the best-performing expansion-based settings (see Figure 2).
Table 5: Score-based fusion methods applied for Robust_MAP and Robust_P@5 on RM1 and RM3
(within each corpus column: MAP, Init, P@5, Init)

Method              | TREC1-3               | ROBUST                | WSJ                   | SJMN                  | AP
Init Rank           | 14.9 - 37.9 -         | 25.0 - 47.8 -         | 27.8 - 51.2 -         | 18.9 - 33.0 -         | 22.2 - 45.5 -
RM1                 | 19.1i 37.3 42.0 22.0  | 27.5i 45.4 48.4 23.3  | 33.1i 32.0 56.0 20.0  | 23.6i 35.0 39.6i 15.0 | 28.4i 34.3 50.9 18.2
Interpolation [RM1] | 15.5ir 13.3 38.1 0.7  | 25.6i 13.7 47.7 2.0   | 28.9ir 16.0 52.0 0.0  | 19.4ir 16.0 33.4r 0.0 | 23.2ir 9.1 45.5 0.0
CombMNZ [RM1]       | 18.1i 23.3 38.4 3.3   | 28.0i 28.5 48.3 5.6   | 30.8ir 20.0 51.6 2.0  | 21.5i 22.0 35.0ir 3.0 | 26.8i 21.2 45.1r 4.0
Rerank [RM1]        | 17.4ir 28.7 38.4 0.7  | 26.3i 30.9 48.2 2.4   | 29.8ir 22.0 51.6 2.0  | 20.4ir 16.0 33.6r 1.0 | 25.8ir 19.2 45.3 1.0
RM3                 | 15.3i 9.3 38.3 1.3    | 27.5i 20.1 48.0 2.4   | 31.8i 14.0 53.2 2.0   | 19.3i 14.0 34.4i 1.0  | 22.8i 9.1 45.9 1.0
Interpolation [RM3] | 14.9ir 7.3 38.0 0.7   | 25.2ir 17.7 48.0 0.4  | 28.4ir 12.0 52.0 0.0  | 19.0ir 9.0 33.4 0.0   | 22.4ir 6.1 45.5 0.0
CombMNZ [RM3]       | 15.1ir 10.0 37.9 1.3  | 26.0ir 21.7 47.8 1.6  | 29.6ir 18.0 51.6 2.0  | 19.1ir 16.0 33.8 1.0  | 22.5ir 8.1 45.5 1.0
Rerank [RM3]        | 15.0ir 6.7 37.9 0.0   | 25.1ir 11.6 47.8 0.0  | 28.0ir 8.0 51.2 0.0   | 19.0ir 2.0 33.0r 0.0  | 22.3ir 4.0 45.5 0.0

Performance numbers of the initial ranking that is based on using only the original query, the Robust_MAP and Robust_P@5 settings of the relevance models RM1 and RM3, and the score-based fusion methods. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the incorporated model (RM1 or RM3), respectively.

Table 6: Score-based fusion methods applied for Robust_MAP and Robust_P@5 on Rocchio1 and Rocchio3
(within each corpus column: MAP, Init, P@5, Init)

Method                   | TREC1-3               | ROBUST                | WSJ                   | SJMN                  | AP
Init Rank                | 14.9 - 37.9 -         | 25.0 - 47.8 -         | 27.8 - 51.2 -         | 18.9 - 33.0 -         | 22.2 - 45.5 -
Rocchio1                 | 18.6i 36.7 43.1i 21.3 | 26.9 50.6 44.2i 34.1  | 30.4 40.0 55.6 22.0   | 22.3i 36.0 37.6 22.0  | 29.3i 30.3 51.7i 19.2
Interpolation [Rocchio1] | 15.7ir 17.3 15.7r 17.3| 25.5i 17.3 25.5r 17.3 | 30.2i 20.0 30.2 20.0  | 19.6ir 20.0 19.6 20.0 | 23.3ir 10.1 23.3 10.1
CombMNZ [Rocchio1]       | 18.3i 28.0 18.3 28.0  | 27.6i 28.9 27.6r 28.9 | 30.8i 24.0 30.8 24.0  | 22.1i 24.0 22.1i 24.0 | 27.5ir 18.2 27.5i 18.2
Rerank [Rocchio1]        | 17.1ir 28.7 17.1r 28.7| 26.3i 34.9 26.3r 34.9 | 28.6i 34.0 28.6 34.0  | 20.1i 16.0 20.1 16.0  | 26.2ir 17.2 26.2r 17.2
Rocchio3                 | 15.5i 10.0 38.1 1.3   | 25.2i 22.1 48.0 1.6   | 32.2i 14.0 52.0 0.0   | 19.7i 14.0 34.0i 0.0  | 23.2i 9.1 45.7 0.0
Interpolation [Rocchio3] | 15.5ir 10.0 37.9 0.0  | 25.0ir 11.2 48.0 0.4  | 31.2ir 16.0 51.6 0.0  | 19.0i 13.0 34.0i 0.0  | 22.4ir 7.1 45.7 0.0
CombMNZ [Rocchio3]       | 15.3ir 10.7 37.9 0.7  | 25.1ir 17.7 48.0 0.4  | 29.2ir 16.0 51.6 0.0  | 19.3i 16.0 33.6 0.0   | 22.8ir 12.1 45.5 0.0
Rerank [Rocchio3]        | 15.0ir 4.0 37.9 0.0   | 25.0ir 1.6 47.8 0.0   | 28.0ir 16.0 51.2 0.0  | 19.0i 1.0 33.0r 0.0   | 22.4ir 5.1 45.5 0.0

Performance numbers of the initial ranking that is based on using only the original query, the Robust_MAP and Robust_P@5 settings of the models Rocchio1 and Rocchio3, and the score-based fusion methods. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the incorporated model (Rocchio1 or Rocchio3), respectively.
6.2.1.6 Re-ranking Methods

In this section we analyze the performance of the three re-ranking methods: Rerank, Rev_rerank and Centroid_rerank (see Section 4.2.2). While CombMNZ and Interpolation, as examined in previous sections, rely on the co-occurrence of documents in the retrieved lists, the re-ranking methods re-order the (top) retrieved results of one retrieval method by the scores assigned by another.

Rerank and Rev_rerank use the initial list and the expansion-based list. Specifically, Rerank re-orders the (top) pseudo-feedback-based retrieval results by the initial scores of documents, while Rev_rerank re-orders the (top) initial retrieval results by the pseudo-feedback-based scores of documents. The main idea behind the former is to anchor the documents in the expansion-based list to the initial query by using their initial scores, while the main idea behind the latter is to prevent query drift by considering only documents that appear in the initial list and using the query expansion method to re-order them. As can be seen in Table 7, the Rerank method is more robust than the Rev_rerank method, and both are more robust than the model they incorporate. However, their MAP performance suffers a penalty, as can be expected from the performance-robustness trade-off. In terms of P@5, a very small difference was noted in both performance and robustness between Rev_rerank and the model it incorporates, while a much bigger difference was observed between Rerank and the models it incorporates.

When examining the effect of applying Rerank and Rev_rerank on the expansion-based models, we come to an interesting observation. While applying Rev_rerank on the models keeps the order in terms of MAP performance intact, for Rerank the opposite is true. Namely, the MAP score of RM3 and Rocchio3 is better than that of RM1 and Rocchio1, respectively, as is the case when Rev_rerank is applied on the models; however, when Rerank is applied, the opposite behavior is observed, although in this case the differences are relatively small, i.e., the Rerank method is relatively
indifferent (in terms of MAP) to the model it incorporates. This behavior calls for further research and examination. One hypothesis is that the content of the expansion-based lists is relatively similar, and that the major difference between the lists comes from their rankings, which derive from the difference in the scores that the expansion-based models assign to the documents. Hence, Rerank, which re-orders the pseudo-feedback-based retrieval results by the initial scores, is relatively indifferent to the model it incorporates, while Rev_rerank, which re-orders the initial retrieval results by the pseudo-feedback-based scores of documents, is more affected by the incorporated model.

Unlike Rerank and Rev_rerank, Centroid_rerank uses two expansion-based models, as it re-orders the (top) pseudo-feedback-based retrieval results, created using RM1, by the centroid it creates from the initial documents using the Rocchio1 model. RM1 can be viewed as a model that creates a weighted centroid, while Rocchio1 can be viewed as a model that creates a non-weighted centroid (see Section 4.2.2). As can be seen in Table 8, Centroid_rerank usually has lower performance and is less robust than the model it incorporates (RM1). Analyzing these results leads us to believe that re-ordering a list produced by a weighted-centroid method using a non-weighted centroid may not succeed in preventing the drift. The other re-ordering direction is worth testing in future work.
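The two list-based re-ranking methods can be summarized by the following Python sketch; the cutoff k and the zero default score for documents missing from the scoring list are assumptions made here for illustration, and the exact definitions used in this work are those of Section 4.2.2. Centroid_rerank would analogously re-score the expansion-based list by its documents' similarity to the non-weighted centroid of the initial documents.

```python
def rerank(expansion_list, init_scores, k=50):
    """Rerank: re-order the top-k documents retrieved for the expanded query
    by the scores they received for the original query (query-anchoring)."""
    top = expansion_list[:k]
    return sorted(top, key=lambda d: init_scores.get(d, 0.0), reverse=True)

def rev_rerank(init_list, expansion_scores, k=50):
    """Rev_rerank: re-order the top-k documents of the initial list by their
    pseudo-feedback-based scores; only documents that already matched the
    original query are considered, which curbs query drift."""
    top = init_list[:k]
    return sorted(top, key=lambda d: expansion_scores.get(d, 0.0), reverse=True)
```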
Table 7: Rerank vs. Rev_rerank applied for Best_MAP and Best_P@5
(within each corpus column: MAP, Init, P@5, Init)

Method                | TREC1-3               | ROBUST                | WSJ                   | SJMN                  | AP
Init Rank             | 14.9 - 37.9 -         | 25.0 - 47.8 -         | 27.8 - 51.2 -         | 18.9 - 33.0 -         | 22.2 - 45.5 -
RM1                   | 19.2i 38.7 44.4i 25.3 | 27.5i 45.4 49.2 23.7  | 33.2i 34.0 56.4 22.0  | 24.1i 37.0 39.6i 15.0 | 28.5i 38.4 52.1i 21.2
Rerank [RM1]          | 17.5ir 27.3 38.9r 1.3 | 26.3i 30.9 48.2 2.8   | 29.8ir 22.0 51.2 4.0  | 20.4ir 16.0 33.6r 1.0 | 25.9ir 20.2 45.3r 1.0
Rev_rerank [RM1]      | 17.9ir 33.3 44.8 24.7 | 27.7ir 41.0 49.8 22.1 | 32.8i 32.0 57.6 20.0  | 23.0i 37.0 39.4i 16.0 | 24.4ir 23.2 47.9i 11.1
RM3                   | 20.0i 28.7 44.8i 24.0 | 30.0i 28.1 50.6 14.5  | 34.8i 20.0 58.0i 14.0 | 24.6i 29.0 39.6i 15.0 | 29.1i 28.3 52.1i 17.2
Rerank [RM3]          | 16.9ir 22.7 38.4r 0.0 | 25.5ir 15.3 47.8 0.0  | 28.4ir 14.0 51.2r 0.0 | 19.9ir 11.0 33.6 1.0  | 25.1ir 12.1 45.5r 0.0
Rev_rerank [RM3]      | 18.1ir 26.0 44.7i 23.3| 29.1ir 31.3 50.6 14.5 | 33.9ir 20.0 58.0i 14.0| 23.5ir 31.0 39.4i 16.0| 26.5ir 31.3 52.3i 17.2
Rocchio1              | 19.2i 40.7 43.1i 21.3 | 26.9 51.8 45.7 34.9   | 31.7i 42.0 55.6 22.0  | 23.2i 45.0 39.2 25.0  | 29.4i 32.3 52.3i 21.2
Rerank [Rocchio1]     | 17.4ir 30.0 38.5r 0.7 | 26.3i 32.1 48.0 2.4   | 29.7i 20.0 50.8 8.0   | 20.1i 27.0 34.0 1.0   | 26.2ir 17.2 45.9r 0.0
Rev_rerank [Rocchio1] | 17.9ir 34.7 43.2i 21.3| 26.3r 48.2 45.6 35.3  | 31.2i 40.0 55.6 22.0  | 22.2i 40.0 39.2 25.0  | 26.6ir 30.3 52.3i 21.2
Rocchio3              | 19.9i 33.3 43.1i 24.7 | 29.2i 36.5 49.3 10.0  | 34.0i 26.0 59.6 16.0  | 24.3i 31.0 40.4i 17.0 | 29.8i 23.2 52.9i 20.2
Rerank [Rocchio3]     | 16.9ir 22.7 38.3 0.0  | 25.6ir 21.7 47.8 0.0  | 28.3ir 12.0 51.2 2.0  | 20.0ir 7.0 33.0r 0.0  | 25.2ir 10.1 45.7r 0.0
Rev_rerank [Rocchio3] | 18.1ir 30.7 43.1 24.7 | 28.0ir 35.3 49.3 10.0 | 33.3ir 24.0 59.6 16.0 | 23.0ir 31.0 40.2i 17.0| 26.9ir 25.3 52.9i 20.2

Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP and Best_P@5 settings of the models RM1, RM3, Rocchio1 and Rocchio3, and the Rerank and Rev_rerank methods based on them. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the incorporated model, respectively.

Table 8: Re-ranking methods applied for Best_MAP RM1
(within each corpus column: MAP, Init, P@5, Init)

Method                | TREC1-3               | ROBUST                | WSJ                   | SJMN                  | AP
Init Rank             | 14.9 - 37.9 -         | 25.0 - 47.8 -         | 27.8 - 51.2 -         | 18.9 - 33.0 -         | 22.2 - 45.5 -
RM1                   | 19.2i 38.7 44.4i 25.3 | 27.5i 45.4 49.2 23.7  | 33.2i 34.0 56.4 22.0  | 24.1i 37.0 39.6i 15.0 | 28.5i 38.4 52.1i 21.2
Rerank [RM1]          | 17.5ir 27.3 38.9r 1.3 | 26.3i 30.9 48.2 2.8   | 29.8ir 22.0 51.2 4.0  | 20.4ir 16.0 33.6r 1.0 | 25.9ir 20.2 45.3r 1.0
Rev_rerank [RM1]      | 17.9ir 33.3 44.8 24.7 | 27.7ir 41.0 49.8 22.1 | 32.8i 32.0 57.6 20.0  | 23.0i 37.0 39.4i 16.0 | 24.4ir 23.2 47.9i 11.1
Centroid_rerank       | 18.1ir 43.3 36.7r 34.0| 20.5ir 70.7 38.5ir 46.2| 23.2ir 66.0 46.8r 42.0| 21.8r 44.0 37.0 27.0 | 26.7ir 38.4 48.3 27.3

Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP setting of RM1, and the re-ordering methods. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the incorporated model, respectively.
6.2.2 Rank-Based Fusion Methods

In the previous section we analyzed the performance of the score-based fusion methods; in this section we examine the impact on performance of the rank-based fusion methods. As can be seen in Table 9, all rank-based fusion methods attain better performance (both MAP and P@5) than that of the initial ranking, as was the case for the score-based fusion methods (see Section 6.2.1.1).

The query expansion models that perform interpolation with the original query (RM3 and Rocchio3) behave similarly to one another, as do the models that do not (RM1 and Rocchio1). Looking at the results of RM3 and Rocchio3 and of the rank-based methods applied on them, we note that the performance-robustness trade-off is usually kept: they have the best MAP and P@5 performance, but all the methods incorporating them have better robustness. Looking at the RM1-based methods, we note that Interpolation [RM1] overcomes the performance-robustness trade-off in terms of MAP, having both better MAP and better robustness than RM1. This is also the case for Interpolation [Rocchio1] with respect to Rocchio1, but not for RM3 or Rocchio3, where the Interpolation method acts according to the performance-robustness trade-off.

Comparing the rank-based methods applied on RM1 with the corresponding methods applied on RM3, we see that CombMNZ and bordaRank are the most robust and usually have a similar robustness score. We also note that CombMNZ and bordaRank follow the performance-robustness trade-off: when applied on RM1 they have better (MAP and P@5) performance, while when applied on RM3 they are more robust. The Interpolation method overcomes the trade-off; when applied on RM3 it usually has better performance and is also more robust. Comparing the rank-based methods applied on Rocchio1 with the corresponding methods applied on Rocchio3, we note that CombMNZ and bordaRank follow the performance-robustness trade-off, similarly to when they are applied on RM1 and RM3.
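As an illustration of rank-based fusion, the sketch below (Python) shows a Borda-count fusion of the two ranked lists and a rank-based variant of CombMNZ in which a rank-derived similarity replaces retrieval scores. The reciprocal-rank substitute 1/(rank+1) used here is an assumption made for illustration; the exact rank-based similarity and fusion definitions used in this work are those given in Chapter 4.

```python
def borda_fuse(ranked_lists):
    """Borda-count fusion: each list awards a document (list_length - rank) points;
    documents are re-ranked by their total points across the lists."""
    points = {}
    for ranked in ranked_lists:
        n = len(ranked)
        for rank, doc in enumerate(ranked):
            points[doc] = points.get(doc, 0) + (n - rank)
    return sorted(points, key=points.get, reverse=True)

def combmnz_rank_based(ranked_lists):
    """Rank-based CombMNZ: sum a rank-derived similarity (1/(rank+1) here, an
    assumption) over the lists and multiply by the number of lists containing
    the document, mirroring what score-based CombMNZ does with retrieval scores."""
    sums, hits = {}, {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked):
            sums[doc] = sums.get(doc, 0.0) + 1.0 / (rank + 1)
            hits[doc] = hits.get(doc, 0) + 1
    return sorted(sums, key=lambda d: sums[d] * hits[d], reverse=True)
```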
Table 9: Rank-based fusion methods applied for Best_MAP and Best_P@5
(within each corpus column: MAP, Init, P@5, Init)

Method                           | TREC1-3               | ROBUST                | WSJ                   | SJMN                  | AP
Init Rank                        | 14.9 - 37.9 -         | 25.0 - 47.8 -         | 27.8 - 51.2 -         | 18.9 - 33.0 -         | 22.2 - 45.5 -
RM1                              | 19.2i 38.7 44.4i 25.3 | 27.5i 45.4 49.2 23.7  | 33.2i 34.0 56.4 22.0  | 24.1i 37.0 39.6i 15.0 | 28.5i 38.4 52.1i 21.2
Interpolation rankSim [RM1]      | 19.5ir 30.7 43.9 21.3 | 28.8ir 35.7 50.1 15.7 | 34.0ir 30.0 59.6r 14.0| 24.3ir 33.0 38.4i 17.0| 28.8ir 33.3 50.5i 16.2
CombMNZ rankSim [RM1]            | 18.9i 21.3 43.1 15.3  | 28.6ir 31.7 49.2 15.7 | 33.4i 22.0 56.4 18.0  | 23.3i 21.0 37.6i 13.0 | 27.8i 28.3 49.9i 19.2
bordaRank [RM1]                  | 19.0i 22.0 43.1 15.3  | 28.7ir 30.9 49.0 15.7 | 33.5i 22.0 56.4 18.0  | 23.5i 21.0 37.6i 13.0 | 27.9i 28.3 50.1i 19.2
RM3                              | 20.0i 28.7 44.8i 24.0 | 30.0i 28.1 50.6 14.5  | 34.8i 20.0 58.0i 14.0 | 24.6i 29.0 39.6i 15.0 | 29.1i 28.3 52.1i 17.2
Interpolation rankSim [RM3]      | 19.8i 24.0 43.9i 19.3 | 29.5ir 24.9 50.4 13.7 | 34.2ir 20.0 58.0i 12.0| 24.2i 26.0 38.4i 17.0 | 28.8i 28.3 51.9i 16.2
CombMNZ rankSim [RM3]            | 18.3ir 17.3 42.1i 15.3| 27.7ir 22.5 49.2 10.8 | 31.6ir 16.0 56.4 10.0 | 22.4ir 21.0 37.6i 13.0| 26.9ir 23.2 48.5r 14.1
bordaRank [RM3]                  | 18.3ir 18.0 42.1i 15.3| 27.8ir 22.9 49.0 12.0 | 31.7ir 16.0 56.4 10.0 | 22.6ir 20.0 37.6i 13.0| 27.1ir 24.2 48.7r 14.1
Rocchio1                         | 19.2i 40.7 43.1i 21.3 | 26.9 51.8 45.7 34.9   | 31.7i 42.0 55.6 22.0  | 23.2i 45.0 39.2 25.0  | 29.4i 32.3 52.3i 21.2
Interpolation rankSim [Rocchio1] | 19.5ir 36.0 41.9 22.7 | 28.1ir 37.8 48.8r 12.0| 33.0i 26.0 57.2i 14.0 | 23.3ir 42.0 38.6r 25.0| 29.7ir 26.3 51.5i 20.2
CombMNZ rankSim [Rocchio1]       | 18.8i 27.3 41.1 21.3  | 27.9ir 36.9 47.6 26.9 | 32.7i 24.0 53.2 26.0  | 22.4i 32.0 36.8 18.0  | 28.2i 23.2 49.1 20.2
bordaRank [Rocchio1]             | 18.9i 29.3 41.2 21.3  | 28.0ir 35.7 47.9 26.9 | 32.7i 24.0 52.8 26.0  | 22.5i 31.0 36.6 19.0  | 28.3i 22.2 49.1 20.2
Rocchio3                         | 19.9i 33.3 38.1i 1.3  | 29.2i 36.5 48.0 1.6   | 34.0i 26.0 52.0 0.0   | 24.3i 31.0 34.0i 0.0  | 29.8i 23.2 45.7i 0.0
Interpolation rankSim [Rocchio3] | 19.7i 31.3 38.0i 0.0  | 28.8i 34.9 47.8 0.0   | 33.5i 22.0 52.0r 0.0  | 23.8i 28.0 34.0ir 0.0 | 29.5i 18.2 45.7i 0.0
CombMNZ rankSim [Rocchio3]       | 18.3ir 23.3 38.0i 0.0 | 27.5ir 26.9 47.9 0.8  | 31.6ir 20.0 51.6 0.0  | 22.2ir 22.0 33.6ir 0.0| 27.2ir 16.2 45.7i 0.0
bordaRank [Rocchio3]             | 18.4ir 24.7 38.0i 0.0 | 27.7ir 26.5 47.9 0.8  | 31.7ir 20.0 51.6 0.0  | 22.4ir 21.0 33.6r 0.0 | 27.4ir 17.2 45.7i 0.0

Performance numbers of the initial ranking that is based on using only the original query, the Best_MAP and Best_P@5 settings of the models RM1, RM3, Rocchio1 and Rocchio3, and the rank-based fusion methods. Boldface: best result per sub-column; i and r indicate statistically significant MAP or P@5 differences with the initial ranking and the model they incorporate (RM1, RM3, Rocchio1 or Rocchio3), respectively.
6.2.3 Score-Based and Rank-Based Fusion Methods Comparison

In previous sections we presented two types of fusion methods, score-based and rank-based. Figures 7 and 8 compare the performance of the score-based and rank-based variants of CombMNZ and Interpolation, respectively, on the Best_MAP setting of each expansion-based model.

We can observe that the score-based methods (both CombMNZ and Interpolation) are usually more robust than the corresponding rank-based methods, but have a lower MAP score. Another observation derived from Figure 7 is that both score-based and rank-based CombMNZ are more robust than the model they incorporate, but usually have a lower MAP. Observing Figure 8, we note that, like CombMNZ, both score-based and rank-based Interpolation are more robust than the model they incorporate; unlike CombMNZ, however, when applied on RM1 and Rocchio1, both score-based and rank-based Interpolation manage to overcome the inherent trade-off between MAP and robustness by improving both.

Another observation from Figures 7 and 8 is that the score-based methods obtain a greater robustness improvement than the rank-based methods do, and that both types of methods have the highest impact when applied on RM1 and Rocchio1.
Figure 7: Score-based vs. rank-based CombMNZ applied on the Best_MAP settings of RM1, RM3, Rocchio1 and Rocchio3; the upper panel shows MAP and the lower panel shows robustness, for each corpus (TREC1-3, ROBUST, WSJ, SJMN, AP) and query expansion model.
Figure 8: Score-based vs. rank-based Interpolation applied on the Best_MAP settings of RM1, RM3, Rocchio1 and Rocchio3; the upper panel shows MAP and the lower panel shows robustness, for each corpus (TREC1-3, ROBUST, WSJ, SJMN, AP) and query expansion model.
7 Conclusions and Future Work

We addressed the performance robustness problem of pseudo-feedback-based query expansion models, that is, the fact that for some queries using the original query alone results in much better performance than that attained by using the expanded form. The performance robustness problem is often attributed to query drift, the change in intention between the original query and its expanded form. Thus, we posed as a goal to ameliorate query drift, specifically by using information induced from document-query surface-level similarities, as manifested, for example, in the ranking these similarities induce, i.e., the ranking of the corpus in response to the original query.

One approach for ameliorating query drift that we have presented relies on fusing the lists retrieved in response to the query and to its expanded form, so as to perform query-anchoring of the latter. The second approach is based on re-ranking documents retrieved by pseudo-feedback-based retrieval using their query similarity. Both approaches are based on the premise that documents retrieved in response to the expanded query that exhibit high query similarity are less prone to exhibit query drift. Indeed, we showed empirically that such methods help to improve the robustness of pseudo-feedback-based retrieval while slightly hurting the overall average performance, albeit not to a statistically significant degree in most cases. This average-performance/robustness trade-off arose in most of the methods that we have presented.

We have also shown that our approaches can improve the performance robustness of expansion models that perform query-anchoring at the model level, i.e., when constructing the expanded form. Thus, pre-retrieval query-anchoring performed at the model level and post-retrieval query-anchoring performed by our methods can be viewed as complementary. Furthermore, our methods can improve the robustness of query expansion models that are tuned to optimize robustness. In addition, we showed that our
methods are more effective in improving robustness than some previously proposed approaches, e.g., a cluster-based method, though sometimes at the cost of somewhat hurting overall average performance.

Additional exploration showed that our fusion-based approach is effective in improving performance robustness whether it operates on retrieval scores or on document ranks. We hasten to point out that, since pseudo-feedback-based retrieval calls for two retrievals (one using the original query and one using the expanded form), the overall computational overhead incurred by our fusion-based approach is quite minimal.

The overall performance and robustness of our approach demonstrate the potential of this line of research and hence call for future examination. For example, one can explore more sophisticated fusion methods [81, 4, 49, 86]. Testing our methods on additional query expansion models is another avenue worth exploring in future work.
References

[1] Nasreen Abdul-Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Mark D. Smucker, and Courtney Wade. UMass at TREC 2004: Novelty and HARD. In Proceedings of the Thirteenth Text Retrieval Conference (TREC-13), 2004.

[2] James Allan, Margaret E. Connell, W. Bruce Croft, Fang-Fang Feng, David Fisher, and Xiaoyan Li. INQUERY and TREC-9. In Proceedings of the Ninth Text Retrieval Conference (TREC-9), pages 551–562, 2000. NIST Special Publication 500-249.

[3] Giambattista Amati, Claudio Carpineto, and Giovanni Romano. Query difficulty, robustness, and selective application of query expansion. In Proceedings of ECIR, pages 127–137, 2004.

[4] Javed A. Aslam and Mark Montague. Bayes optimal metasearch: a probabilistic model for combining the results of multiple retrieval systems (poster session). In SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 379–381, 2000.

[5] Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew. Automatic combination of multiple ranked retrieval systems. In SIGIR '94: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pages 173–181, 1994.

[6] Steven M. Beitzel, Ophir Frieder, Eric C. Jensen, David Grossman, Abdur Chowdhury, and Nazli Goharian. Disproving the fusion hypothesis: an analysis of data fusion via effective information retrieval strategies. In SAC '03: Proceedings of the 2003 ACM symposium on Applied computing, pages 823–827, 2003.
[7] Nicholas J. Belkin, C. Cool, W. Bruce Croft, and James P. Callan. The effect of multiple query representations on information retrieval system performance. In SIGIR '93: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, pages 339–346, 1993.

[8] Bodo Billerbeck and Justin Zobel. When query expansion fails. In Proceedings of SIGIR, pages 387–388, 2003.

[9] Chris Buckley. Why current IR engines fail. In Proceedings of SIGIR, pages 584–585, 2004. Poster.

[10] Chris Buckley and Donna Harman. Reliable information access final workshop report. Technical report, 2004.

[11] Chris Buckley and Mandar Mitra. Using clustering and superconcepts within SMART: TREC 6. pages 107–124, 1998.

[12] Chris Buckley, Mandar Mitra, Janet Walz, and Claire Cardie. Using clustering and superconcepts within SMART: TREC 6. Information Processing and Management, 36(1):109–131, 2000.

[13] Chris Buckley, Gerard Salton, James Allan, and Amit Singhal. Automatic query expansion using SMART: TREC 3. In Proceedings of the Third Text Retrieval Conference (TREC-3), pages 69–80, 1994.

[14] Guihong Cao, Jian-Yun Nie, Jianfeng Gao, and Stephen Robertson. Selecting good expansion terms for pseudo-relevance feedback. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 243–250, 2008.

[15] Claudio Carpineto, Renato de Mori, Giovanni Romano, and Brigitte Bigi. An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1–27, 2001.
[16] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University. Cited in Chengxiang Zhai and John D. Lafferty, A study of smoothing methods for language models applied to ad hoc information retrieval.

[17] Kevyn Collins-Thompson and Jamie Callan. Estimation and use of uncertainty in pseudo-relevance feedback. In Proceedings of SIGIR, pages 303–310, 2007.

[18] Margaret Connell, Ao Feng, Giridhar Kumaran, Hema Raghavan, Chirag Shah, and James Allan. UMass at TDT 2004. TDT 2004 System Description, 2004.

[19] W. Bruce Croft. A model of cluster searching based on classification. Information Systems, 5:189–195, 1980.

[20] Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. A language modeling framework for selective query expansion. Technical Report IR-338, Center for Intelligent Information Retrieval, University of Massachusetts, 2004.

[21] Padima Das-Gupta and Jeffrey Katzer. A study of the overlap among document representations. In SIGIR '83: Proceedings of the 6th annual international ACM SIGIR conference on Research and development in information retrieval, pages 106–114, 1983.

[22] Jean-Charles de Borda. Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences, Paris, 1781.

[23] Fernando Diaz and Donald Metzler. Improving the estimation of relevance models using large external corpora. In Proceedings of SIGIR, pages 154–161, 2006.
[24] Djoerd Hiemstra and Wessel Kraaij. Twenty-One at TREC-7: ad-hoc and cross-language track. In Proceedings of the Seventh Text Retrieval Conference (TREC-7), pages 227–238, 1999.

[25] H. L. Fisher and D. R. Elchesen. Effectiveness of combining title words and index terms in machine retrieval searches.

[26] Larry Fitzpatrick and Mei Dent. Automatic feedback using past queries: social searching? In SIGIR '97: Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, pages 306–313, 1997.

[27] Edward A. Fox and Joseph A. Shaw. Combination of multiple searches. In Proceedings of TREC-2, 1994.

[28] Alan Griffiths, H. Claire Luckhurst, and Peter Willett. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science (JASIS), 37(1):3–11, 1986. Reprinted in Karen Sparck Jones and Peter Willett, eds., Readings in Information Retrieval, Morgan Kaufmann, pp. 365–373, 1997.

[29] Donna Harman and Chris Buckley. The NRRC reliable information access (RIA) workshop. In Proceedings of SIGIR, pages 528–529, 2004. Poster.

[30] Donna K. Harman, editor. Overview of the first TREC conference, 1993.

[31] D. Hawking, P. Thistlewaite, and N. Craswell. ANU/ACSys TREC-6 experiments. In Proceedings of the Sixth Text Retrieval Conference (TREC-6), NIST Special Publication 500-240, pages 275–290, 1998.

[32] Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In SIGIR '96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 76–84. ACM, 1996.
[33] N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7(5):217–240, 1971.

[34] F. Jelinek and R. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, Eds., pages 381–402, 1980.

[35] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

[36] Oren Kurland and Lillian Lee. Corpus structure, language models, and ad hoc information retrieval. In Proceedings of SIGIR, pages 194–201, 2004.

[37] Oren Kurland, Lillian Lee, and Carmel Domshlak. Better than the real thing? Iterative pseudo-query processing using cluster-based language models. In Proceedings of SIGIR, pages 19–26, 2005.

[38] John D. Lafferty and Chengxiang Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of SIGIR, pages 111–119, 2001.

[39] Adenike M. Lam-Adesina and Gareth J. F. Jones. Applying summarization techniques for term selection in relevance feedback. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 1–9, 2001.

[40] Victor Lavrenko and W. Bruce Croft. Relevance models in information retrieval. pages 11–56.

[41] Victor Lavrenko and W. Bruce Croft. Relevance-based language models. In Proceedings of SIGIR, pages 120–127, 2001.
[42] J. H. Lee. Combining multiple evidence from different properties of weighting schemes. In Proceedings of SIGIR, pages 180–188, 1995.

[43] J. H. Lee. Analyses of multiple evidence combination. In Proceedings of SIGIR, pages 267–276, 1997.

[44] Kyung Soon Lee, W. Bruce Croft, and James Allan. A cluster-based resampling method for pseudo-relevance feedback. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 235–242. ACM, 2008.

[45] Kyung-Soon Lee, Kyo Kageura, and Key-Sun Choi. Implicit ambiguity resolution using incremental clustering in cross-language information retrieval. Information Processing and Management, 40(1):145–159, 2004.

[46] Kyung-Soon Lee, Young-Chan Park, and Key-Sun Choi. Re-ranking model based on document clusters. Information Processing and Management, 37(1):1–14, 2001.

[47] Xiaoyan Li. A new robust relevance model in the language model framework. Information Processing and Management, 44(3):991–1007, 2008.

[48] Xiaoyan Li and W. Bruce Croft. Improving the robustness of relevance-based language models. Technical Report IR-401, Center for Intelligent Information Retrieval, University of Massachusetts, 2005.

[49] David Lillis, Fergus Toolan, Rem Collier, and John Dunnion. ProbFuse: a probabilistic approach to data fusion. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 139–146, 2006.

[50] Xiaoyong Liu and W. Bruce Croft. Cluster-based retrieval using language models. In Proceedings of SIGIR, pages 186–193, 2004.
[51] M. Lu, A. Ayoub, and J. Dong. Ad hoc experiments using Eureka. In NIST Special Publication 500-238: The Fifth Text Retrieval Conference (TREC-5), 1997.

[52] Thomas R. Lynam, Chris Buckley, Charles L. A. Clarke, and Gordon V. Cormack. A multi-system analysis of document and term selection for blind feedback. In CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 261–269, 2004.

[53] M. McGill, M. Koll, and T. Noreault. An evaluation of factors affecting document ranking by information retrieval systems.

[54] Hiroko Mano and Yasushi Ogawa. Selecting expansion terms in automatic query expansion. In Proceedings of SIGIR, pages 390–391, 2001. Poster.

[55] Donald Metzler, Fernando Diaz, Trevor Strohman, and W. Bruce Croft. Using mixtures of relevance models for query expansion. In Proceedings of the Fourteenth Text Retrieval Conference (TREC), 2005.

[56] David R. H. Miller, Tim Leek, and Richard M. Schwartz. A hidden Markov model information retrieval system. In Proceedings of SIGIR, pages 214–221, 1999.

[57] Mandar Mitra, Amit Singhal, and Chris Buckley. Improving automatic query expansion. In Proceedings of SIGIR, pages 206–214, 1998.

[58] Jesse Montgomery, Luo Si, Jamie Callan, and David A. Evans. Effect of varying number of documents in blind feedback: analysis of the 2003 NRRC RIA workshop bf_numdocs experiment suite. In Proceedings of SIGIR, pages 476–477, 2004. Poster.

[59] Natali Soskin, Oren Kurland, and Carmel Domshlak. Navigating in the dark: Modeling uncertainty in ad hoc retrieval using multiple relevance