Traditional probabilistic relevance frameworks for information retrieval refrain from taking positional information into account, due to the difficulty of developing a sound model while avoiding an explosion in the number of parameters. Nonetheless, the well-known BM25F extension of the successful Okapi ranking function can be seen as an embryonic attempt in that direction. In this paper, we proceed along the same lines, defining the notion of a virtual region: a virtual region is a part of the document that, like a BM25F field, can provide (stronger or weaker, depending on a tunable weighting parameter) evidence of the relevance of the document; unlike BM25F fields, though, virtual regions are generated implicitly by applying suitable (usually, but not necessarily, position-aware) operators to the query. This technique fits nicely into the eliteness model behind BM25 and provides a principled explanation of BM25F; it specializes to BM25(F) for some trivial operators, but has much more general appeal. Our experiments (both on standard collections, such as TREC, and on Web-like collections) show that the use of virtual regions is beneficial for retrieval effectiveness.
1. Extending BM25 with multiple query operators
Roi Blanco
Yahoo! Research
Paolo Boldi
Università degli Studi di Milano
2. Overview
• Motivation
• Virtual regions and BM25’s event space
• Experiments
• Conclusions
3. Motivation
• From document understanding to query
understanding
• Query rewrites (gazetteers, spell
correction), named entity recognition,
query suggestions, query categories, query
segmentation ...
• Detecting query intent, triggering verticals
• direct target towards answers
• richer interfaces
4. Signals for Ranking
• Signals for ranking: matches of query terms
in documents, query-independent quality
measures, CTR, among others
• Probabilistic IR models are all about
counting
• occurrences of terms in documents, in
sets of documents, etc.
• How to aggregate efficiently a large number
of “different” counts
• coming from the same terms
• no double counts!
5. Searching for food
• New York’s greatest pizza
‣ New OR York’s OR greatest OR pizza
‣ New AND York’s AND greatest AND pizza
‣ New OR York OR great OR pizza
‣ “New York” OR “great pizza”
‣ “New York” AND “great pizza”
‣ York < New AND great OR pizza
• among many more.
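Each bullet above is a different rewrite of the same raw query. As a minimal sketch (the function name and the particular rewrites are ours, purely for illustration; real operators would also normalize terms, e.g. "York’s" → "York", "greatest" → "great"), an operator can be seen as a map from a raw query to a set of boolean/phrase rewrites:

```python
def rewrites(query):
    """Map a raw query string to a few boolean/phrase rewrites (illustrative)."""
    terms = query.split()
    return {
        "or": " OR ".join(terms),
        "and": " AND ".join(terms),
        # consecutive 2-grams, quoted as phrases and OR-ed together
        "phrases_or": " OR ".join(f'"{a} {b}"' for a, b in zip(terms, terms[1:])),
    }
```

Each rewrite, when matched against a document, will later delimit a different virtual region.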
6. “Refined” matching
• Extract a number of virtual regions in the
document that match some version of the
query (operators)
• Each region provides a different evidence
of relevance (i.e. signal)
• Aggregate the scores over the different
regions
• Ex. :“at least any two words in the query
appear either consecutively or with an extra
word between them”
8. BM25
• Term (tf) independence
• Vague prior over terms not appearing in the query
• Eliteness - topical model that perturbs the word distribution
• 2-Poisson distribution of term frequencies over relevant and non-relevant documents
[Figure 2: BM25 plate diagram, over the variables R, E_t and tf_t for t ∈ q, representing their independence assumptions.]
Documents are ranked according to p(r | Q = q, D = d), or equivalently by the odds
  p(r | Q = q, D = d) / p(¬r | Q = q, D = d),
which, by Bayes' rule and term independence, is rank-equivalent to a product over the query terms of p(D = d | r, Q = q) / p(D = d | ¬r, Q = q), yielding the per-term weight
  w^BM25_{t,d} = w^idf_t · tf̂_{t,d} / (tf̂_{t,d} + k1)
where tf̂_{t,d} is the length-normalized term frequency.
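The per-term weight can be sketched in a few lines (a standard textbook rendering of BM25 with the usual RSJ-style idf; the parameter defaults k1 = 1.2, b = 0.75 are conventional values, not tuned values from this work):

```python
import math

def bm25_weight(tf, df, N, dl, avdl, k1=1.2, b=0.75):
    """Per-term BM25 weight: idf times a saturating, length-normalized tf."""
    idf = math.log((N - df + 0.5) / (df + 0.5))   # RSJ-style idf
    tf_hat = tf / ((1 - b) + b * dl / avdl)       # length-normalized tf
    return idf * tf_hat / (tf_hat + k1)           # eliteness-driven saturation
```

The saturation tf̂ / (tf̂ + k1) is exactly the component that the virtual-region extension later feeds with an aggregated, multi-operator count.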
9. Virtual regions
• Different parts of the document provide different evidence of relevance
• Create a (finite) set of (latent) artificial regions and re-weight
  p(tf^j_{t,d} = x | r) = Σ_{e ∈ {0,1}} p(tf^j_{t,d} = x | e, r) p(e | r)
                        = Σ_{e,ε ∈ {0,1}} p(tf^j_{t,d} = x | ε_j = ε, e, r) p(ε_j = ε | e, r) p(e | r)
[Plate diagram over the variables R, E_t, ε_j and tf^j_t, for j ∈ M and t ∈ q.]
10. Implementation
• An operator maps a query to a set of queries, which could match a document
• Each operator has a weight
• The average term frequency in a document is
  tf̂_{t,d} = Σ_{j ∈ M} w_j · tf^j_{t,d} / ((1 − b_j) + b_j · |d| / avdl)
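The aggregation can be sketched as follows (our own illustrative code; w[j] and b[j] are the per-operator weight and length-normalization parameters from the formula above). The aggregated count is then passed through the usual BM25 saturation once, so counts from different operators are not double-saturated:

```python
def combined_tf(tf_per_op, w, b, dl, avdl):
    """tf-hat: weighted sum of per-operator, length-normalized frequencies.

    tf_per_op[j] = how often operator j's rewrites match the document;
    w[j], b[j]   = operator j's weight and length-normalization parameter.
    """
    return sum(
        wj * tfj / ((1 - bj) + bj * dl / avdl)
        for tfj, wj, bj in zip(tf_per_op, w, b)
    )

def saturate(tf_hat, k1=1.2):
    """BM25 saturation applied once to the aggregated count."""
    return tf_hat / (tf_hat + k1)
```

For a document of average length (dl = avdl) the normalization denominator is 1 for any b_j, so tf̂ reduces to the plain weighted sum of the raw per-operator counts.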
11. Remarks
• Different saturation (eliteness) function?
• learn the real functional shape!
• log-logistic is good if the class-conditional
distributions are drawn from an exp.
family
• Positions as variables?
• kernel-like method or exp. #parameters
• Apply operators on a per query or per
query class basis?
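To make the "different saturation function" remark concrete, here is the BM25 saturation next to a log-logistic alternative in the style of information-based retrieval models (our illustrative rendering; `lam` stands in for a per-term parameter that would normally be estimated from the collection):

```python
import math

def bm25_saturation(tf, k1=1.2):
    """Bounded, concave saturation used by BM25."""
    return tf / (tf + k1)

def log_logistic(tf, lam=0.5):
    """Log-logistic weighting, log((lam + tf) / lam): unbounded but concave,
    a good fit when the class-conditional tf distributions come from an
    exponential family."""
    return math.log((lam + tf) / lam)
```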
12. Operator examples
• BOW: maps a raw query to the set of
queries whose elements are the single
terms
• p-grams: set of all p-grams of consecutive
terms
• p-and: all conjunctions of p arbitrary
terms
• segments: match only the “concepts”
• Enlargement: some words might sneak in
between the phrases/segments
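Three of these operators admit one-line implementations; the sketch below is ours (each operator maps a tokenized query to a set of derived queries, represented as tuples of terms; phrase matching and enlargement are left to the matching engine):

```python
from itertools import combinations

def bow(terms):
    """BOW: one single-term query per query term."""
    return {(t,) for t in terms}

def p_grams(terms, p):
    """All p-grams of consecutive terms (to be matched as phrases)."""
    return {tuple(terms[i:i + p]) for i in range(len(terms) - p + 1)}

def p_and(terms, p):
    """All conjunctions of p arbitrary (not necessarily adjacent) terms."""
    return set(combinations(terms, p))
```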
14. Research questions
• R1: How do the operators perform individually when combined with BM25(F)?
• R2: How does the combination of
operators work?
• R3: Is there a difference in difficult queries?
15. Single operators results
TERA04 TERA05 TERA06
MAP P@10 P@20 MAP P@10 P@20 MAP P@10 P@20
BM25 0.2648 0.5327 0.5071 0.3228 0.6140 0.5600 0.2928 0.5380 0.5140
BM25F 0.2697 0.5510 0.5143 0.3284 0.6200 0.5570 0.2935 0.5460 0.5160
p-AND p = 2 0.2679 0.5429 0.5286 0.3396* 0.6260 0.5920 0.3069 0.5900 0.5300
phrasal μ = 3 0.2673 0.5306 0.4939 0.3369* 0.6060 0.5790 0.3082* 0.5720 0.5290
segment μ = 1 0.2685* 0.5143 0.4970 0.3274 0.6080 0.5580 0.3180* 0.5860 0.5350
segment μ = 1.5 0.2695* 0.5347 0.5010 0.3272 0.6140 0.5620 0.3122* 0.5940 0.5460
segment μ = 3 0.2690* 0.5143 0.5204 0.3295* 0.6300 0.5840 0.3277* 0.5940 0.5460
p-grams p = 2,μ = 1 0.2786* 0.5530 0.5120 0.3294 0.5860 0.5470 0.3137* 0.5860 0.5470
p-grams p = 3,μ = 1 0.2670 0.5060 0.4950 0.3272 0.5980 0.5560 0.3198* 0.6160 0.5600
Table 2: Performance of single operators (* = statistical significance at p < 0.05 using a one-sided t-test with
respect to BM25, MAP only).
operator. The purpose of this experiment is to determine whether the sigmoid-like term-frequency normalizing function is able to accommodate different features that stem from the same source of evidence (matches of the query terms in the document). Results are significantly better than the baseline and outperform state-of-the-art ranking functions that use only query-term matching (note that we are not adding query-independent evidence, such as link information or click-through data). We examined, on the TREC topics 701-850, the ten queries that produced the largest difference in precision between our system and the BM25 baseline. Table 5 shows the queries that obtained the greatest benefit from the use of segmentation and/or p-gram operators: in the rightmost column of the table, we identify the segments (or p-grams) that contributed most to the increased precision, allowing relevant documents that BM25 missed (because they were ranked too low) to be retrieved.
16. Combination of operators
TERA04 TERA05 TERA06
MAP P@10 P@20 MAP P@10 P@20 MAP P@10 P@20
BM25 0.2648 0.5327 0.5071 0.3228 0.6140 0.5600 0.2928 0.5380 0.5140
BM25F 0.2697 0.5510 0.5143 0.3284 0.6200 0.5570 0.2935 0.5460 0.5160
BM25 + p-grams 0.2898* 0.5939 0.5531 0.3368* 0.6260 0.5800 0.3361* 0.6000 0.5048
BM25 + segment 0.2866* 0.5653 0.5306 0.3285 0.5980 0.5710 0.3188 0.5720 0.5100
BM25F + p-grams 0.2908* 0.5959 0.5541 0.3398* 0.6280 0.5830 0.3319* 0.5940 0.5350
BM25F + segment 0.2869* 0.5653 0.5306 0.3282* 0.5920 0.5510 0.3315* 0.5940 0.5350
Table 3: Performance of a combination of operators (* = statistical significance at p < 0.05 using a one-sided
t-test with respect to BM25, MAP only). Configuration for p-gram is p = 2, μ = {1, 2, 3}, and for segment is
μ = {1, 2, 3}.
WEB WEB-phrasal
MAP NDCG P@10 MAP NDCG P@10
BM25 0.1553 0.2728 0.3892 0.1350 0.2345 0.3133
BM25F 0.1722* 0.3008 0.4285 0.1575* 0.2842 0.3747
BM25 + p-grams 0.1822* 0.3126 0.4390 0.1750* 0.3078 0.4010
BM25 + segment 0.1810* 0.3126 0.4344 0.1725* 0.3029 0.4040
BM25F + p-grams 0.1854* 0.3216 0.4400 0.1769* 0.3123 0.4101
BM25F + segment 0.1842* 0.3190 0.4392 0.1749* 0.3165 0.4005
Table 4: Performance of a combination of operators on the WEB collection (* = statistical significance at p < 0.05 using a one-sided t-test with respect to BM25, MAP only). Configuration for p-gram is p = 2, μ = {1, 2, 3}, and for segment is μ = {1, 2, 3}.
17. Phrases vs regular queries
[Repeats Tables 3 and 4, contrasting performance on the regular WEB queries and the WEB-phrasal query set.]
18. Examples (that worked)
Topic no. BM25 MAP Operators MAP Query Which segment / p-gram helped?
810 0.2759 0.5243 Timeshare resales “Timeshare resales”
750 0.0903 0.3251 John Edwards womens’ issues “John Edwards” and “womens’ issue”
806 0.1279 0.3446 Doctors Without Borders “Doctors Without Borders”
834 0.2819 0.4859 Global positioning system earthquakes “Global positioning”
745 0.1801 0.3828 Doomsday cults “Doomsday cult”
835 0.1306 0.3222 Big Dig pork “Big Dig”
732 0.1438 0.3309 U.S. cheese production “U.S.” and “cheese production”
723 0.1187 0.3030 Executive privilege “Executive privilege”
843 0.2420 0.4228 Pol Pot “Pol Pot”
785 0.2717 0.4323 Ivory billed woodpecker “Ivory billed”
Table 5: The ten queries that produced the largest gap between standard BM25 and our operator-based
scoring. The methods in this table use a combination of BM25, segments and 2-grams.
19. Conclusions
• Developed an extension for BM25 that
accounts for different query operators
(versions of the query)
• The main idea is to individually score the virtual regions that match each operator
• aggregate the evidence per query term
• squash the different signals
• Future work
• apply other query-dependent signals
• select the right operators for a query
21. Feature dependencies
• Class-linearly dependent (or affine) features
• add no extra evidence/signal
• model overfitting (vs capacity)
• Still, it is desirable to enrich the model with
more involved features
• Some features are surprisingly correlated
• Positional information requires a large number of parameters to estimate, ∼ O(|D|^|q|)
22. Query concept segmentation
• Queries are made up of basic conceptual
units, comprising many words
• “Indian summer victor herbert”
• Spurious matches: “san jose airport” -> “san
jose city airport”
• Model to detect segments based on
generative language models and Wikipedia
• Relax matches using factors of the max
ratio between span length and segment
length
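One way to render the last bullet as code (entirely our sketch: the cutoff μ and the 1/stretch discount are hypothetical choices used for illustration, not the paper's exact factor):

```python
def enlargement_factor(span_len, segment_len, mu=1.5):
    """Credit for a segment matched over a document window.

    Full credit (1.0) when the match is contiguous; decaying credit as
    extra words sneak in; zero when the span stretches the segment by
    more than a factor of mu.
    """
    if span_len / segment_len > mu:
        return 0.0
    return segment_len / span_len
```

For instance, the 3-term segment "san jose airport" matched as "san jose city airport" covers a span of 4 positions, so with μ = 1.5 it gets a discounted credit rather than being rejected outright.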