Slides of my presentation at ICTIR'18
September 17, 2018
Tianjin, China
Paper: https://dl.acm.org/citation.cfm?id=3234958
Version with formal proofs: https://arxiv.org/abs/1807.04317
A Formal Account of Effectiveness Evaluation and Ranking Fusion
1. A Formal Account of
Effectiveness
Evaluation and
Ranking Fusion
Enrique Amigó, Fernando Giner,
Stefano Mizzaro, Damiano Spina
ICTIR’18, Tianjin, China
2. Introduction
• Known statements and empirical observations in the literature
• Top-heaviness Principle: Highly-ranked documents have more weight in the
evaluation process [Busin and Mizzaro, 2013]
• Ranking Fusion Effectiveness: Unsupervised ranking fusion outperforms
single rankings [Lee, 1995; Montague and Aslam, 2002; Vogt and Garrison,
1998; Fu, 2012; Kurland and Culpepper, 2018]
Research Question: Can we model these phenomena in a
common theoretical framework?
3. Intuition
• Observations of how documents are retrieved by a given set of signals
• E.g., Document d is unanimously ranked higher than d’ by all given
systems/rel. judgments
• Quantify the information captured by those observations
• Define an entropy-like notion that allows the formalization of system
effectiveness and ranking fusion
4. An Example
How many times is a document unanimously outscored
by other documents, according to Γ?
d₁ is only unanimously outscored by itself (d₁)
d₂ is unanimously outscored by d₁ and d₂
d₃ is unanimously outscored by d₃
d₄ is unanimously outscored by d₁ and d₄
d₅, d₆, … are unanimously outscored by all documents in D
A document, d, is unanimously outscored by another document, d′,
according to a set of signals, Γ, whenever it is outscored for every
signal simultaneously: d′ ≥Γ d ⟺ ∀γ ∈ Γ. γ(d′) ≥ γ(d)
• Set of signals Γ = {s₁, s₂, s₃, g}
(rankings + human assessments)
• Collection D with a large
number of documents
• Documents not retrieved share
the same infinite rank.
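The counting in this example can be sketched in Python. The scores and document names below are toy values for illustration, not the deck's actual example:

```python
def unanimously_outscores(d_prime, d, signals):
    """True iff gamma(d') >= gamma(d) for every signal gamma in the set."""
    return all(gamma[d_prime] >= gamma[d] for gamma in signals)

# Toy signals mapping document id -> score (hypothetical values):
# two system scores plus a gold signal g, mimicking the Γ of the slide.
s1 = {"d1": 3, "d2": 2, "d3": 1, "d4": 0}
s2 = {"d1": 3, "d2": 1, "d3": 2, "d4": 0}
g  = {"d1": 1, "d2": 1, "d3": 0, "d4": 0}
signals = [s1, s2, g]
docs = ["d1", "d2", "d3", "d4"]

# Every document unanimously outscores itself, so each count is >= 1.
counts = {d: sum(unanimously_outscores(dp, d, signals) for dp in docs)
          for d in docs}
```

Here d1 is outscored only by itself, while d4 is outscored by every document, matching the pattern in the slide's example.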
5. Observational Information Quantity (OIQ)
• The Observational Information Quantity, OIQ(d), of a document, d,
under a set of signals, Γ, is the minus logarithm of the probability
of being unanimously outscored by other documents in D:
OIQ(d) = −log P(d′ ≥Γ d), for d′ drawn from D
where d′ ≥Γ d ⟺ ∀γ ∈ Γ. γ(d′) ≥ γ(d)
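A minimal numeric sketch of this definition, with hypothetical scores (the paper's exact probability estimator may differ):

```python
import math

def oiq(d, docs, signals):
    """OIQ(d) = -log P(d' >=_Γ d): minus logarithm of the probability
    that a document drawn uniformly from the collection unanimously
    outscores d (sketch; estimator assumed, not taken from the paper)."""
    outscored = sum(all(gamma[dp] >= gamma[d] for gamma in signals)
                    for dp in docs)
    return -math.log(outscored / len(docs))

# Toy signals (illustrative scores only).
s1 = {"d1": 3, "d2": 2, "d3": 1, "d4": 0}
s2 = {"d1": 3, "d2": 1, "d3": 2, "d4": 0}
signals = [s1, s2]
docs = ["d1", "d2", "d3", "d4"]

# d1 is outscored only by itself -> rarest observation -> highest OIQ;
# d4 is outscored by every document -> probability 1 -> OIQ = 0.
```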
6. An Example
• Set of signals Γ = {s₁, s₂, s₃, g}
(rankings + human assessments)
• Collection D with a large
number of documents
• Documents not retrieved share
the same infinite rank.
7. Observational Entropy
• We know how to quantify the information of
observing documents
• How do we measure the information quantity
captured in a set of signals?
• The observational entropy of a given set of
signals captures how unlikely it is to find
unanimous improvement
• It is the average OIQ of the documents
the signals retrieve (DΓ):
H(Γ) = (1/|DΓ|) Σ_{d ∈ DΓ} OIQ(d)
• If we compare a ranking against the ground
truth: the lower the entropy, the more similar
the ranking is to the ground truth
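The entropy-as-average-OIQ idea, and the "lower entropy means closer to the gold" claim, can be sketched on toy data (scores and names are hypothetical):

```python
import math

def oiq(d, docs, signals):
    """Minus log-probability of being unanimously outscored (sketch)."""
    outscored = sum(all(g[dp] >= g[d] for g in signals) for dp in docs)
    return -math.log(outscored / len(docs))

def observational_entropy(signals, docs, retrieved):
    """Average OIQ over the documents the signals retrieve (sketch)."""
    return sum(oiq(d, docs, signals) for d in retrieved) / len(retrieved)

# A ranking that agrees with the gold vs. one that inverts it.
gold   = {"d1": 2, "d2": 1, "d3": 0}
agree  = {"d1": 2, "d2": 1, "d3": 0}
invert = {"d1": 0, "d2": 1, "d3": 2}
docs = ["d1", "d2", "d3"]

h_agree  = observational_entropy([gold, agree],  docs, docs)
h_invert = observational_entropy([gold, invert], docs, docs)
# Agreement makes unanimous outscoring common -> probabilities high
# -> OIQ low -> lower entropy for the ranking similar to the gold.
```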
8. Summary so far
• Intuition
• Documents higher in the rank provide/carry more information
• Quantified with OIQ (Observational Information Quantity)
• Observational Entropy
9. Properties
• OIQ of a document under a single signal γ grows with its signal value or
score
• The observational entropy of a single ranking γ depends exclusively on its
length
• Neither observational entropy nor OIQ decreases when adding more
signals to the set Γ
• Observational entropy and OIQ are invariant under redundant signals.
• If a preference between two documents in γ is not corroborated by any
signal in the set then the entropy strictly increases when adding the signal
to the set Γ
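Two of these properties, invariance under redundant signals and monotonicity in the signal set, can be checked numerically (a sketch with hypothetical scores):

```python
import math

def oiq(d, docs, signals):
    """Minus log-probability of being unanimously outscored (sketch)."""
    p = sum(all(g[dp] >= g[d] for g in signals) for dp in docs) / len(docs)
    return -math.log(p)

docs = ["d1", "d2", "d3"]
s1 = {"d1": 2, "d2": 1, "d3": 0}
s2 = {"d1": 0, "d2": 1, "d3": 2}

# Invariance under redundant signals: duplicating s1 changes nothing,
# since unanimity over {s1, s1} is the same condition as over {s1}.
redundant_invariant = all(
    math.isclose(oiq(d, docs, [s1]), oiq(d, docs, [s1, s1])) for d in docs)

# Monotonicity: adding a signal makes unanimity harder to achieve,
# so the outscoring probability can only shrink and OIQ never drops.
monotone = all(oiq(d, docs, [s1, s2]) >= oiq(d, docs, [s1]) for d in docs)
```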
11. Measuring Effectiveness with OIQ
• Given a ranking r and a ground truth g:
• Observational Information Effectiveness (OIE):
• Linear combination of observational entropies of single and joint
signals
• Inspired by the Informational Contrast Model defined for text
similarity [Amigó et al. 2017; Hick, 1952]
• OIE(r, g) captures how similar the ranking is to the ground truth
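Assuming an ICM-style combination, with coefficient names α₁, α₂, β chosen to match the constraints on the following slide (the exact form is in the paper), the measure can be written as:

```latex
% ICM-style linear combination (assumed form; see the paper for the exact one):
\mathrm{OIE}(r, g) \;=\; \alpha_1 \, \mathcal{H}(\{r\})
  \;+\; \alpha_2 \, \mathcal{H}(\{g\})
  \;-\; \beta \, \mathcal{H}(\{r, g\})
```

where H({r, g}) is the observational entropy of the joint signal set: the more r agrees with g, the smaller the joint term, and the larger OIE.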
12. How OIE explains effectiveness
• OIE satisfies a number of formal constraints for effectiveness [Amigó et al. 2013]
• If β > 0:
• Priority: Swapping contiguous documents in concordance with the gold increases effectiveness
• Deepness: The effect of swapping is larger at the top of the ranking
• Deepness Threshold: Retrieving one relevant document is better than a huge amount of relevant
documents after a huge set of irrelevant documents
• If 1 < β < (an upper bound in terms of the α coefficients; see paper):
• Closeness Threshold: there exists a certain area at the top of the ranking in which n relevant
documents is better than only one (the user always inspects at least the n first documents)
• If α₁ > 0 and β > α₁:
• Confidence: Adding irrelevant documents at the bottom of the ranking decreases effectiveness
15. Ranking Fusion with OIQ
Experiment
• Gov-2 collection and the topics 701 to 750 used in the TREC 2004 Terabyte
Track
• 60 official runs, top 100 documents in the rankings.
• Random sample of test cases: 1 topic; Γ = 5 runs; γ = 1 run from Γ
• Assumption: Adding signals increases the probability of estimating
relevance under an OIQ increase
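The fusion step can be sketched as: rank documents by their OIQ under the set of runs. This is a minimal sketch under assumptions (rank 1 = best, unretrieved documents share one rank below all retrieved ones); run names and data are illustrative, not the TREC runs:

```python
import math

def oiq_fusion(rankings):
    """Fuse rankings by sorting documents by their OIQ under the set of
    rankings (sketch of OIQ as an unsupervised fusion method)."""
    docs = set().union(*rankings)
    worst = max(max(r.values()) for r in rankings) + 1

    def rank(r, d):
        return r.get(d, worst)

    def oiq(d):
        # Probability that a random document unanimously outscores d,
        # i.e., is ranked at least as high in every ranking.
        p = sum(all(rank(r, dp) <= rank(r, d) for r in rankings)
                for dp in docs) / len(docs)
        return -math.log(p)

    return sorted(docs, key=oiq, reverse=True)

# Two toy runs over documents a, b, c (illustrative only).
run1 = {"a": 1, "b": 2, "c": 3}
run2 = {"a": 1, "c": 2, "b": 3}
fused = oiq_fusion([run1, run2])
```

Document a, ranked first by both runs, is outscored only by itself, so it gets the highest OIQ and tops the fused ranking.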
16. Ranking Fusion with OIQ
[Scatter plot: x-axis P(d ≥g d′ | d ≥γ d′) vs. y-axis P(d ≥g d′ | d ≥Γ d′),
both axes ranging from 0.75 to 1.00]
• X-axis: Probability of relevance
estimated by a single signal
• Y-axis: Probability of relevance
estimated by a set of signals
Adding signals helps most of the time!
In the paper: OIQ is (only!) as effective as Borda count
17. Summary
• Can we explain phenomena in IR such as effectiveness measurement and ranking fusion with a
common theoretical framework?
• Observational Information Quantity (OIQ)
• The higher the information quantity of their observations (in
signals), the more likely documents are to be relevant
• An evaluation measure derived from this framework (OIE) satisfies formal constraints for ranking
effectiveness
• Using OIQ as a ranking fusion method outperforms single signals and performs similarly to (but not
better than) other ranking fusion methods (Borda count)
Future work: Does OIQ explain other IR phenomena?
• GLARE CIKM’18 workshop paper with preliminary results
• Evaluation without human assessments?
• Weak supervision?
18. A Formal Account of
Effectiveness
Evaluation and
Ranking Fusion
Enrique Amigó, Fernando Giner,
Stefano Mizzaro, Damiano Spina
Formal proofs available at:
http://bit.ly/ObservationalInformationQuantity_proofs