A Formal Account of Effectiveness Evaluation and Ranking Fusion
Enrique Amigó, Fernando Giner, Stefano Mizzaro, Damiano Spina
ICTIR’18, Tianjin, China
Introduction
• Known statements and empirical observations in the literature
• Top-heaviness Principle: Highly-ranked documents have more weight in the
evaluation process [Busin and Mizzaro, 2013]
• Ranking Fusion Effectiveness: Unsupervised ranking fusion outperforms
single rankings [Lee, 1995; Montague and Aslam, 2002; Vogt and Garrison,
1998; Fu, 2012; Kurland and Culpepper, 2018]
Research Question: Can we model these phenomena in a
common theoretical framework?
Intuition
• Observations of how documents are retrieved by a given set of signals
• E.g., document d is unanimously ranked higher than d′ by all given
systems/relevance judgments
• Quantify the information captured by those observations
• Define an entropy-like notion that allows the formalization of system
effectiveness and ranking fusion
An Example
How many documents unanimously outscore a given document under Γ?
d_1 is unanimously outscored only by itself (d_1)
d_2 is unanimously outscored by d_1 and d_2
d_3 is unanimously outscored by d_3
d_4 is unanimously outscored by d_1 and d_4
d_5, d_6, … are unanimously outscored by all documents in D
A document, d, is unanimously outscored by another document, d′,
according to a set of signals, Γ, whenever it is outscored for every
signal simultaneously: d′ ≥_Γ d ⟺ ∀γ ∈ Γ. γ(d′) ≥ γ(d)
• Set of signals Γ: three rankings and the human assessments g
(rankings + human assessments)
• Collection D with a large
number of documents
• Documents not retrieved share
the same infinite rank.
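The unanimous-outscoring relation can be sketched in a few lines of Python. This is only an illustration: the data behind the slide's figure is not available, so the three rankings and the judgments below are hypothetical values chosen to be consistent with the statements on the slide. Representing a signal as a dictionary from document to score, and giving unretrieved documents a score of minus infinity, are assumptions of this sketch.

```python
import math

def score(signal, doc):
    # Assumption: a signal is a dict doc -> score (higher is better);
    # documents it does not retrieve all share the same lowest score.
    return signal.get(doc, -math.inf)

def outscorers(doc, signals, collection):
    """Documents that score at least as high as `doc` under every signal."""
    return [other for other in collection
            if all(score(s, other) >= score(s, doc) for s in signals)]

# Hypothetical toy data (not the data behind the slide's figure).
collection = ["d1", "d2", "d3", "d4", "d5", "d6"]
r1 = {"d1": 4, "d2": 3, "d4": 2, "d3": 1}
r2 = {"d1": 4, "d4": 3, "d2": 2, "d3": 1}
r3 = {"d3": 4, "d1": 3, "d2": 2, "d4": 1}
g = {"d1": 1, "d2": 1, "d3": 1, "d4": 0}
signals = [r1, r2, r3, g]

for d in collection:
    print(d, "is unanimously outscored by", outscorers(d, signals, collection))
```

Every document outscores itself, which is why d1 and d3 each list only themselves, while the unretrieved documents d5 and d6 are outscored by the whole collection.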
Observational Information Quantity (OIQ)
• The Observational Information Quantity, I_Γ(d), of a document, d,
under a set of signals, Γ, is the minus
logarithm of the probability of being
unanimously outscored by other
documents in D:
I_Γ(d) = −log P_{d′ ∈ D}(d′ ≥_Γ d)
where d′ ≥_Γ d ⟺ ∀γ ∈ Γ. γ(d′) ≥ γ(d)
An Example
• Set of signals Γ: three rankings and the human assessments g
(rankings + human assessments)
• Collection D with a large
number of documents
• Documents not retrieved share
the same infinite rank.
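Following the definition of OIQ as the minus logarithm of the probability of being unanimously outscored, the example can be worked through numerically. The sketch below estimates that probability as the fraction of the collection that unanimously outscores the document, and uses a base-2 logarithm; both choices, like the toy signals themselves, are assumptions made for illustration.

```python
import math

def score(signal, doc):
    # Assumption: unretrieved documents share the same lowest score.
    return signal.get(doc, -math.inf)

def oiq(doc, signals, collection):
    """Minus log2 of the fraction of the collection that unanimously outscores `doc`."""
    n_out = sum(all(score(s, other) >= score(s, doc) for s in signals)
                for other in collection)
    # log2(N / n) equals -log2(n / N); n_out >= 1 because doc outscores itself.
    return math.log2(len(collection) / n_out)

# Hypothetical toy data (not the data behind the slide's figure).
collection = ["d1", "d2", "d3", "d4", "d5", "d6"]
r1 = {"d1": 4, "d2": 3, "d4": 2, "d3": 1}
r2 = {"d1": 4, "d4": 3, "d2": 2, "d3": 1}
r3 = {"d3": 4, "d1": 3, "d2": 2, "d4": 1}
g = {"d1": 1, "d2": 1, "d3": 1, "d4": 0}
signals = [r1, r2, r3, g]

for d in collection:
    print(d, round(oiq(d, signals, collection), 3))
```

Documents that only they themselves outscore (d1, d3) get the highest OIQ, and documents that no signal retrieves get an OIQ of zero.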
Observational Entropy
• We know how to quantify the information carried by
the observation of a document
• How do we measure the information quantity
captured in a set of signals?
• By taking the average OIQ of the
documents the signals retrieve
• The observational entropy of a given set of
signals captures how unlikely it is to find
unanimous improvement
• If we compare a ranking against the ground
truth: the lower the entropy, the more
similar the ranking is to the ground truth
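A minimal sketch of observational entropy as an average of OIQ values follows. The slide's formula is not available here, so the normalization is an assumption: the sum runs over the whole collection and is divided by its size, which means documents that no signal retrieves contribute zero and only retrieved documents add to the total. The rankings, judgments, and base-2 logarithm are likewise hypothetical choices.

```python
import math

def score(signal, doc):
    # Assumption: unretrieved documents share the same lowest score.
    return signal.get(doc, -math.inf)

def oiq(doc, signals, collection):
    n_out = sum(all(score(s, other) >= score(s, doc) for s in signals)
                for other in collection)
    return math.log2(len(collection) / n_out)

def observational_entropy(signals, collection):
    # Assumed normalization: average OIQ over the collection
    # (unretrieved documents have OIQ 0 and add nothing to the sum).
    return sum(oiq(d, signals, collection) for d in collection) / len(collection)

# Hypothetical toy data: d1 and d2 are the relevant documents.
collection = ["d1", "d2", "d3", "d4", "d5", "d6"]
g = {"d1": 1, "d2": 1}
r_close = {"d1": 4, "d2": 3, "d3": 2, "d4": 1}  # relevant documents on top
r_far = {"d3": 4, "d4": 3, "d1": 2, "d2": 1}    # relevant documents at the bottom

h_close = observational_entropy([r_close, g], collection)
h_far = observational_entropy([r_far, g], collection)
print(round(h_close, 3), round(h_far, 3))
```

In this toy case the ranking that agrees with the ground truth yields the lower joint entropy, in line with the last bullet above.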
Summary so far
• Intuition
• Documents higher in the ranking provide/carry more information
• Quantified with OIQ (Observational Information Quantity)
• Observational Entropy
Properties
• OIQ of a document under a single signal γ grows with its signal value or
score
• The observational entropy of a single ranking γ depends exclusively on its
length
• Neither observational entropy nor OIQ decreases when more
signals are added to the set Γ
• Observational entropy and OIQ are invariant under redundant signals.
• If a preference between two documents in γ is not corroborated by any
signal in the set, then the entropy strictly increases when the signal is added
to the set Γ
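These properties can be checked numerically on a small case. The sketch below is not a proof and rests on assumptions: signals are dictionaries of scores, the probability is estimated as a fraction of the collection, logarithms are base 2, and entropy is averaged over the whole collection. All rankings are hypothetical.

```python
import math

def score(signal, doc):
    # Assumption: unretrieved documents share the same lowest score.
    return signal.get(doc, -math.inf)

def oiq(doc, signals, collection):
    n_out = sum(all(score(s, other) >= score(s, doc) for s in signals)
                for other in collection)
    return math.log2(len(collection) / n_out)

def entropy(signals, collection):
    # Assumed normalization: average OIQ over the whole collection.
    return sum(oiq(d, signals, collection) for d in collection) / len(collection)

collection = ["a", "b", "c", "d", "e"]
ra = {"a": 3, "b": 2, "c": 1}  # ranking a, b, c
rb = {"c": 3, "e": 2, "a": 1}  # different documents, same length
rc = {"b": 3, "a": 2, "c": 1}  # prefers b over a, unlike ra

# Entropy of a single ranking depends only on its length.
same_length = math.isclose(entropy([ra], collection), entropy([rb], collection))

# OIQ does not decrease when a signal is added.
monotonic = all(oiq(d, [ra, rc], collection) >= oiq(d, [ra], collection)
                for d in collection)

# A redundant (duplicated) signal changes nothing.
redundant = all(oiq(d, [ra, ra], collection) == oiq(d, [ra], collection)
                for d in collection)

# rc holds a preference (b over a) that ra does not corroborate.
strict = entropy([ra, rc], collection) > entropy([ra], collection)

print(same_length, monotonic, redundant, strict)
```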
Measuring Effectiveness with OIQ
• Given a ranking r and a ground truth g:
• Observational Information Effectiveness (OIE):
• Linear combination of observational entropies of single and joint
signals
• Inspired by the Informational Contrast Model defined for text
similarity [Amigó et al. 2017; Hick, 1952]
• OIE(r, g) captures how similar the ranking is to the ground truth
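The slide's OIE formula is not available here, so the following is only a sketch of one possible linear combination, not the authors' definition. It assumes the form alpha_1 * H({r}) + alpha_2 * H({g}) - beta * H({r, g}), where H is the observational entropy; the weight names, the signs, the example weight values, the base-2 logarithm, and the toy data are all assumptions.

```python
import math

def score(signal, doc):
    # Assumption: unretrieved documents share the same lowest score.
    return signal.get(doc, -math.inf)

def oiq(doc, signals, collection):
    n_out = sum(all(score(s, other) >= score(s, doc) for s in signals)
                for other in collection)
    return math.log2(len(collection) / n_out)

def entropy(signals, collection):
    # Assumed normalization: average OIQ over the whole collection.
    return sum(oiq(d, signals, collection) for d in collection) / len(collection)

def oie(r, g, collection, alpha_1, alpha_2, beta):
    # Assumed form: single-signal entropies weighted positively,
    # the joint entropy of ranking and ground truth weighted negatively.
    return (alpha_1 * entropy([r], collection)
            + alpha_2 * entropy([g], collection)
            - beta * entropy([r, g], collection))

# Hypothetical toy data: d1 and d2 are the relevant documents.
collection = ["d1", "d2", "d3", "d4", "d5", "d6"]
g = {"d1": 1, "d2": 1}
r_close = {"d1": 4, "d2": 3, "d3": 2, "d4": 1}
r_far = {"d3": 4, "d4": 3, "d1": 2, "d2": 1}

weights = dict(alpha_1=1.0, alpha_2=1.0, beta=1.5)  # illustrative values only
oie_close = oie(r_close, g, collection, **weights)
oie_far = oie(r_far, g, collection, **weights)
print(oie_close > oie_far)
```

Both rankings have the same length and therefore the same single-signal entropy, so under this assumed form the difference in score comes entirely from the joint entropy term.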
How OIE explains effectiveness
• OIE satisfies a number of formal constraints for effectiveness [Amigó et al. 2013]
• If β > 0:
• Priority: Swapping contiguous documents in concordance with the gold standard increases effectiveness
• Deepness: The effect of swapping is larger at the top of the ranking
• Deepness Threshold: Retrieving one relevant document is better than retrieving a huge number of relevant
documents after a huge set of irrelevant documents
• If β > 1 (up to an upper bound):
• Closeness Threshold: There exists a certain area at the top of the ranking in which retrieving n relevant
documents is better than retrieving only one (the user always inspects at least the first n documents)
• If α_1 > 0 and β > α_1:
• Confidence: Adding irrelevant documents at the bottom of the ranking decreases effectiveness
Ranking Fusion with OIQ
Experiment
• Gov-2 collection and the topics 701 to 750 used in the TREC 2004 Terabyte
Track
• 60 official runs, top 100 documents in the rankings.
• Random sample of test cases: 1 topic; Γ = 5 runs; γ = 1 run from Γ
• Assumption: Adding signals increases the probability that a document
with a higher OIQ is also more relevant
Ranking Fusion with OIQ
[Figure: both axes range from 0.75 to 1.00]
• X-axis: P(d ≥_g d′ | d ≥_γ d′), the probability of relevance
estimated by a single signal
• Y-axis: P(d ≥_g d′ | d ≥_{I_Γ} d′), the probability of relevance
estimated by a set of signals
Adding signals helps most of the time!
In the paper: OIQ is (only!) as effective as Borda count
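To make the comparison concrete, the sketch below fuses a few runs by sorting documents by decreasing OIQ and sets the result next to a simple Borda count. It is not the experimental setup from the paper: the three runs are hypothetical, the Borda variant (each run awards a document as many points as documents ranked below it, unranked documents get zero) is one common textbook form, and ties are broken by collection order.

```python
import math

def to_scores(run):
    # A run is a list of document ids, best first; higher score is better.
    return {doc: len(run) - pos for pos, doc in enumerate(run)}

def oiq(doc, signals, collection):
    # Assumption: unretrieved documents share the same lowest score.
    def s(sig, d):
        return sig.get(d, -math.inf)
    n_out = sum(all(s(sig, other) >= s(sig, doc) for sig in signals)
                for other in collection)
    return math.log2(len(collection) / n_out)

def fuse_by_oiq(runs, collection):
    signals = [to_scores(run) for run in runs]
    # sorted() is stable, so ties keep their collection order.
    return sorted(collection, key=lambda d: -oiq(d, signals, collection))

def fuse_by_borda(runs, collection):
    points = {d: 0 for d in collection}
    for run in runs:
        for pos, doc in enumerate(run):
            points[doc] += len(run) - 1 - pos
    return sorted(collection, key=lambda d: -points[d])

# Hypothetical runs over a five-document collection.
collection = ["a", "b", "c", "d", "e"]
runs = [["a", "b", "c", "d"], ["a", "c", "b", "d"], ["a", "b", "c", "d"]]

oiq_fused = fuse_by_oiq(runs, collection)
borda_fused = fuse_by_borda(runs, collection)
print(oiq_fused, borda_fused)
```

Because OIQ relies on unanimity, it leaves b and c tied in this example (neither outscores the other in every run), whereas the Borda count separates them by their point totals.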
Summary
• Can we explain phenomena in IR such as effectiveness measurement and ranking fusion with a
common theoretical framework?
• Observational Information Quantity (OIQ)
• The higher the information quantity of a document's observations (in signals), the more likely
the document is to be relevant
• An evaluation measure derived from this framework (OIE) satisfies formal constraints for ranking
effectiveness
• Using OIQ as a ranking fusion method outperforms single signals and performs similarly to (but not
better than) other ranking fusion methods (Borda Count)
Future work: Does OIQ explain other IR phenomena?
GLARE CIKM’18 workshop paper with preliminary results
Evaluation without human assessments?
Weak supervision?
Formal proofs available at:
http://bit.ly/ObservationalInformationQuantity_proofs
