SIGIR 2013
Dublin, Ireland · July 30th
Picture by Philip Milne
On the Measurement of
Test Collection Reliability
@julian_urbano University Carlos III of Madrid
Mónica Marrero University Carlos III of Madrid
Diego Martín Technical University of Madrid
Gratefully supported
by Student Travel Grant
Is System A More Effective
than System B?
[Figure: Δeffectiveness axis from -1 to 1, marking 0 and 𝑑]
Get a test collection and evaluate
Measure the average difference 𝒅
and conclude which one is better
Samples
Test collections are samples from a
larger, possibly infinite, population
Documents, queries and assessors
𝒅 is only an estimate
How reliable is our conclusion?
Reliability vs. Cost
Building reliable collections is easy…
Just use more documents, more queries,
more assessors
…but it is prohibitively expensive
Our best bet is to increase query set size
Data-based approach
1. Randomly split query set
2. Compute indicators of reliability
based on those two subsets
3. Extrapolate to larger query sets
…with some variations
Voorhees’98, Zobel’98, Buckley & Voorhees’00,
Voorhees & Buckley’02, Sanderson & Zobel’05,
Sakai’07, Voorhees’09
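The split-half procedure above can be sketched as follows. This is a minimal illustration, not the code used in the cited studies: the score matrix is synthetic, the names `kendall_tau` and `split_half_tau` are hypothetical, and Kendall τ is computed without tie correction.

```python
import random
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

def split_half_tau(scores, rng):
    """scores[s][q] is the effectiveness of system s on query q.
    Split the queries at random into two halves and correlate the
    mean scores of the systems over each half."""
    n_q = len(scores[0])
    queries = list(range(n_q))
    rng.shuffle(queries)
    half = n_q // 2
    first, second = queries[:half], queries[half:2 * half]
    mean_first = [sum(row[q] for q in first) / half for row in scores]
    mean_second = [sum(row[q] for q in second) / half for row in scores]
    return kendall_tau(mean_first, mean_second)

# Synthetic data (an assumption): a per-system effect plus per-query noise.
rng = random.Random(42)
n_systems, n_queries = 8, 50
scores = [[0.3 + 0.03 * s + rng.gauss(0, 0.1) for _ in range(n_queries)]
          for s in range(n_systems)]
taus = [split_half_tau(scores, rng) for _ in range(100)]
print("mean split-half tau:", sum(taus) / len(taus))
```

Repeating the split for several subset sizes gives the points from which reliability at larger query set sizes is extrapolated.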
Data-based Reliability Indicators
based on results with two collections
Kendall 𝝉 correlation
stability of the ranking of systems
$\tau_{AP}$ correlation
adds a top-heaviness component
Absolute sensitivity
minimum absolute 𝒅 s.t. swaps <5%
Relative sensitivity
minimum relative 𝒅 s.t. swaps <5%
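To illustrate what top-heaviness means, here is a small sketch of the AP correlation under the usual definition, τ_AP = 2/(N−1) · Σ C(i)/(i−1) − 1, where C(i) counts the systems ranked above position i that the reference ranking also places above that system. The function name `tau_ap` is hypothetical and ties are assumed absent.

```python
def tau_ap(reference, estimated):
    """AP correlation of the ranking induced by `estimated` scores with
    respect to the ranking induced by `reference` scores (no ties assumed)."""
    n = len(reference)
    # Systems sorted from best to worst according to the estimated scores.
    order = sorted(range(n), key=lambda s: estimated[s], reverse=True)
    total = 0.0
    for i in range(1, n):
        item = order[i]
        # Systems above `item` that the reference also ranks above it.
        correct = sum(1 for above in order[:i]
                      if reference[above] > reference[item])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0

reference = [4, 3, 2, 1]
swap_at_top = [3, 4, 2, 1]      # the two best systems are swapped
swap_at_bottom = [4, 3, 1, 2]   # the two worst systems are swapped
print(tau_ap(reference, swap_at_top), tau_ap(reference, swap_at_bottom))
```

Both rankings differ from the reference by a single swap, so Kendall τ scores them equally, while τ_AP penalizes the swap at the top more.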
Data-based Reliability Indicators
based on results with two collections
Power ratio
statistically significant results
Minor conflict ratio
statistically non-significant swap
Major conflict ratio
statistically significant swap
RMSE
differences in 𝒅
Generalizability Theory
Directly address variability of scores
G-study
Estimate variance components
from previous, representative, data
D-study
Estimate reliability based on
estimated variance components
G-study
$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$
$\sigma_s^2$: system differences, our goal!
$\sigma_q^2$: query difficulty
$\sigma_{s:q}^2$: some systems better for some queries
Estimated using Analysis of Variance
From previous data,
usually an existing test collection
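A minimal sketch of how the three variance components might be estimated from a systems × queries score matrix, assuming a fully crossed design with one score per cell and the standard expected-mean-squares estimators. The function name `g_study`, the truncation of negative estimates to zero, and the toy data are assumptions, not the authors' R code.

```python
def g_study(scores):
    """Estimate (var_s, var_q, var_sq) from scores[s][q] by two-way ANOVA."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_q)
    sys_mean = [sum(row) / n_q for row in scores]
    qry_mean = [sum(row[q] for row in scores) / n_s for q in range(n_q)]

    # Mean squares for systems, queries and the residual (interaction).
    ms_s = n_q * sum((m - grand) ** 2 for m in sys_mean) / (n_s - 1)
    ms_q = n_s * sum((m - grand) ** 2 for m in qry_mean) / (n_q - 1)
    ms_res = sum((scores[s][q] - sys_mean[s] - qry_mean[q] + grand) ** 2
                 for s in range(n_s) for q in range(n_q)) / ((n_s - 1) * (n_q - 1))

    # With one score per cell the interaction is confounded with the error.
    var_sq = ms_res
    # Negative estimates are truncated to zero (a common convention).
    var_s = max((ms_s - ms_res) / n_q, 0.0)
    var_q = max((ms_q - ms_res) / n_s, 0.0)
    return var_s, var_q, var_sq

# Toy data (an assumption): purely additive system and query effects,
# so the interaction component should come out as zero.
system_effect = [0.1, 0.2, 0.3]
query_effect = [0.0, 0.1, 0.2, 0.3]
scores = [[a + b for b in query_effect] for a in system_effect]
var_s, var_q, var_sq = g_study(scores)
print(var_s, var_q, var_sq)
```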
D-study
Relative stability
$E\rho^2 = \frac{\sigma_s^2}{\sigma_s^2 + \frac{\sigma_{s:q}^2}{n'_q}}$
Absolute stability
$\Phi = \frac{\sigma_s^2}{\sigma_s^2 + \frac{\sigma_q^2 + \sigma_{s:q}^2}{n'_q}}$
Easy to estimate how many queries we
need for a certain stability level
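Solving the two stability formulas for the number of queries gives n'_q = T·σ²_{s:q} / (σ²_s·(1−T)) for a relative target T, and the same expression with σ²_q + σ²_{s:q} in the numerator for an absolute target. The sketch below applies this; the function names and the variance components are hypothetical values chosen for illustration.

```python
import math

def relative_stability(var_s, var_sq, n_q):
    """E-rho^2 for a collection with n_q queries."""
    return var_s / (var_s + var_sq / n_q)

def absolute_stability(var_s, var_q, var_sq, n_q):
    """Phi for a collection with n_q queries."""
    return var_s / (var_s + (var_q + var_sq) / n_q)

def queries_for_relative(var_s, var_sq, target):
    """Smallest n_q whose E-rho^2 reaches `target` (0 < target < 1)."""
    return math.ceil(target * var_sq / (var_s * (1.0 - target)))

def queries_for_absolute(var_s, var_q, var_sq, target):
    """Smallest n_q whose Phi reaches `target` (0 < target < 1)."""
    return math.ceil(target * (var_q + var_sq) / (var_s * (1.0 - target)))

# Hypothetical variance components, as if returned by a G-study.
var_s, var_q, var_sq = 0.01, 0.02, 0.031

print("E-rho^2 with 50 queries:", relative_stability(var_s, var_sq, 50))
print("Phi with 50 queries:", absolute_stability(var_s, var_q, var_sq, 50))
n_rel = queries_for_relative(var_s, var_sq, 0.95)
n_abs = queries_for_absolute(var_s, var_q, var_sq, 0.95)
print("queries for E-rho^2 = 0.95:", n_rel)
print("queries for Phi = 0.95:", n_abs)
```

Because Φ also carries the query difficulty component, it always needs at least as many queries as Eρ² for the same target.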
Generalizability Theory
Proposed by Bodoff’07
Kanoulas & Aslam’09
derive optimal gain & discount in nDCG
TREC Million Query Track
≈80 queries sufficient for stable rankings
≈130 queries for stable absolute scores
In this Paper / Talk
How sensitive is the D-study to the
initial data used in the G-study?
How to interpret G-theory in practice,
why $E\rho^2 = 0.95$ and $\Phi = 0.95$?
From the above two, review the
reliability of >40 TREC test collections
variability of G-theory
indicators of reliability
Data
43 TREC collections
from TREC-3 to TREC 2011
12 tasks across 10 tracks
Ad Hoc, Web, Novelty, Genomics,
Robust, Terabyte, Enterprise, Million
Query, Medical and Microblog
Experiment
Vary number of queries in G-study
from $n_q = 5$ to full set
Use all runs available
Run D-study
Compute $E\rho^2$, $\Phi$
Compute $n'_q$ to reach 0.95 stability
200 random trials
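The experiment can be sketched as a resampling loop. This is only an illustration under stated assumptions: the score matrix is synthetic, the D-study is evaluated at the full query set size, only Eρ² is tracked, and the helper names are hypothetical.

```python
import random

def variance_components(x):
    """Return (var_s, var_sq) for scores x[s][q] via two-way ANOVA."""
    n_s, n_q = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n_s * n_q)
    s_mean = [sum(row) / n_q for row in x]
    q_mean = [sum(row[q] for row in x) / n_s for q in range(n_q)]
    ms_s = n_q * sum((m - grand) ** 2 for m in s_mean) / (n_s - 1)
    ms_res = sum((x[s][q] - s_mean[s] - q_mean[q] + grand) ** 2
                 for s in range(n_s) for q in range(n_q)) / ((n_s - 1) * (n_q - 1))
    return max((ms_s - ms_res) / n_q, 0.0), ms_res

def e_rho2(var_s, var_sq, n_q):
    denom = var_s + var_sq / n_q
    return var_s / denom if denom > 0 else 0.0

def resample_queries(x, n_q, n_q_target, trials, rng):
    """Run the G-study on `n_q` random queries, `trials` times, and
    return the D-study estimate of E-rho^2 for `n_q_target` queries."""
    all_queries = range(len(x[0]))
    estimates = []
    for _ in range(trials):
        chosen = rng.sample(all_queries, n_q)
        subset = [[row[q] for q in chosen] for row in x]
        var_s, var_sq = variance_components(subset)
        estimates.append(e_rho2(var_s, var_sq, n_q_target))
    return estimates

# Synthetic collection (an assumption): 20 systems, 50 queries.
rng = random.Random(0)
sys_eff = [rng.gauss(0.3, 0.05) for _ in range(20)]
qry_eff = [rng.gauss(0.0, 0.10) for _ in range(50)]
scores = [[s + q + rng.gauss(0, 0.15) for q in qry_eff] for s in sys_eff]

estimates = resample_queries(scores, n_q=10, n_q_target=50, trials=200, rng=rng)
print("E-rho^2 from 10-query G-studies ranges from",
      min(estimates), "to", max(estimates))
```

The spread between the smallest and largest estimate shows how much the D-study depends on which queries happened to be in the initial data.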
Variability due to queries
We may get $E\rho^2 = 0.9$ or
$E\rho^2 = 0.3$, depending on
which 10 queries we use
Experiment (II)
The same, but vary number of systems
from $n_s = 5$ to full set
Use all queries available
200 random trials
Variability due to systems
We may get $E\rho^2 = 0.9$ or
$E\rho^2 = 0.5$, depending on
which 20 systems we use
Results
G-Theory is very sensitive to initial data
Need about 50 queries and 50 systems for
differences in $E\rho^2$ and $\Phi$ below 0.1
Number of queries for $E\rho^2 = 0.95$
may change by orders of magnitude
Microblog2011 (all 184 systems and 30 queries):
need 63 to 133 queries
Medical2011 (all 34 queries and 40 systems):
need 109 to 566 queries
Use Confidence Intervals
Bodoff’08
Confidence intervals in G-study
But what about the D-study?
Feldt’65 and Arteaga et al.’82
Work reasonably well even when
assumptions are violated (Brennan’01)
Example
Account for variability
in initial data
Required number of
queries to reach the
lower end of the interval
Summary in TREC
that is, the 43 collections we study here
$E\rho^2$: mean = 0.88, sd = 0.1
95% conf. intervals are 0.1 long
$\Phi$: mean = 0.74, sd = 0.2
95% conf. intervals are 0.19 long
interpretation of G-Theory
indicators of reliability
Experiment
Split query set in 2 subsets
from $n_q = 10$ to full set / 2
Use all runs available
Run D-study
Compute $E\rho^2$ and $\Phi$ and map onto $\tau$,
sensitivity, power, conflicts, etc.
50 random trials
>28,000 datapoints
Example: $E\rho^2 \to \tau$
$E\rho^2 = 0.95 \to \tau \approx 0.85$
$\tau = 0.9 \to E\rho^2 \approx 0.97$
[Plot labels: Million Query 2007, Million Query 2008]
*All mappings in the paper
Future Predictions
Allows us to make more informed
decisions within a collection
What about a new collection?
Fit a single model for each mapping
with 90% and 95% prediction intervals
Assess whether a larger collection
is really worth the effort
Example: $E\rho^2 \to \tau$
[Plot labels: current collection, target]
*All mappings in the paper
Example: $\Phi \to$ relative sensitivity
review of TREC collections
Outline
Estimate $E\rho^2$ and $\Phi$, with 95%
confidence intervals, and full query set
Map onto $\tau$, sensitivity, power,
conflicts, etc.
Results within task offer historical
perspective since 1994
Example: Ad Hoc 3-8
$E\rho^2 \in [0.86, 0.93] \to \tau \in [0.65, 0.81]$
minor conflicts $\in [0.6, 8.2]\%$
major conflicts $\in [0.02, 1.38]\%$
Queries to get $E\rho^2 = 0.95$: $[37, 233]$
Queries to get $\Phi = 0.95$: $[116, 999]$
50 queries were used
*All collections and mappings in the paper
Example: Web Ad Hoc
TREC-8 to TREC-2001: WT2g and WT10g
$E\rho^2 \in [0.86, 0.93] \to \tau \in [0.65, 0.81]$
Queries to get $E\rho^2 = 0.95$: $[40, 220]$
TREC-2009 to TREC-2011: ClueWeb09
$E\rho^2 \in [0.8, 0.83] \to \tau \in [0.53, 0.59]$
Queries to get $E\rho^2 = 0.95$: $[107, 438]$
50 queries were used
Historical Trend
Decreasing within and across tracks?
Systems getting better for specific problems?
Increasing task-specificity in queries?
summing up
Generalizability Theory
Regarded as more appropriate,
easy to use and powerful tool
to assess test collection reliability
Very sensitive to the initial data
used to estimate variance components
Almost impossible to interpret
in practical terms
Sensitivity of G-Theory
About 50 queries and 50 systems
are needed for robust estimates
Caution if building a new collection
Can always use confidence intervals
Interpretation of G-Theory
Empirical mapping onto traditional
indicators of reliability like 𝝉 correlation
$\tau = 0.9 \to E\rho^2 \approx 0.97$
$E\rho^2 = 0.95 \to \tau \approx 0.85$
Historical Reliability in TREC
On average, $E\rho^2 = 0.88 \to \tau \approx 0.7$
Some collections clearly unreliable
Web Distillation 2003, Genomics 2005, Terabyte 2006,
Enterprise 2008, Medical 2011 and Web Ad Hoc 2011
50 queries not enough for stable
rankings, about 200 are needed
Implications
Fixing a minimum number of queries
across tracks is unrealistic
Not even across editions of the same task
Need to analyze on a case-by-case
basis, while building the collections
to be continued…
Future Work
Study assessor effect
Study document-collection effect
Better models to map G-Theory
onto data-based indicators
We fitted theoretically correct(-ish) models,
but in practice theory does not hold
Methods to reliably measure reliability
while building the collection
Source Code Online
Code for R stats software
G-study and D-study
Required number of queries
Map onto data-based indicators
Confidence intervals
…in two simple steps
G-Theory too sensitive to initial data
Questionable with small collections
Compute confidence intervals
Need $E\rho^2 \approx 0.97$ for $\tau = 0.9$
50 queries not enough for stable rankings
Fixing a minimum number of
queries across tasks is unrealistic
Need to analyze on a case-by-case basis

 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 


On the Measurement of Test Collection Reliability

  • 1. SIGIR 2013, Dublin, Ireland · July 30th. Picture by Philip Milne. On the Measurement of Test Collection Reliability. @julian_urbano, University Carlos III of Madrid; Mónica Marrero, University Carlos III of Madrid; Diego Martín, Technical University of Madrid.
  • 3. Is System A More Effective than System B? [Figure: Δeffectiveness axis from -1 to 1, marking 0 and $d$]
  • 4. Is System A More Effective than System B? Get a test collection and evaluate. Measure the average difference $d$ and conclude which one is better.
  • 5. Samples. Test collections are samples from a larger, possibly infinite, population: documents, queries and assessors. $d$ is only an estimate. How reliable is our conclusion?
  • 6. Reliability vs. Cost. Building reliable collections is easy: just use more documents, more queries, more assessors. But it is prohibitively expensive. Our best bet is to increase query set size.
  • 7. Data-based approach. 1. Randomly split the query set. 2. Compute indicators of reliability based on those two subsets. 3. Extrapolate to larger query sets. With some variations: Voorhees’98, Zobel’98, Buckley & Voorhees’00, Voorhees & Buckley’02, Sanderson & Zobel’05, Sakai’07, Voorhees’09.
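The first two steps of the data-based approach on slide 7 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a hypothetical `scores` matrix with one row per system and one column per query, uses Kendall tau-a as the only indicator, and leaves out step 3 (extrapolation to larger query sets). All function names are made up for the example.

```python
import random
from itertools import combinations


def mean_scores(scores, queries):
    """Mean effectiveness of each system over the given query indices."""
    return [sum(row[q] for q in queries) / len(queries) for row in scores]


def kendall_tau(x, y):
    """Kendall tau-a between two score lists (tied pairs count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


def split_half_tau(scores, trials=100, seed=0):
    """Average tau between system rankings from two random query halves."""
    rng = random.Random(seed)
    n_queries = len(scores[0])
    half = n_queries // 2
    taus = []
    for _ in range(trials):
        queries = list(range(n_queries))
        rng.shuffle(queries)                      # step 1: random split
        first, second = queries[:half], queries[half:2 * half]
        taus.append(kendall_tau(mean_scores(scores, first),   # step 2
                                mean_scores(scores, second)))
    return sum(taus) / len(taus)
```

Calling `split_half_tau(scores)` on a real systems-by-queries matrix gives one number per collection; the other indicators on the following slides would replace `kendall_tau` in the same loop.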
  • 8. Data-based Reliability Indicators (based on results with two collections). Kendall $\tau$ correlation: stability of the ranking of systems. $\tau_{AP}$ correlation: adds a top-heaviness component. Absolute sensitivity: minimum absolute $d$ s.t. swaps <5%. Relative sensitivity: minimum relative $d$ s.t. swaps <5%.
  • 9. Data-based Reliability Indicators (based on results with two collections). Power ratio: statistically significant results. Minor conflict ratio: statistically non-significant swap. Major conflict ratio: statistically significant swap. RMSE: differences in $d$.
  • 10. Generalizability Theory. Directly addresses variability of scores. G-study: estimate variance components from previous, representative data. D-study: estimate reliability based on the estimated variance components.
  • 11-14. G-study. $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$, estimated using Analysis of Variance from previous data, usually an existing test collection. $\sigma_s^2$: system differences, our goal! $\sigma_q^2$: query difficulty. $\sigma_{s:q}^2$: some systems better for some queries.
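A G-study of this kind can be sketched with the textbook two-way ANOVA estimators for a fully crossed design with one score per system-query pair. This is an assumption-laden illustration rather than the authors' R code: the `scores` matrix (rows are systems, columns are queries), the function name `g_study`, and the truncation of negative estimates to zero are all choices made for the example.

```python
def g_study(scores):
    """Estimate variance components from a systems x queries score matrix."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_q)
    sys_means = [sum(row) / n_q for row in scores]
    qry_means = [sum(row[q] for row in scores) / n_s for q in range(n_q)]

    # Mean squares for systems, queries and the residual (interaction + error).
    ms_s = n_q * sum((m - grand) ** 2 for m in sys_means) / (n_s - 1)
    ms_q = n_s * sum((m - grand) ** 2 for m in qry_means) / (n_q - 1)
    ss_res = sum((scores[s][q] - sys_means[s] - qry_means[q] + grand) ** 2
                 for s in range(n_s) for q in range(n_q))
    ms_res = ss_res / ((n_s - 1) * (n_q - 1))

    # Expected-mean-square equations solved for each component;
    # negative estimates are truncated to zero (an assumption of this sketch).
    return {
        "system": max(0.0, (ms_s - ms_res) / n_q),
        "query": max(0.0, (ms_q - ms_res) / n_s),
        "interaction": ms_res,
    }
```

The three returned values correspond to $\sigma_s^2$, $\sigma_q^2$ and $\sigma_{s:q}^2$ in the decomposition on the slide.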
  • 15. D-study. Relative stability: $E\rho^2 = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_{s:q}^2 / n_q'}$. Absolute stability: $\Phi = \frac{\sigma_s^2}{\sigma_s^2 + (\sigma_q^2 + \sigma_{s:q}^2) / n_q'}$. Easy to estimate how many queries we need for a certain stability level.
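The two D-study coefficients on slide 15, and the number of queries needed for a target stability, follow directly from the variance components. The sketch below assumes the components are already estimated; the function names are hypothetical, and `queries_needed` simply solves each formula for $n_q'$, giving $n_q' = \frac{\pi \cdot \text{error}}{\sigma_s^2 (1 - \pi)}$ for a target $\pi$.

```python
import math


def e_rho2(var_s, var_sq, n_q):
    """Relative stability for a collection with n_q queries."""
    return var_s / (var_s + var_sq / n_q)


def phi(var_s, var_q, var_sq, n_q):
    """Absolute stability for a collection with n_q queries."""
    return var_s / (var_s + (var_q + var_sq) / n_q)


def queries_needed(var_s, error_var, target):
    """Smallest number of queries reaching the target stability.

    error_var is var_sq for E-rho^2, or var_q + var_sq for Phi.
    """
    n = target * error_var / (var_s * (1.0 - target))
    # Small tolerance so rounding noise does not push an exact answer up by one.
    return math.ceil(n - 1e-9)
```

For instance, `queries_needed(var_s, var_sq, 0.95)` answers the relative-stability question, while `queries_needed(var_s, var_q + var_sq, 0.95)` answers it for absolute scores, which always requires at least as many queries.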
  • 16. Generalizability Theory. Proposed by Bodoff’07. Kanoulas & Aslam’09: derive optimal gain & discount in nDCG. TREC Million Query Track: ≈80 queries sufficient for stable rankings, ≈130 queries for stable absolute scores.
  • 17. In this Paper / Talk. How sensitive is the D-study to the initial data used in the G-study? How to interpret G-theory in practice: why $E\rho^2 = 0.95$ and $\Phi = 0.95$? From the above two, review the reliability of >40 TREC test collections.
  • 19. Data. 43 TREC collections, from TREC-3 to TREC 2011. 12 tasks across 10 tracks: Ad Hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million Query, Medical and Microblog.
  • 20. Experiment. Vary the number of queries in the G-study from $n_q = 5$ to the full set. Use all runs available. Run the D-study: compute $E\rho^2$ and $\Phi$, and compute $n_q'$ to reach 0.95 stability. 200 random trials.
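The experiment on slide 20 can be approximated with a resampling loop like the one below. It is a rough sketch under several assumptions: `scores` is a hypothetical systems-by-queries matrix, only $E\rho^2$ is tracked, the D-study is always projected to a fixed `n_q_target`, and negative system-variance estimates are reported as zero stability. None of these details are confirmed by the slides.

```python
import random


def e_rho2_from_queries(scores, queries, n_q_target):
    """G-study on a query subset, then E-rho^2 projected to n_q_target queries."""
    sub = [[row[q] for q in queries] for row in scores]
    n_s, n_q = len(sub), len(queries)
    grand = sum(map(sum, sub)) / (n_s * n_q)
    sys_means = [sum(row) / n_q for row in sub]
    qry_means = [sum(row[j] for row in sub) / n_s for j in range(n_q)]
    ms_s = n_q * sum((m - grand) ** 2 for m in sys_means) / (n_s - 1)
    ss_res = sum((sub[i][j] - sys_means[i] - qry_means[j] + grand) ** 2
                 for i in range(n_s) for j in range(n_q))
    ms_res = ss_res / ((n_s - 1) * (n_q - 1))
    var_s = max(0.0, (ms_s - ms_res) / n_q)
    if var_s == 0.0:
        return 0.0  # no detectable system differences in this sample
    return var_s / (var_s + ms_res / n_q_target)


def spread_over_trials(scores, n_q, n_q_target, trials=200, seed=0):
    """Smallest and largest E-rho^2 over random query subsets of size n_q."""
    rng = random.Random(seed)
    n_total = len(scores[0])
    estimates = [
        e_rho2_from_queries(scores, rng.sample(range(n_total), n_q), n_q_target)
        for _ in range(trials)
    ]
    return min(estimates), max(estimates)
```

A wide gap between the two returned values for small `n_q` is the kind of sensitivity to initial data that the following slides report.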
  • 22. Variability due to queries. We may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.3$, depending on which 10 queries we use.
  • 23. Experiment (II). The same, but vary the number of systems from $n_s = 5$ to the full set. Use all queries available. 200 random trials.
  • 25. Variability due to systems. We may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.5$, depending on which 20 systems we use.
  • 26. Results. G-Theory is very sensitive to initial data. Need about 50 queries and 50 systems for differences in $E\rho^2$ and $\Phi$ below 0.1. The number of queries for $E\rho^2 = 0.95$ may change by orders of magnitude. Microblog2011 (all 184 systems and 30 queries): need 63 to 133 queries. Medical2011 (all 34 queries and 40 systems): need 109 to 566 queries.
  • 27. Use Confidence Intervals. Bodoff’08: confidence intervals in G-study. But what about the D-study? Feldt’65 and Arteaga et al.’82. Work reasonably well even when assumptions are violated (Brennan’01).
  • 31. Example. Required number of queries to reach the lower end of the interval.
  • 32. Summary in TREC (that is, the 43 collections we study here). $E\rho^2$: mean = 0.88, sd = 0.1; 95% conf. intervals are 0.1 long. $\Phi$: mean = 0.74, sd = 0.2; 95% conf. intervals are 0.19 long.
  • 34. Experiment. Split the query set in 2 subsets, from $n_q = 10$ to full set / 2. Use all runs available. Run the D-study: compute $E\rho^2$ and $\Phi$ and map onto $\tau$, sensitivity, power, conflicts, etc. 50 random trials. >28,000 datapoints.
  • 35. Example: $E\rho^2 \to \tau$. *All mappings in the paper
  • 36. Example: $E\rho^2 \to \tau$. $E\rho^2 = 0.95 \to \tau \approx 0.85$
  • 37. Example: $E\rho^2 \to \tau$. $\tau = 0.9 \to E\rho^2 \approx 0.97$
  • 38. Example: $E\rho^2 \to \tau$. Million Query 2007, Million Query 2008
  • 39. Future Predictions. Allows us to make more informed decisions within a collection. What about a new collection? Fit a single model for each mapping, with 90% and 95% prediction intervals. Assess whether a larger collection is really worth the effort.
  • 40. Example: $E\rho^2 \to \tau$. *All mappings in the paper
  • 41. Example: $E\rho^2 \to \tau$ (current collection)
  • 42. Example: $E\rho^2 \to \tau$ (current collection, target)
  • 43. Example: $\Phi \to$ rel. sensitivity
  • 44. Example: $\Phi \to$ rel. sensitivity
  • 45. review of TREC collections
  • 46. Outline. Estimate $E\rho^2$ and $\Phi$, with 95% confidence intervals, using the full query set. Map onto $\tau$, sensitivity, power, conflicts, etc. Results within task offer historical perspective since 1994.
  • 47. Example: Ad Hoc 3-8. $E\rho^2 \in [0.86, 0.93] \to \tau \in [0.65, 0.81]$. Minor conflicts $\in [0.6, 8.2]\%$. Major conflicts $\in [0.02, 1.38]\%$. Queries to get $E\rho^2 = 0.95$: $[37, 233]$. Queries to get $\Phi = 0.95$: $[116, 999]$. 50 queries were used. *All collections and mappings in the paper
  • 48. Example: Web Ad Hoc. TREC-8 to TREC-2001 (WT2g and WT10g): $E\rho^2 \in [0.86, 0.93] \to \tau \in [0.65, 0.81]$; queries to get $E\rho^2 = 0.95$: $[40, 220]$. TREC-2009 to TREC-2011 (ClueWeb09): $E\rho^2 \in [0.8, 0.83] \to \tau \in [0.53, 0.59]$; queries to get $E\rho^2 = 0.95$: $[107, 438]$. 50 queries were used.
  • 50. Historical Trend. Systems getting better for specific problems?
  • 53. Generalizability Theory. Regarded as a more appropriate, easy to use and powerful tool to assess test collection reliability. Very sensitive to the initial data used to estimate variance components. Almost impossible to interpret in practical terms.
  • 54. Sensitivity of G-Theory. About 50 queries and 50 systems are needed for robust estimates. Caution if building a new collection. Can always use confidence intervals.
  • 55. Interpretation of G-Theory. Empirical mapping onto traditional indicators of reliability like $\tau$ correlation: $\tau = 0.9 \to E\rho^2 \approx 0.97$; $E\rho^2 = 0.95 \to \tau \approx 0.85$.
  • 56. Historical Reliability in TREC. On average, $E\rho^2 = 0.88 \to \tau \approx 0.7$. Some collections are clearly unreliable: Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011. 50 queries are not enough for stable rankings; about 200 are needed.
  • 57. Implications. Fixing a minimum number of queries across tracks is unrealistic, not even across editions of the same task. Need to analyze on a case-by-case basis, while building the collections.
  • 59. Future Work. Study assessor effect. Study document-collection effect. Better models to map G-Theory onto data-based indicators: we fitted theoretically correct(-ish) models, but in practice theory does not hold. Methods to reliably measure reliability while building the collection.
  • 60. Source Code Online. Code for R stats software: G-study and D-study, required number of queries, map onto data-based indicators, confidence intervals. ..in two simple steps.
  • 61. G-Theory too sensitive to initial data. Questionable with small collections. Compute confidence intervals. Need $E\rho^2 \approx 0.97$ for $\tau = 0.9$. 50 queries not enough for stable rankings. Fixing a minimum number of queries across tasks is unrealistic. Need to analyze on a case-by-case basis.