Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Presentation Transcript

    • Validity and Reliability of Cranfield-like Evaluation in Information Retrieval Julián Urbano Picture by Tom Parnell Glasgow, Scotland · September 2013
• Talk outline • Why we want to Evaluate… • …and what we do with Cranfield • Validity: users versus systems • Reliability: estimating from samples
• Why we want to Evaluate
    • The two questions • How good is my system? – What does good mean? – What is good enough? • Is system A better than system B? – What does better mean? – How much better? • Efficiency? Effectiveness? Ease?
    • Measure user experience • Time to complete task • Idle time • Success rate • Failure rate • Frustration • Ease to learn • Ease to use …and a long etcetera
• We want to know some distributions • For an arbitrary user, need and document collection, what is the distribution of: [plots: distribution of time to complete the task; distribution of frustration, from none to much] • They describe user experience, fully
    • The big(ger) picture • Different user-measures attempting to assess the same thing: user satisfaction – How likely is it that an arbitrary user, with an arbitrary need (and with an arbitrary document collection) will be satisfied by the system? • This is the ultimate goal: the good, the better
• The big(ger) question • User satisfaction…as Bernoulli trial • Probability of satisfaction? • Probability that k in n users are satisfied? • Probability of >80% users satisfied? [plot: satisfaction, yes/no]
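As a rough illustration of this Bernoulli/Binomial framing, a minimal sketch (the satisfaction probability p is a made-up assumption; in practice it is exactly what a test collection would have to estimate):

```python
from scipy.stats import binom

p = 0.7    # assumed probability that an arbitrary user is satisfied
n = 100    # number of users

# Probability that exactly k out of n users are satisfied
print(binom.pmf(80, n, p))

# Probability that more than 80% of the n users are satisfied
print(binom.sf(0.8 * n, n, p))
```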
• What we do with Cranfield
    • Sources of variability user-measure = f(documents, need, user, system) • Try to estimate the user-measure distribution – Sample documents, needs and users – Problematic • Representativeness • Cost • Ethics – Hard to replicate and repeat results
    • Fix samples • Get a (hopefully) good sample and fix it – Document collection – Topic set – A step towards reproducibility • Still have to sample users, but can’t fix them! – Very large source of variability – Hard to replicate and repeat experiments – Complex, costly, ethical issues – Example: ASTIA-Uniterm studies
• Simulate users…and fix them • Cleverdon’s idea: remove users, but include a static user component, fixed across experiments – The judgments in the ground truth • Remove all sources of variability, except systems: user-measure = f(documents, need, user, system) → user-measure = f(system)
    • Test collections user-measure = f(system) • Test collections are tools to estimate distributions of user-measures – Reproducibility becomes possible and easy – Experiments are inexpensive (collections are not) – Research becomes systematic
    • Wait a minute • Are we estimating distributions about users or distributions about systems? system-effectiveness = f(system, measure) • We come up with different distributions of system-effectiveness, one per measure • Each measure has its own assumptions
• Assumption • System-measures correspond to user-measures – User-measures: time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use, satisfaction, … – System-measures: P, AP, RR, DCG, nDCG, ERR, GAP, Q, …
    • Assumption • Well, at least we assume the correlation – Are they correlated? How well? • Test collections: estimators of user distributions – What we want to measure: user satisfaction – What we do measure: system effectiveness
    • Validity and Reliability • Validity: are we measuring what we want to? – External validity: Are topics, documents and assessors representative? – Construct validity: Do system-measures correspond to user-measures? – Conclusion validity: Is system A really better than system B? • Reliability: how repeatable are the results? – How large do collections have to be to ensure repeatability with a different sample?
    • Validity
    • Assumption • Systems with better effectiveness are perceived by users as more useful, more satisfactory • Tricky: different effectiveness measures and relevance scales give different results – Which one is better to predict satisfaction? • The goal is user satisfaction, not system effectiveness
    • Mapping • Try to map system effectiveness onto user satisfaction, experimentally • If P@10 = 0.2, how likely is it that the user will find the results satisfactory? • What if DCG@20 = 0.467? • What if ERR = 0.9?
    • User-oriented System-measures • Effectiveness measures are generally not formulated to correlate with user-satisfaction • If effectiveness is 0, we expect 0% probability of user satisfaction • If effectiveness is 1, we expect 100% probability • If effectiveness is 𝜆, we expect 100𝜆% • But this is not what we have
• Unbounded measures $DCG@k = \sum_{i=1}^{k} \frac{gain(r_i)}{discount(i)}$ • Upper bound depends on cutoff, gain function and relevance scale – Normalize effectiveness between 0 and 1 – What is the best we can do with $k$ documents? $DCG@k = \frac{\sum_{i=1}^{k} gain(r_i)/discount(i)}{\sum_{i=1}^{k} gain(r_i^*)/discount(i)}$
• Recall-oriented measures $AP@k = \frac{1}{|\mathcal{R}_1|} \sum_{i=1}^{k} r_i \cdot P@i$ • $AP@k = 1$ only possible if $k \ge |\mathcal{R}_1|$ • Reformulate towards users – What is the best we can do with $k$ documents, regardless of the judgments in the ground truth? $AP@k = \frac{1}{k} \sum_{i=1}^{k} r_i \cdot P@i$
• Ideal ranking $nDCG@k = \frac{\sum_{i=1}^{k} gain(r_i)/discount(i)}{\sum_{i=1}^{k} gain(ideal_i)/discount(i)}$ • If there is only one relevant, $nDCG@10 = 1$ even if we retrieve nine nonrelevants • Assume the ideal ranking has only excellent documents, with maximum relevance: $nDCG@k = \frac{\sum_{i=1}^{k} gain(r_i)/discount(i)}{\sum_{i=1}^{k} gain(r_i^*)/discount(i)}$ • This is basically user-oriented $DCG@k$
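A minimal sketch of these user-oriented reformulations, following the summation forms above (the log-based discount, linear gain and the example relevance values are assumptions, not the exact configuration used in the study):

```python
import math

def dcg(rels, gain=lambda r: r, discount=lambda i: math.log2(i + 1)):
    """Plain DCG@k over a ranked list of relevance values (1-based ranks)."""
    return sum(gain(r) / discount(i) for i, r in enumerate(rels, start=1))

def user_oriented_dcg(rels, max_rel):
    """Normalize by the best possible ranking of k documents at maximum
    relevance, instead of by the ideal ranking from the ground truth."""
    return dcg(rels) / dcg([max_rel] * len(rels))

def user_oriented_ap(binary_rels):
    """AP@k divided by k rather than by the number of relevant documents."""
    k = len(binary_rels)
    p_at = lambda i: sum(binary_rels[:i]) / i
    return sum(r * p_at(i) for i, r in enumerate(binary_rels, start=1)) / k

# Example: top-5 results on a 3-level scale (0, 1, 2), maximum relevance 2
print(user_oriented_dcg([2, 1, 0, 2, 1], max_rel=2))   # ~0.66
print(user_oriented_ap([1, 1, 0, 1, 0]))               # 0.55
```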
    • Audio Music Similarity • Song as input to system, audio signal • Retrieve songs musically similar to it, by content • Resembles traditional Ad Hoc retrieval in Text IR • (most?) Important task in Music IR – Music recommendation – Playlist generation – Plagiarism detection
    • Measures • All reformulated, user-oriented – What is the best we can do under the user model? • Binary – P, AP, RR • Graded – CG, DCG, Q, RBP, ERR, GAP, ADR , EDCG – Linear and exponential gains
• Relevance scales • Originally used – Broad: 3 levels – Fine: 101 levels • Artificially made from the Fine scale – Graded with 3, 4 and 5 levels, evenly spaced – Binary, with thresholds at 20, 40, 60 and 80
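A small sketch of how such artificial scales can be derived from a Fine score in [0, 100]; the evenly spaced levels and thresholds come from the slide, but the exact binning rule is an assumption:

```python
def to_graded(fine, n_levels):
    """Map a Fine score in [0, 100] to one of n_levels evenly spaced levels."""
    width = 100 / n_levels
    return min(int(fine // width), n_levels - 1)

def to_binary(fine, threshold):
    """Binary relevance: 1 if the Fine score reaches the threshold."""
    return int(fine >= threshold)

print(to_graded(67, n_levels=3))                      # -> 2
print([to_binary(67, t) for t in (20, 40, 60, 80)])   # -> [1, 1, 1, 0]
```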
• Measures and Scales – Each measure at cutoff 5 is computed on the original scales (Broad, Fine), the artificial graded scales ($n_\mathcal{L}$ = 3, 4, 5) and the artificial binary scales ($\ell_{min}$ = 20, 40, 60, 80), where applicable – Binary measures ($P@5$, $AP@5$, $RR@5$) are computed only on the binary scales – On binary scales several measures collapse: $CG@5$ reduces to $P@5$; $Q@5$ and $GAP@5$ reduce to $AP@5$; exponential-gain variants ($DCG_e@5$, $EDCG_e@5$, $RBP_e@5$, $ERR_e@5$) reduce to their linear-gain counterparts
• Experimental Design [figures: users compare the results of two systems] – User preference: agrees or disagrees with effectiveness – Non-preference: can’t decide
    • What can we infer? • Preference: difference noticed by user – Positive: user agrees with evaluation – Negative: user disagrees with evaluation • Non-preference: difference not noticed by user – Good: both systems are satisfactory – Bad: both systems are not satisfactory
• Data • Queries, documents and judgments from MIREX – MIREX: TREC-like evaluation forum in Music IR • 4,115 unique and artificial examples – Covering the full range of effectiveness • In 10 bins: [0, 0.1), [0.1, 0.2), …, [0.9, 1] – At least 200 examples per measure/scale/bin • 432 unique queries, 5,636 unique documents
    • Collecting User Preferences • Crowdsourcing – Quality control through trap examples • Total: 547 unique subjects, 11,042 preferences • Accepted: 175 subjects, 9,373 preferences • After trap questions: 113 subjects
• Single system: how good is it? • 2,045 non-preferences (49%) – 1,056 satisfactory – 969 non-satisfactory • What do we expect? A linear mapping from effectiveness to probability of satisfaction
    • Single system: how good is it? Large thresholds underestimate satisfaction
    • Single system: how good is it? Ranking does not affect satisfaction?
    • Single system: how good is it? Exponential gain underestimates satisfaction
    • Single system: how good is it? • Best adhere to the diagonal – 𝐶𝐺𝑙@5, 𝐷𝐶𝐺𝑙@5 and 𝑅𝐵𝑃𝑙@5 – Not necessarily better: just easier to interpret • About 20% bias at endpoints – Room for improvement with personalization • Less sensitive to subjectivity in relevance – Minimize 𝑃(𝑆𝑎𝑡│0) and maximize 𝑃(𝑆𝑎𝑡│1) – ℓ 𝑚𝑖𝑛 = 40 and 𝐵𝑟𝑜𝑎𝑑 behave better – 𝐶𝐺@5, 𝐷𝐶𝐺@5, 𝑅𝐵𝑃@5 and 𝐺𝐴𝑃@5
• Two systems: which one is better? • 2,090 preferences (51%) – 1,019 for system A – 1,071 for system B • What do we expect? That users always notice the difference… regardless of how large it is
    • Two systems: which one is better? Need quite large differences!
    • Two systems: which one is better? More relevance levels better to discriminate
    • Two systems: which one is better? Bad correlation?
    • Two systems: which one is better? • Users prefer the (supposedly) worse system
• User Agrees with Evaluation • Closer to ideal $P(Agg = 1 \mid \Delta\lambda) = 1$ – $\ell_{min} = 80$ better among binaries – Fine better for linear gain – $n_\mathcal{L} = 5$ better for exponential gain – $CG@5$, $DCG@5$, $RBP@5$ and $GAP@5$
• User Disagrees with Evaluation • Closer to ideal $P(Agg = -1 \mid \Delta\lambda) = 0$ – $\ell_{min} = 40$ better among binaries – Fine better for linear gain – Broad better with exponential gain – $CG@5$, $GAP@5$, $DCG@5$ and $RBP@5$
    • Summary • Linear gain better than exponential gain – Except, slightly, in terms of disagreements • Measures oriented to a single document are not appropriate for a music recommendation setting • Gain is independent of other documents • 𝐵𝑟𝑜𝑎𝑑 better to predict satisfaction • 𝐹𝑖𝑛𝑒 better to predict user agreement • Binary scales worst overall
    • Summary • We can map system effectiveness onto probability of user satisfaction • ~20% of users disagree with effectiveness – Practical upper (and lower) bound in evaluation – Need to incorporate user profiles • Somehow included in MSD Challenge • Δ𝜆 ≈ 0.4 needed for users to agree – Historically observed only 20% of times in MIREX – Be careful with statistical significance!
• Satisfaction over samples
• User Satisfaction • So far only for a query and a user (Bernoulli) – $P(Sat \mid \lambda_q)$ • Easily for $n$ users (Binomial) – $P(Sat_n = k \mid \lambda_q)$ • Example: $Q_l@5 = 0.61$ – $P(Sat) \approx 0.7$ – $P(Sat_{15} = 10) \approx 0.21$ • What about a sample of queries $\mathcal{Q}$?
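The worked example above can be reproduced in a couple of lines; the 0.7 comes from reading $Q_l@5 = 0.61$ off the estimated satisfaction mapping:

```python
from scipy.stats import binom

p_sat = 0.7                        # P(Sat | lambda_q) for this query
print(binom.pmf(10, 15, p_sat))    # P(Sat_15 = 10) ~= 0.21
```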
• User Satisfaction over a Sample $E[P(Sat)] = \frac{1}{n_\mathcal{Q}} \sum_{q \in \mathcal{Q}} P(Sat \mid \lambda_q)$ • Example: satisfaction is underestimated
• System Success • If $P(Sat) \ge threshold$, the system is successful • If we want the majority of users to be satisfied – $P(Succ) = 1 - F_{P(Sat)}(0.5)$ • Intuition: improving bad queries is more worthwhile than further improving good ones
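A minimal sketch of this success criterion over a query sample, assuming we already have per-query satisfaction probabilities (the values below are made up for illustration):

```python
import numpy as np

# Hypothetical P(Sat | lambda_q) for a sample of queries
p_sat = np.array([0.9, 0.8, 0.35, 0.6, 0.45, 0.95, 0.2, 0.7])

# Average satisfaction over the sample...
print("E[P(Sat)] =", p_sat.mean())

# ...versus success, P(Succ) = 1 - F_{P(Sat)}(0.5): the fraction of queries
# where the majority of users would be satisfied (empirical CDF at 0.5)
print("P(Succ)  =", (p_sat > 0.5).mean())
```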
• System Success • Example: – $E[\Delta\lambda] = -0.0021$ – $E[\Delta P(Sat)] = 0.0011$ – $E[\Delta P(Succ)] = 0.07$
    • Summary • Need to consider full distributions – Always average or good on average? • Modeling full distribution – Normal for small query sets, Empirical for large – Beta always better for 𝐹𝑖𝑛𝑒 scale
    • Summary • Intuitive interpretations of effectiveness fail – Contradictory results in terms of user satisfaction
    • Reliability
• Samples • Test collections are samples from larger, possibly infinite, populations – Documents, queries and users • $\Delta\lambda$ is just an estimate of the population mean $\mu_{\Delta\lambda}$ • How reliable is our conclusion?
    • Reliability vs Cost • Building reliable collections is easy • Just use more documents, queries and assessors • But it is prohibitively expensive • Best option is to increase query set size – Largest source of variability • How many queries? – First we need to measure reliability
    • Data-based approach 1. Randomly split query set 2. Compute indicators of reliability based on these two query subsets 3. Extrapolate to larger query sets …with some variations
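A minimal sketch of this split-half idea, using Kendall’s tau as the indicator in step 2 (the effectiveness matrix is a random stand-in for real per-query scores):

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
effectiveness = rng.random((20, 50))   # hypothetical systems x queries scores

def split_half_tau(scores, rng):
    """1. Randomly split the query set; 2. correlate the two system rankings."""
    n_q = scores.shape[1]
    perm = rng.permutation(n_q)
    half_a, half_b = perm[: n_q // 2], perm[n_q // 2:]
    mean_a = scores[:, half_a].mean(axis=1)   # mean effectiveness per system
    mean_b = scores[:, half_b].mean(axis=1)
    return kendalltau(mean_a, mean_b).correlation

taus = [split_half_tau(effectiveness, rng) for _ in range(100)]
print("mean split-half Kendall tau:", np.mean(taus))
```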
    • Data-based reliability indicators • Compare results with two collections – Kendall tau correlation – AP correlation – Absolute sensitivity – Relative sensitivity – Power ratio – Minor conflict ratio – Major conflict ratio – RMSE
    • Generalizability Theory approach • Address variability of scores, not just means • G-study – Estimate variance components from previous, representative data – Usually previous test collections • D-study – Estimate reliability based on estimated variance components from G-study
• G-study $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$ • Estimated with Analysis of Variance – $\sigma_s^2$: system differences, our goal! – $\sigma_q^2$: query difficulty – $\sigma_{s:q}^2$: some systems better for some queries
• D-study • Relative stability: $E\rho^2 = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_{s:q}^2 / n_q'}$ • Absolute stability: $\Phi = \frac{\sigma_s^2}{\sigma_s^2 + (\sigma_q^2 + \sigma_{s:q}^2) / n_q'}$ • Easy to estimate how many queries we need to reach a certain stability level (1MQ track) – ≈80 queries sufficient for stable rankings – ≈130 queries for stable absolute scores
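A minimal sketch of the G-study and D-study computations for a fully crossed system × query design with one observation per cell, using the standard expected-mean-squares estimators from ANOVA (the score matrix below is a random stand-in, not real TREC data):

```python
import numpy as np

def variance_components(scores):
    """G-study: estimate sigma_s^2, sigma_q^2, sigma_{s:q}^2 from an
    n_s x n_q matrix of per-query effectiveness scores."""
    n_s, n_q = scores.shape
    grand = scores.mean()
    s_means, q_means = scores.mean(axis=1), scores.mean(axis=0)
    ms_s = n_q * np.sum((s_means - grand) ** 2) / (n_s - 1)
    ms_q = n_s * np.sum((q_means - grand) ** 2) / (n_q - 1)
    resid = scores - s_means[:, None] - q_means[None, :] + grand
    ms_sq = np.sum(resid ** 2) / ((n_s - 1) * (n_q - 1))
    var_sq = ms_sq
    var_s = max((ms_s - ms_sq) / n_q, 0.0)   # negative estimates set to 0
    var_q = max((ms_q - ms_sq) / n_s, 0.0)
    return var_s, var_q, var_sq

def d_study(var_s, var_q, var_sq, n_q_prime):
    """D-study: relative (E rho^2) and absolute (Phi) stability for n_q' queries."""
    e_rho2 = var_s / (var_s + var_sq / n_q_prime)
    phi = var_s / (var_s + (var_q + var_sq) / n_q_prime)
    return e_rho2, phi

def queries_needed(var_s, var_q, var_sq, target=0.95, absolute=False):
    """Queries required to reach a target E rho^2 (or Phi if absolute=True)."""
    noise = (var_q + var_sq) if absolute else var_sq
    return int(np.ceil(target / (1 - target) * noise / var_s))

# Synthetic systems x queries matrix with system, query and interaction effects
rng = np.random.default_rng(0)
n_s, n_q = 30, 50
scores = (0.4 + 0.1 * rng.normal(size=(n_s, 1))        # system effect
              + 0.1 * rng.normal(size=(1, n_q))        # query difficulty
              + 0.15 * rng.normal(size=(n_s, n_q)))    # interaction + error

vs, vq, vsq = variance_components(scores)
print(d_study(vs, vq, vsq, n_q_prime=n_q))
print(queries_needed(vs, vq, vsq, target=0.95))
```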
    • G-Theory approach • How sensitive is the D-study to the initial data used in the G-study? • How should we interpret G-Theory indicators in practice? What does 𝐸𝜌2 = 0.95 mean? • From the above, review reliability of over 40 TREC test collections
    • Data • 43 TREC collections – From TREC 3 to TREC 2011 • 12 tasks across 10 tracks – Ad hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million Query, Medical and Microblog
• Sensitivity: experiment • Vary number of queries in G-study – From $n_q = 5$ to the full set – Use all runs available • Run D-study – Compute $E\rho^2$ and $\Phi$ – Compute $n_q'$ to reach 0.95 stability • 200 random trials
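A sketch of that resampling experiment, reusing the hypothetical variance_components and d_study helpers (and the synthetic scores matrix) from the G-study/D-study sketch above:

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 200

for n_q_sub in (5, 10, 20, 30, 40, 50):
    e_rho2_samples = []
    for _ in range(trials):
        # Draw a random subset of queries, keep all systems, rerun the D-study
        cols = rng.choice(scores.shape[1], size=n_q_sub, replace=False)
        vs, vq, vsq = variance_components(scores[:, cols])
        e_rho2, _ = d_study(vs, vq, vsq, n_q_prime=n_q_sub)
        e_rho2_samples.append(e_rho2)
    print(n_q_sub, "E rho^2 range:",
          round(min(e_rho2_samples), 2), "-", round(max(e_rho2_samples), 2))
```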
• Variability due to queries – We may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.3$, depending on which 10 queries we use
    • Sensitivity: experiment • Do the same, but vary number of systems – From 𝑛 𝑠 = 5 to full set – Use all queries available • 200 random trials
• Variability due to systems – We may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.5$, depending on which 20 systems we use
• Results • G-Theory is very sensitive to the initial data – Need about 50 queries and 50 systems for differences in $E\rho^2$ and $\Phi$ below 0.1 • The number of queries needed for $E\rho^2 = 0.95$ may vary by orders of magnitude – Microblog 2011 (all 184 systems and 30 queries): need 63 to 133 queries – Medical 2011 (all 34 queries and 40 systems): need 109 to 566 queries
• Compute confidence intervals – Account for variability in the initial data – Required number of queries to reach the lower end of the interval
    • Summary in TREC • 𝐸𝜌2 : mean=0.88 sd=0.1 – 95% conf. intervals are 0.1 long • Φ: mean=0.74 sd=0.2 – 95% conf. intervals are 0.19 long
• Interpretation: experiment • Split the query set in 2 subsets – From $n_q = 10$ to half the full set – Use all runs available • Run D-study – Compute $E\rho^2$ and $\Phi$ and map onto $\tau$, sensitivity, power, conflicts, etc. • 50 random trials – Over 28,000 datapoints
• Example: $E\rho^2 \to \tau$ (*all mappings in the paper) – $E\rho^2 = 0.95 \to \tau \approx 0.85$ – $\tau = 0.9 \to E\rho^2 \approx 0.97$ – Shown for Million Query 2007 and Million Query 2008
    • Future predictions • This allows us to make more informed decisions within a collection • What about a new collection? – Fit a single model for each mapping with 90% and 95% prediction intervals • Assess whether a larger collection is really worth the effort
• Example: $E\rho^2 \to \tau$ with prediction intervals, from the current collection to a target (*all mappings in the paper)
• Example: $\Phi \to$ relative sensitivity
• Summary • G-Theory is regarded as more appropriate, easier to use and more powerful for assessing reliability than the traditional data-based approaches • But it is quite sensitive to the initial data used to estimate variance components – Data-based approaches are too! • And it is almost impossible to interpret in practice
    • Summary • Need about 50 queries and 50 systems to have robust estimates of reliability – That is a whole collection already! – Need to use confidence intervals • Previous interpretation overestimated reliability – 𝜏 = 0.9 → 𝐸𝜌2 ≈ 0.97 – 𝐸𝜌2 = 0.95 → 𝜏 ≈ 0.85
• Reliability: review of TREC collections
    • Outline • Estimate 𝐸𝜌2 and Φ, with 95% confidence intervals, and full query set • Map onto 𝜏, sensitivity, power, conflicts, etc. • Results within tasks offer a historical perspective on reliability since 1994
• Example: Ad hoc 3-8 (*all collections and mappings in the paper) • $E\rho^2 \in [0.86, 0.93] \to \tau \in [0.65, 0.81]$ • minor conflicts $\in [0.6, 8.2]\%$ • major conflicts $\in [0.02, 1.38]\%$ • Queries to get $E\rho^2 = 0.95$: [37, 233] • Queries to get $\Phi = 0.95$: [116, 999] • 50 queries were used
• Example: Web ad hoc • TREC-8 to TREC 2001: WT2g and WT10g – $E\rho^2 \in [0.86, 0.93] \to \tau \in [0.65, 0.81]$ – Queries to get $E\rho^2 = 0.95$: [40, 220] • TREC 2009 to TREC 2011: ClueWeb09 – $E\rho^2 \in [0.80, 0.83] \to \tau \in [0.53, 0.59]$ – Queries to get $E\rho^2 = 0.95$: [107, 438] • 50 queries were used
• Historical trend • Decreasing within and across tracks? • Systems getting better for specific problems? • Increasing task-specificity in queries?
    • Historical reliability in TREC • On average, 𝐸𝜌2 = 0.88 → 𝜏 ≈ 0.7 • Some collections clearly unreliable – Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011 • 50 queries not enough for stable rankings, about 200 are needed in most cases
    • Implications • Fixing a minimum number of queries across tracks is unrealistic – Not even across editions of the same task • Need to analyze on a case-by-case basis, while building the collections – GT4IReval, R package online
• Current and future work
    • Validity • Similar studies in Text IR to map effectiveness onto user satisfaction • Particularly interesting because there are several query types, and users behave differently – Single measure to use in all cases? – Use different measures and average them all? • Further user studies to figure out what makes users say good and better • How should test collections be extended to incorporate more user information?
    • Reliability • Study assessor effect • Study document collection effect • Better models to map G-theory indicators onto understandable data-based indicators • Methods to reliably measure reliability while building the collection
    • References
    • General • Cleverdon, C. W. (1991). The Significance of the Cranfield Tests on Index Languages. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 3–12). • Sanderson, M. (2010). Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval, 4(4), 247–375. • Robertson, S. (2008). On the History of Evaluation in IR. Journal of Information Science, 34(4), 439–456. • Harman, D. K. (2011). Information Retrieval Evaluation. Synthesis Lectures on Information Concepts, Retrieval, and Services, 3(2), 1–119. • Voorhees, E. M. (2002). The Philosophy of Information Retrieval Evaluation. In Workshop of the Cross-Language Evaluation Forum (pp. 355–370). • Tague-Sutcliffe, J. (1992). The Pragmatics of Information Retrieval Experimentation, Revisited. Information Processing and Management, 28(4), 467–490. • Gull, C. D. (1956). Seven Years of Work on the Organisation of Materials in a Special Library. American Documentation, 7(4), 320–329. • Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in Music Information Retrieval. Journal of Intelligent Information Systems. • Urbano, J. (2013). Evaluation in Audio Music Similarity. PhD dissertation, University Carlos III of Madrid. • Trochim, W. M. K., & Donnelly, J. P. (2007). The Research Methods Knowledge Base (3rd ed.). Atomic Dog Publishing. • Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton-Mifflin. • Zobel, J., Webber, W., Sanderson, M., & Moffat, A. (2011). Principles for Robust Evaluation Infrastructure. In ACM CIKM Workshop on Data infrastructures for Supporting Information Retrieval Evaluation.
    • Validity • Allan, J., Carterette, B., & Lewis, J. (2005). When Will Information Retrieval Be “Good Enough”? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 433–440). • Al-Maskari, A., Sanderson, M., & Clough, P. (2007). The Relationship between IR Effectiveness Measures and User Satisfaction. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 773–774). • Al-Maskari, A., Sanderson, M., Clough, P., & Airio, E. (2008). The Good and the Bad System: Does the Test Collection Predict User’s Effectiveness. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 59–66). • Bailey, P., Craswell, N., Soboroff, I., Thomas, P., Vries, A. P. de, & Yilmaz, E. (2008). Relevance Assessment: Are Judges Exchangeable and Does it Matter? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 667–674). • Bennett, P. N., Carterette, B., Chapelle, O., & Joachims, T. (2008). Beyond Binary Relevance: Preferences, Diversity and Set-Level Judgments. ACM SIGIR Forum, 42(2), 53–58. • Carterette, B. (2011). System Effectiveness, User Models, and User Utility: A General Framework for Investigation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 903–912). • Carterette, B., Bennett, P. N., Chickering, D. M., & Dumais, S. T. (2008). Here or There: Preference Judgments for Relevance. In European Conference on Information Retrieval (pp. 16–27). • Carterette, B., & Soboroff, I. (2010). The Effect of Assessor Error on IR System Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 539–546). • Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., & Olson, D. (2000). Do Batch and User Evaluations Give the Same Results? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 17–24).
    • Validity • Hersh, W., Turpin, A., Sacherek, L., Olson, D., Price, S., Chan, B., & Kraemer, D. (2000). Further Analysis of Whether Batch and User Evaluations Give the Same Results With a Question-Answering Task. In Text REtrieval Conference. • Hu, X., & Kando, N. (2012). User-Centered Measures vs. System Effectiveness in Finding Similar Songs. In International Society for Music Information Retrieval Conference (pp. 331–336). • Huffman, S. B., & Hochster, M. (2007). How Well does Result Relevance Predict Session Satisfaction? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 567–573). • Ingwersen, P., & Järvelin, K. (2005). The Turn: Integration of Information Seeking and Retrieval in Context. Springer. • Järvelin, K. (2011). IR Research: Systems, Interaction, Evaluation and Theories. ACM SIGIR Forum, 45(2), 17–31. • Mizzaro, S. (1997). Relevance: The Whole History. Journal of the American Society for Information Science, 48(9), 810–832. • Sanderson, M., Paramita, M. L., Clough, P., & Kanoulas, E. (2010). Do User Preferences and Evaluation Measures Line Up? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 555– 562). • Schedl, M., Flexer, A., & Urbano, J. (2013). The Neglected User in Music Information Retrieval Research. Journal of Intelligent Information Systems. • Schedl, M., Stober, S., Gómez, E., Orio, N., & Liem, C. C. S. (2012). User-Aware Music Retrieval. In M. Müller, M. Goto, & M. Schedl (Eds.), Multimodal Music Processing (pp. 135–156). Dagstuhl Publishing. • Scholer, F., & Turpin, A. (2008). Relevance Thresholds in System Evaluations. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 693–694).
    • Validity • Smucker, M. D., & Clarke, C. L. A. (2012). The Fault, Dear Researchers, is Not in Cranfield, But in Our Metrics, that They Are Unrealistic. In European Workshop on Human-Computer Interaction and Information Retrieval (pp. 11– 12). • Thom, J. A., & Scholer, F. (2007). A Comparison of Evaluation Measures Given How Users Perform on Search Tasks. In Australasian Document Computing Symposium (pp. 100–103). • Turpin, A., & Hersh, W. (2001). Why Batch and User Evaluations Do Not Give the Same Results. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 225–231). • Turpin, A., & Hersh, W. (2002). User Interface Effects in Past Batch Versus User Experiments. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 431–432). • Turpin, A., & Scholer, F. (2006). User Performance Versus Precision Measures for Simple Search Tasks. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 11–18). • Urbano, J., Downie, J. S., Mcfee, B., & Schedl, M. (2012). How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval. In International Society for Music Information Retrieval Conference (pp. 181–186).
    • Reliability • Allan, J., Aslam, J. A., Carterette, B., Pavlu, V., & Kanoulas, E. (2008). Million Query Track 2008 Overview. In Text REtrieval Conference. • Allan, J., Carterette, B., Aslam, J. A., Pavlu, V., Dachev, B., & Kanoulas, E. (2007). Million Query Track 2007 Overview. In Text REtrieval Conference. • Armstrong, T. G., Moffat, A., Webber, W., & Zobel, J. (2009). Improvements that Don’t Add Up: Ad-Hoc Retrieval Results since 1998. In ACM International Conference on Information and Knowledge Management (pp. 601–610). • Banks, D., Over, P., & Zhang, N.-F. (1999). Blind Men and Elephants: Six Approaches to TREC data. Information Retrieval, 1(1-2), 7–34. • Bodoff, D. (2008). Test Theory for Evaluating Reliability of IR Test Collections. Information Processing and Management, 44(3), 1117–1145. • Bodoff, D., & Li, P. (2007). Test Theory for Assessing IR Test Collections. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 367–374). • Brennan, R. L. (2001). Generalizability Theory. Springer. • Buckley, C., & Voorhees, E. M. (2000). Evaluating Evaluation Measure Stability. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 33–34). • Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009). Million Query Track 2009 Overview. In Text REtrieval Conference. • Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2008). Evaluation Over Thousands of Queries. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 651–658). • Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2009). If I Had a Million Queries. In European Conference on Information Retrieval (pp. 288–300). • Lin, W.-H., & Hauptmann, A. (2005). Revisiting the Effect of Topic Set Size on Retrieval Error. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 637–638).
    • Reliability • Cormack, G. V., & Lynam, T. R. (2006). Statistical Precision of Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 533–540). • Robertson, S., & Kanoulas, E. (2012). On Per-Topic Variance in IR Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 891–900). • Sakai, T. (2007). On the Reliability of Information Retrieval Metrics Based on Graded Relevance. Information Processing and Management, 43(2), 531–548. • Sanderson, M., & Zobel, J. (2005). Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 162–169). • Sanderson, M., Turpin, A., Zhang, Y., & Scholer, F. (2012). Differences in Effectiveness Across Sub-collections. In ACM International Conference on Information and Knowledge Management (pp. 1965–1969). • Shavelson, R. J., & Webb, N. M. (1991). Generalizability Theory: A Primer. Sage Publications. • Smucker, M. D., Allan, J., & Carterette, B. (2007). A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In ACM International Conference on Information and Knowledge Management (pp. 623– 632). • Urbano, J., Marrero, M., & Martín, D. (2013). A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 925–928). • Urbano, J., Marrero, M., & Martín, D. (2013). On the Measurement of Test Collection Reliability. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 393–402). • Voorhees, E. M. (2000). Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. Information Processing and Management, 36(5), 697–716. • Voorhees, E. M. (2009). Topic Set Size Redux. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 806–807).
    • Reliability • Voorhees, E. M., & Buckley, C. (2002). The Effect of Topic Set Size on Retrieval Experiment Error. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 316–323). • Webber, W., Moffat, A., & Zobel, J. (2008). Statistical Power in Retrieval Experimentation. In ACM International Conference on Information and Knowledge Management (pp. 571–580). • Yilmaz, E., Aslam, J. A., & Robertson, S. (2008). A New Rank Correlation Coefficient for Information Retrieval. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 587–594). • Zobel, J. (1998). How Reliable are the Results of Large-Scale Information Retrieval Experiments? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 307–314).