Validity and Reliability of
Cranfield-like Evaluation
in Information Retrieval
Julián Urbano
Picture by Tom Parnell · Glasgow, Scotland · September 2013
Talk outline
• Why we want to Evaluate…
• …and what we do with Cranfield
• Validity: users versus systems
• Reliability: estimating from samples
Why we want to Evaluate
The two questions
• How good is my system?
– What does good mean?
– What is good enough?
• Is system A better than system B?
– What does better mean?
– How much better?
• Efficiency? Effectiveness? Ease?
Measure user experience
• Time to complete task
• Idle time
• Success rate
• Failure rate
• Frustration
• Ease to learn
• Ease to use
…and a long etcetera
We want to know some distributions
• For an arbitrary user, need and document
collection, what is the distribution of:
• They describe user experience, fully
[Figures: distributions of time to complete task (from 0 upward) and of frustration (none, some, much)]
The big(ger) picture
• Different user-measures attempting to assess the
same thing: user satisfaction
– How likely is it that an arbitrary user, with an arbitrary
need (and with an arbitrary document collection) will
be satisfied by the system?
• This is the ultimate goal: the good, the better
The big(ger) question
• User satisfaction… as a Bernoulli trial
• Probability of satisfaction?
• Probability that k in n users are satisfied?
• Probability of >80% users satisfied?
[Figure: satisfaction as a yes/no (Bernoulli) outcome]
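As an illustration of these three questions, here is a minimal numerical sketch. The satisfaction probability p = 0.7 and the group size n = 15 are hypothetical numbers chosen only for this example.

```python
# Minimal sketch of the satisfaction-as-Bernoulli-trial questions above.
# p and n are hypothetical values used only for illustration.
from scipy.stats import binom

p = 0.7   # assumed P(satisfaction) for an arbitrary user and need
n = 15    # assumed number of users

# Probability that exactly k of the n users are satisfied
k = 10
p_k_of_n = binom.pmf(k, n, p)

# Probability that more than 80% of the n users are satisfied
threshold = int(0.8 * n)                 # 12 users
p_over_80 = binom.sf(threshold, n, p)    # P(X > 12)

print(f"P(satisfied)             = {p}")
print(f"P({k} of {n} satisfied)     = {p_k_of_n:.3f}")
print(f"P(>80% of {n} satisfied)  = {p_over_80:.3f}")
```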
What we do with Cranfield
Sources of variability
user-measure = f(documents, need, user, system)
• Try to estimate the user-measure distribution
– Sample documents, needs and users
– Problematic
• Representativeness
• Cost
• Ethics
– Hard to replicate and repeat results
Fix samples
• Get a (hopefully) good sample and fix it
– Document collection
– Topic set
– A step towards reproducibility
• Still have to sample users, but can’t fix them!
– Very large source of variability
– Hard to replicate and repeat experiments
– Complex, costly, ethical issues
– Example: ASTIA-Uniterm studies
Simulate users…and fix them
• Cleverdon’s idea: remove users, but include a
static user component, fixed across experiments
– The judgments in the ground truth
• Remove all sources of variability, except systems
user-measure = f(documents, need, user, system)
user-measure = f(system)
Test collections
user-measure = f(system)
• Test collections are tools to estimate
distributions of user-measures
– Reproducibility becomes possible and easy
– Experiments are inexpensive (collections are not)
– Research becomes systematic
Wait a minute
• Are we estimating distributions about users or
distributions about systems?
system-effectiveness = f(system, measure)
• We come up with different distributions of
system-effectiveness, one per measure
• Each measure has its own assumptions
Assumption
• System-measures correspond to user-measures
Users: Time to complete task, Idle time, Success rate, Failure rate, Frustration, Ease to learn, Ease to use, Satisfaction, …
Systems: P, AP, RR, DCG, nDCG, ERR, GAP, Q, …
Assumption
• Well, at least we assume the correlation
– Are they correlated? How well?
• Test collections: estimators of user distributions
– What we want to measure: user satisfaction
– What we do measure: system effectiveness
Validity and Reliability
• Validity: are we measuring what we want to?
– External validity:
Are topics, documents and assessors representative?
– Construct validity:
Do system-measures correspond to user-measures?
– Conclusion validity:
Is system A really better than system B?
• Reliability: how repeatable are the results?
– How large do collections have to be to ensure
repeatability with a different sample?
Validity
Assumption
• Systems with better effectiveness are perceived
by users as more useful, more satisfactory
• Tricky: different effectiveness measures and
relevance scales give different results
– Which one is better to predict satisfaction?
• The goal is user satisfaction, not system
effectiveness
Mapping
• Try to map system effectiveness onto user
satisfaction, experimentally
• If P@10 = 0.2, how likely is it that the user will
find the results satisfactory?
• What if DCG@20 = 0.467?
• What if ERR = 0.9?
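The sketch below shows one simple way such a mapping could be estimated: collect pairs of (effectiveness score, user reported satisfaction) in a user study and estimate P(Sat) within effectiveness bins. The `observations` data and all names are hypothetical; the 10 bins mirror the binning used later in the experiments.

```python
# Minimal sketch: estimate P(Sat | effectiveness) from labeled examples.
# `observations` is hypothetical data: (effectiveness score in [0,1],
# whether the user reported satisfaction).
import numpy as np

observations = [(0.15, False), (0.22, False), (0.48, True),
                (0.51, False), (0.63, True), (0.87, True), (0.91, True)]

scores = np.array([s for s, _ in observations])
satisfied = np.array([sat for _, sat in observations], dtype=float)

# 10 effectiveness bins: [0, 0.1), [0.1, 0.2), ..., [0.9, 1]
bins = np.linspace(0, 1, 11)
which_bin = np.clip(np.digitize(scores, bins) - 1, 0, 9)

for b in range(10):
    mask = which_bin == b
    if mask.any():
        p_sat = satisfied[mask].mean()  # fraction of satisfied users in the bin
        print(f"lambda in [{bins[b]:.1f}, {bins[b+1]:.1f}): P(Sat) ~ {p_sat:.2f}")
```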
User-oriented System-measures
• Effectiveness measures are generally not formulated to correlate with user satisfaction
• If effectiveness is 0, we expect 0% probability of
user satisfaction
• If effectiveness is 1, we expect 100% probability
• If effectiveness is 𝜆, we expect 100𝜆%
• But this is not what we have
Unbounded measures

DCG@k = Σ_{i=1..k} gain(r_i) / discount(i)

• Upper bound depends on cutoff, gain function and relevance scale
– Normalize effectiveness between 0 and 1
– What is the best we can do with k documents?

DCG@k = [Σ_{i=1..k} gain(r_i)/discount(i)] / [Σ_{i=1..k} gain(r*)/discount(i)]

(r*: maximum relevance level)
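A minimal sketch of this normalization, assuming a linear gain and a log2 rank discount; both are illustrative choices, not necessarily the ones used in the experiments.

```python
# Sketch of the user-oriented normalization described above: divide DCG@k by
# the best score achievable with k documents, i.e. a ranking where every
# document has the maximum relevance level.
# Linear gain and log2 discount are assumptions for illustration.
import math

def dcg_at_k(rels, k, max_rel):
    gain = lambda r: r                       # linear gain (assumption)
    discount = lambda i: math.log2(i + 1)    # rank discount, ranks start at 1
    raw = sum(gain(r) / discount(i) for i, r in enumerate(rels[:k], start=1))
    best = sum(gain(max_rel) / discount(i) for i in range(1, k + 1))
    return raw / best                        # now bounded in [0, 1]

# Hypothetical ranking judged on a 3-level scale (0, 1, 2)
print(dcg_at_k([2, 1, 0, 2, 1], k=5, max_rel=2))
```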
Recall-oriented measures

AP@k = (1/ℛ₁) Σ_{i=1..k} r_i · P@i        (ℛ₁: number of relevant documents)

• AP@k = 1 is only possible if k ≥ ℛ₁
• Reformulate towards users
– What is the best we can do with k documents, regardless of the judgments in the ground truth?

AP@k = (1/k) Σ_{i=1..k} r_i · P@i
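A minimal sketch of this user-oriented reformulation, assuming binary relevance judgments; the example ranking is hypothetical.

```python
# Sketch of the user-oriented AP@k described above: normalize by the cutoff k
# (the best achievable with k documents) instead of the number of relevant
# documents in the ground truth. Binary relevance judgments are assumed.
def ap_at_k(rels, k):
    rels = rels[:k]
    score = 0.0
    for i, r in enumerate(rels, start=1):
        if r:                                # document at rank i is relevant
            p_at_i = sum(rels[:i]) / i       # precision at rank i
            score += p_at_i
    return score / k                         # user-oriented normalization

# Hypothetical ranking: relevant, non-relevant, relevant, relevant, non-relevant
print(ap_at_k([1, 0, 1, 1, 0], k=5))
```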
Ideal ranking

nDCG@k = [Σ_{i=1..k} gain(r_i)/discount(i)] / [Σ_{i=1..k} gain(ideal_i)/discount(i)]

• If there is only one relevant document, nDCG@10 = 1 even if we retrieve nine non-relevant ones
• Assume instead that the ideal ranking has only excellent documents, with maximum relevance r*

nDCG@k = [Σ_{i=1..k} gain(r_i)/discount(i)] / [Σ_{i=1..k} gain(r*)/discount(i)]

• This is basically the user-oriented DCG@k
Audio Music Similarity
• A song (audio signal) as input to the system
• Retrieve songs musically similar to it, by content
• Resembles traditional Ad Hoc retrieval in Text IR
• (Most?) important task in Music IR
– Music recommendation
– Playlist generation
– Plagiarism detection
Measures
• All reformulated, user-oriented
– What is the best we can do under the user model?
• Binary
– P, AP, RR
• Graded
– CG, DCG, Q, RBP, ERR, GAP, ADR, EDCG
– Linear and exponential gains
Relevance scales
• Originally used
– Broad: 3 levels
– Fine: 101 levels
• Artificially made from the Fine scale
– Graded with 3, 4 and 5 levels, evenly spaced
– Binary, with thresholds equal to 20, 40, 60 and 80
Measures and Scales
Measure | Original: Broad, Fine | Artificial graded: nℒ = 3, nℒ = 4, nℒ = 5 | Artificial binary: ℓmin = 20, ℓmin = 40, ℓmin = 60, ℓmin = 80
(an x marks a measure/scale combination that was evaluated; a measure name indicates the combination reduces to that measure)
𝑃@5 x x x x
𝐴𝑃@5 x x x x
𝑅𝑅@5 x x x x
𝐶𝐺𝑙@5 x x x x x 𝑃@5 𝑃@5 𝑃@5 𝑃@5
𝐶𝐺𝑒@5 x x x x 𝑃@5 𝑃@5 𝑃@5 𝑃@5
𝐷𝐶𝐺𝑙@5 x x x x x x x x x
𝐷𝐶𝐺𝑒@5 x x x x 𝐷𝐶𝐺𝑙@5 𝐷𝐶𝐺𝑙@5 𝐷𝐶𝐺𝑙@5 𝐷𝐶𝐺𝑙@5
𝐸𝐷𝐶𝐺𝑙@5 x x x x x x x x x
𝐸𝐷𝐶𝐺𝑒@5 x x x x 𝐸𝐷𝐶𝐺𝑙@5 𝐸𝐷𝐶𝐺𝑙@5 𝐸𝐷𝐶𝐺𝑙@5 𝐸𝐷𝐶𝐺𝑙@5
𝑄𝑙@5 x x x x x 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5
𝑄 𝑒@5 x x x x 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5
𝑅𝐵𝑃𝑙@5 x x x x x x x x x
𝑅𝐵𝑃𝑒@5 x x x x 𝑅𝐵𝑃𝑙@5 𝑅𝐵𝑃𝑙@5 𝑅𝐵𝑃𝑙@5 𝑅𝐵𝑃𝑙@5
𝐸𝑅𝑅𝑙@5 x x x x x x x x x
𝐸𝑅𝑅 𝑒@5 x x x x 𝐸𝑅𝑅𝑙@5 𝐸𝑅𝑅𝑙@5 𝐸𝑅𝑅𝑙@5 𝐸𝑅𝑅𝑙@5
𝐺𝐴𝑃@5 x x x x x 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5
𝐴𝐷𝑅@5 x x x x x x x x
Experimental Design
• User preference (agrees or disagrees with effectiveness)
• Non-preference (can't decide)
What can we infer?
• Preference: difference noticed by user
– Positive: user agrees with evaluation
– Negative: user disagrees with evaluation
• Non-preference: difference not noticed by user
– Good: both systems are satisfactory
– Bad: both systems are not satisfactory
Data
• Queries, documents and judgments from MIREX
– MIREX: TREC-like evaluation forum in Music IR
• 4,115 unique, artificially constructed examples
– Covering the full range of effectiveness
• In 10 bins: [0, 0.1), [0.1, 0.2), …, [0.9, 1]
– At least 200 examples per measure/scale/bin
• 432 unique queries, 5,636 unique documents
Collecting User Preferences
• Crowdsourcing
– Quality control through trap examples
• Total: 547 unique subjects, 11,042 preferences
• Accepted: 175 subjects, 9,373 preferences
• After trap questions: 113 subjects
Single system: how good is it?
• 2,045 non-preferences (49%)
– 1,056 satisfactory
– 969 non-satisfactory
• What do we expect? A linear mapping from effectiveness to satisfaction
• Observed in the figures:
– Large thresholds underestimate satisfaction
– Ranking does not seem to affect satisfaction
– Exponential gain underestimates satisfaction
Single system: how good is it?
• Measures that best adhere to the diagonal
– 𝐶𝐺𝑙@5, 𝐷𝐶𝐺𝑙@5 and 𝑅𝐵𝑃𝑙@5
– Not necessarily better: just easier to interpret
• About 20% bias at endpoints
– Room for improvement with personalization
• Less sensitive to subjectivity in relevance
– Minimize P(Sat | 0) and maximize P(Sat | 1)
– ℓ 𝑚𝑖𝑛 = 40 and 𝐵𝑟𝑜𝑎𝑑 behave better
– 𝐶𝐺@5, 𝐷𝐶𝐺@5, 𝑅𝐵𝑃@5 and 𝐺𝐴𝑃@5
Two systems: which one is better?
• 2,090 preferences (51%)
– 1,019 for system A
– 1,071 for system B
• What do we expect? That users always notice the difference, regardless of how large it is
• Observed in the figures:
– Quite large differences in effectiveness are needed!
– More relevance levels are better to discriminate
– Bad correlation? Sometimes users prefer the (supposedly) worse system
User Agrees with Evaluation
• Closer to the ideal P(Agg = 1 | Δλ = 1)
– ℓ 𝑚𝑖𝑛 = 80 better among binaries
– 𝐹𝑖𝑛𝑒 better for linear gain
– 𝑛ℒ = 5 better for exponential gain
– 𝐶𝐺@5, 𝐷𝐶𝐺@5, 𝑅𝐵𝑃@5 and 𝐺𝐴𝑃@5
User Disagrees with Evaluation
• Closer to the ideal P(Agg = −1 | Δλ = 0)
– ℓ 𝑚𝑖𝑛 = 40 better among binaries
– 𝐹𝑖𝑛𝑒 better for linear gain
– 𝐵𝑟𝑜𝑎𝑑 better with exponential gain
– 𝐶𝐺@5, 𝐺𝐴𝑃@5, 𝐷𝐶𝐺@5 and 𝑅𝐵𝑃@5
Summary
• Linear gain better than exponential gain
– Except, slightly, in terms of disagreements
• Measures oriented to a single document are not
appropriate for a music recommendation setting
• Gain is independent of other documents
• 𝐵𝑟𝑜𝑎𝑑 better to predict satisfaction
• 𝐹𝑖𝑛𝑒 better to predict user agreement
• Binary scales worst overall
Summary
• We can map system effectiveness onto
probability of user satisfaction
• ~20% of users disagree with effectiveness
– Practical upper (and lower) bound in evaluation
– Need to incorporate user profiles
• Somehow included in MSD Challenge
• Δ𝜆 ≈ 0.4 needed for users to agree
– Historically observed only 20% of the time in MIREX
– Be careful with statistical significance!
Satisfaction over samples
User Satisfaction
• So far only for a query and a user (Bernoulli)
– P(Sat | λ_q)
• Easily extended to n users (Binomial)
– P(Sat_n = k | λ_q)
• Example: Q_l@5 = 0.61
– P(Sat) ≈ 0.7
– P(Sat_15 = 10) ≈ 0.21
• What about a sample of queries 𝒬?
User Satisfaction over a Sample
E[P(Sat)] = (1/n_𝒬) Σ_{q∈𝒬} P(Sat | λ_q)

• Example: satisfaction is underestimated
System Success
• If P(Sat) ≥ threshold, the system is successful
• If we want the majority of users to be satisfied
– P(Succ) = 1 − F_P(Sat)(0.5)
• Intuition: improving bad queries is more worthwhile than further improving good ones
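A minimal sketch of both quantities, assuming per-query effectiveness has already been mapped onto satisfaction probabilities P(Sat | λ_q); the values in `p_sat` are hypothetical.

```python
# Sketch: expected satisfaction and probability of success over a query sample.
# `p_sat` holds hypothetical per-query values of P(Sat | lambda_q), i.e.
# effectiveness already mapped onto satisfaction probabilities.
import numpy as np

p_sat = np.array([0.82, 0.35, 0.71, 0.64, 0.48, 0.90, 0.55, 0.30])

# Expected probability of satisfaction for an arbitrary query
e_p_sat = p_sat.mean()

# Probability of success: fraction of queries where the majority of users
# would be satisfied, i.e. P(Succ) = 1 - F_{P(Sat)}(0.5)
p_succ = (p_sat > 0.5).mean()

print(f"E[P(Sat)] = {e_p_sat:.3f}")
print(f"P(Succ)   = {p_succ:.3f}")
```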
System Success
• Example:
– E[Δλ] = −0.0021
– E[ΔP(Sat)] = 0.0011
– E[ΔP(Succ)] = 0.07
Summary
• Need to consider full distributions
– Always average or good on average?
• Modeling full distribution
– Normal for small query sets, Empirical for large
– Beta always better for 𝐹𝑖𝑛𝑒 scale
Summary
• Intuitive interpretations of effectiveness fail
– Contradictory results in terms of user satisfaction
Reliability
Samples
• Test collections are samples from larger, possibly
infinite, populations
– Documents, queries and users
• Δ𝜆 is just an estimate of the population mean 𝜇Δ𝜆
• How reliable is our conclusion?
Reliability vs Cost
• Building reliable collections is easy
• Just use more documents, queries and assessors
• But it is prohibitively expensive
• Best option is to increase query set size
– Largest source of variability
• How many queries?
– First we need to measure reliability
Data-based approach
1. Randomly split query set
2. Compute indicators of reliability based on
these two query subsets
3. Extrapolate to larger query sets
…with some variations
Data-based reliability indicators
• Compare results with two collections
– Kendall tau correlation
– AP correlation
– Absolute sensitivity
– Relative sensitivity
– Power ratio
– Minor conflict ratio
– Major conflict ratio
– RMSE
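A minimal sketch of the split-half idea behind these indicators, computing only the Kendall tau correlation between the two system rankings; the systems-by-queries score matrix is randomly generated for illustration.

```python
# Minimal sketch of the data-based approach: randomly split the query set in
# two halves, score systems on each half, and compare the two system rankings
# with Kendall's tau (one of the indicators listed above).
# The systems-by-queries score matrix is hypothetical.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(42)
scores = rng.uniform(0, 1, size=(10, 50))      # 10 systems x 50 queries

queries = rng.permutation(scores.shape[1])
half_a, half_b = queries[:25], queries[25:]

mean_a = scores[:, half_a].mean(axis=1)        # per-system mean on each half
mean_b = scores[:, half_b].mean(axis=1)

tau, _ = kendalltau(mean_a, mean_b)
print(f"Kendall tau between the two half-collection rankings: {tau:.2f}")
```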
Generalizability Theory approach
• Address variability of scores, not just means
• G-study
– Estimate variance components from previous,
representative data
– Usually previous test collections
• D-study
– Estimate reliability based on estimated variance
components from G-study
G-study

σ² = σ²_s + σ²_q + σ²_{s:q}

• Estimated with Analysis of Variance
– σ²_s: system differences, our goal!
– σ²_q: query difficulty
– σ²_{s:q}: some systems better for some queries
D-study
• Relative stability: Eρ² = σ²_s / (σ²_s + σ²_{s:q} / n′_q)
• Absolute stability: Φ = σ²_s / (σ²_s + (σ²_q + σ²_{s:q}) / n′_q)
• Easy to estimate how many queries we need to reach a certain stability level (1MQ track)
– ≈80 queries sufficient for stable rankings
– ≈130 queries for stable absolute scores
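A minimal sketch of the G-study and D-study computations for a fully crossed systems-by-queries design with one observation per cell: variance components are estimated from the ANOVA mean squares and plugged into the two stability formulas above. The score matrix is synthetic and all variable names are illustrative.

```python
# Sketch of a G-study and D-study for a fully crossed system x query design.
# Synthetic effectiveness matrix with hypothetical system, query and
# interaction effects, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n_s, n_q = 20, 50
sys_effect = rng.normal(0, 0.05, size=(n_s, 1))   # hypothetical system differences
qry_effect = rng.normal(0, 0.10, size=(1, n_q))   # hypothetical query difficulty
noise = rng.normal(0, 0.08, size=(n_s, n_q))      # interaction and error
scores = 0.5 + sys_effect + qry_effect + noise    # systems x queries matrix

grand = scores.mean()
sys_means = scores.mean(axis=1)
qry_means = scores.mean(axis=0)

# ANOVA mean squares
ms_s = n_q * np.sum((sys_means - grand) ** 2) / (n_s - 1)
ms_q = n_s * np.sum((qry_means - grand) ** 2) / (n_q - 1)
resid = scores - sys_means[:, None] - qry_means[None, :] + grand
ms_sq = np.sum(resid ** 2) / ((n_s - 1) * (n_q - 1))

# G-study: variance components
var_sq = ms_sq                         # system:query interaction (and error)
var_s = max((ms_s - ms_sq) / n_q, 0)   # system component
var_q = max((ms_q - ms_sq) / n_s, 0)   # query component

# D-study: stability for a (possibly different) number of queries n_q_prime
n_q_prime = 50
e_rho2 = var_s / (var_s + var_sq / n_q_prime)
phi = var_s / (var_s + (var_q + var_sq) / n_q_prime)
print(f"Erho2 = {e_rho2:.3f}   Phi = {phi:.3f}")
```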
G-Theory approach
• How sensitive is the D-study to the initial data
used in the G-study?
• How should we interpret G-Theory indicators in practice? What does Eρ² = 0.95 mean?
• From the above, review reliability of over 40
TREC test collections
Data
• 43 TREC collections
– From TREC 3 to TREC 2011
• 12 tasks across 10 tracks
– Ad hoc, Web, Novelty, Genomics, Robust, Terabyte,
Enterprise, Million Query, Medical and Microblog
Sensitivity: experiment
• Vary number of queries in G-study
– From n_q = 5 to the full set
– Use all runs available
• Run D-study
– Compute Eρ² and Φ
– Compute n′_q needed to reach 0.95 stability
• 200 random trials
Variability due to queries
• We may get Eρ² = 0.9 or Eρ² = 0.3, depending on what 10 queries we use
Sensitivity: experiment
• Do the same, but vary the number of systems
– From n_s = 5 to the full set
– Use all queries available
• 200 random trials
Variability due to systems
• We may get Eρ² = 0.9 or Eρ² = 0.5, depending on what 20 systems we use
Results
• G-Theory is very sensitive to the initial data
– Need about 50 queries and 50 systems for differences in Eρ² and Φ below 0.1
• The number of queries needed for Eρ² = 0.95 may change by orders of magnitude
– Microblog 2011 (all 184 systems and 30 queries): need 63 to 133 queries
– Medical 2011 (all 34 queries and 40 systems): need 109 to 566 queries
Compute confidence intervals
• Account for the variability in the initial data
• Required number of queries to reach the lower end of the interval
Summary in TREC
• Eρ²: mean = 0.88, sd = 0.1
– 95% confidence intervals are 0.1 long
• Φ: mean = 0.74, sd = 0.2
– 95% confidence intervals are 0.19 long
Interpretation: experiment
• Split query set in 2 subsets
– From n_q = 10 to half the full set
– Use all runs available
• Run D-study
– Compute Eρ² and Φ and map them onto τ, sensitivity, power, conflicts, etc.
• 50 random trials
– Over 28,000 datapoints
*All mappings in the paper
Example: Eρ² → τ
• Eρ² = 0.95 → τ ≈ 0.85
• τ = 0.9 → Eρ² ≈ 0.97
• Figures: Million Query 2007 and Million Query 2008
Future predictions
• This allows us to make more informed decisions
within a collection
• What about a new collection?
– Fit a single model for each mapping with 90% and
95% prediction intervals
• Assess whether a larger collection is really worth
the effort
*All mappings in the paper
Example: Eρ² → τ
• Figure: current collection vs. target
Example: Φ → rel. sensitivity
Summary
• G-Theory is regarded as more appropriate, easier to use and more powerful for assessing reliability than the traditional data-based approaches
• But it is quite sensitive to the initial data used to estimate the variance components
– Data-based approaches are too!
• And it is almost impossible to interpret in practice
Summary
• Need about 50 queries and 50 systems to have
robust estimates of reliability
– That is a whole collection already!
– Need to use confidence intervals
• Previous interpretation overestimated reliability
– τ = 0.9 → Eρ² ≈ 0.97
– Eρ² = 0.95 → τ ≈ 0.85
Reliability: review of TREC collections
Outline
• Estimate Eρ² and Φ, with 95% confidence intervals, using the full query set
• Map onto 𝜏, sensitivity, power, conflicts, etc.
• Results within tasks offer a historical perspective
on reliability since 1994
*All collections and mappings in the paper
Example: Ad hoc 3-8
• Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.81]
• Minor conflicts ∈ [0.6, 8.2]%
• Major conflicts ∈ [0.02, 1.38]%
• Queries to get Eρ² = 0.95: [37, 233]
• Queries to get Φ = 0.95: [116, 999]
• 50 queries were used
Example: Web ad hoc
• TREC-8 to TREC-2001: WT2g and WT10g
– Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.81]
– Queries to get Eρ² = 0.95: [40, 220]
• TREC-2009 to TREC-2011: ClueWeb09
– Eρ² ∈ [0.8, 0.83] → τ ∈ [0.53, 0.59]
– Queries to get Eρ² = 0.95: [107, 438]
• 50 queries were used
Historical trend
• Decreasing within and across tracks?
Historical trend
• Systems getting better for specific problems?
Historical trend
• Increasing task-specificity in queries?
Historical reliability in TREC
• On average, Eρ² = 0.88 → τ ≈ 0.7
• Some collections clearly unreliable
– Web Distillation 2003, Genomics 2005, Terabyte 2006,
Enterprise 2008, Medical 2011 and Web Ad Hoc 2011
• 50 queries not enough for stable rankings, about
200 are needed in most cases
Implications
• Fixing a minimum number of queries across
tracks is unrealistic
– Not even across editions of the same task
• Need to analyze on a case-by-case basis, while
building the collections
– GT4IReval, an R package available online
Current and future work
Validity
• Similar studies in Text IR to map effectiveness
onto user satisfaction
• Particularly interesting because there are several
query types, and users behave differently
– Single measure to use in all cases?
– Use different measures and average them all?
• Further user studies to figure out what makes
users say good and better
• How should test collections be extended to
incorporate more user information?
Reliability
• Study assessor effect
• Study document collection effect
• Better models to map G-theory indicators onto
understandable data-based indicators
• Methods to reliably measure reliability while
building the collection
References
General
• Cleverdon, C. W. (1991). The Significance of the Cranfield Tests on Index Languages. In International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 3–12).
• Sanderson, M. (2010). Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends
in Information Retrieval, 4(4), 247–375.
• Robertson, S. (2008). On the History of Evaluation in IR. Journal of Information Science, 34(4), 439–456.
• Harman, D. K. (2011). Information Retrieval Evaluation. Synthesis Lectures on Information Concepts, Retrieval,
and Services, 3(2), 1–119.
• Voorhees, E. M. (2002). The Philosophy of Information Retrieval Evaluation. In Workshop of the Cross-Language
Evaluation Forum (pp. 355–370).
• Tague-Sutcliffe, J. (1992). The Pragmatics of Information Retrieval Experimentation, Revisited. Information
Processing and Management, 28(4), 467–490.
• Gull, C. D. (1956). Seven Years of Work on the Organisation of Materials in a Special Library. American
Documentation, 7(4), 320–329.
• Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in Music Information Retrieval. Journal of Intelligent
Information Systems.
• Urbano, J. (2013). Evaluation in Audio Music Similarity. PhD dissertation, University Carlos III of Madrid.
• Trochim, W. M. K., & Donnelly, J. P. (2007). The Research Methods Knowledge Base (3rd ed.). Atomic Dog
Publishing.
• Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for
Generalized Causal Inference. Houghton-Mifflin.
• Zobel, J., Webber, W., Sanderson, M., & Moffat, A. (2011). Principles for Robust Evaluation Infrastructure. In ACM
CIKM Workshop on Data infrastructures for Supporting Information Retrieval Evaluation.
Validity
• Allan, J., Carterette, B., & Lewis, J. (2005). When Will Information Retrieval Be “Good Enough”? In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 433–440).
• Al-Maskari, A., Sanderson, M., & Clough, P. (2007). The Relationship between IR Effectiveness Measures and User
Satisfaction. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp.
773–774).
• Al-Maskari, A., Sanderson, M., Clough, P., & Airio, E. (2008). The Good and the Bad System: Does the Test
Collection Predict User’s Effectiveness. In International ACM SIGIR Conference on Research and Development in
Information Retrieval (pp. 59–66).
• Bailey, P., Craswell, N., Soboroff, I., Thomas, P., Vries, A. P. de, & Yilmaz, E. (2008). Relevance Assessment: Are
Judges Exchangeable and Does it Matter? In International ACM SIGIR Conference on Research and Development
in Information Retrieval (pp. 667–674).
• Bennett, P. N., Carterette, B., Chapelle, O., & Joachims, T. (2008). Beyond Binary Relevance: Preferences, Diversity
and Set-Level Judgments. ACM SIGIR Forum, 42(2), 53–58.
• Carterette, B. (2011). System Effectiveness, User Models, and User Utility: A General Framework for
Investigation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp.
903–912).
• Carterette, B., Bennett, P. N., Chickering, D. M., & Dumais, S. T. (2008). Here or There: Preference Judgments for
Relevance. In European Conference on Information Retrieval (pp. 16–27).
• Carterette, B., & Soboroff, I. (2010). The Effect of Assessor Error on IR System Evaluation. In International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 539–546).
• Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., & Olson, D. (2000). Do Batch and User
Evaluations Give the Same Results? In International ACM SIGIR Conference on Research and Development in
Information Retrieval (pp. 17–24).
Validity
• Hersh, W., Turpin, A., Sacherek, L., Olson, D., Price, S., Chan, B., & Kraemer, D. (2000). Further Analysis of
Whether Batch and User Evaluations Give the Same Results With a Question-Answering Task. In Text REtrieval
Conference.
• Hu, X., & Kando, N. (2012). User-Centered Measures vs. System Effectiveness in Finding Similar Songs. In
International Society for Music Information Retrieval Conference (pp. 331–336).
• Huffman, S. B., & Hochster, M. (2007). How Well does Result Relevance Predict Session Satisfaction? In
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 567–573).
• Ingwersen, P., & Järvelin, K. (2005). The Turn: Integration of Information Seeking and Retrieval in Context.
Springer.
• Järvelin, K. (2011). IR Research: Systems, Interaction, Evaluation and Theories. ACM SIGIR Forum, 45(2), 17–31.
• Mizzaro, S. (1997). Relevance: The Whole History. Journal of the American Society for Information Science, 48(9),
810–832.
• Sanderson, M., Paramita, M. L., Clough, P., & Kanoulas, E. (2010). Do User Preferences and Evaluation Measures
Line Up? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 555–
562).
• Schedl, M., Flexer, A., & Urbano, J. (2013). The Neglected User in Music Information Retrieval Research. Journal
of Intelligent Information Systems.
• Schedl, M., Stober, S., Gómez, E., Orio, N., & Liem, C. C. S. (2012). User-Aware Music Retrieval. In M. Müller, M.
Goto, & M. Schedl (Eds.), Multimodal Music Processing (pp. 135–156). Dagstuhl Publishing.
• Scholer, F., & Turpin, A. (2008). Relevance Thresholds in System Evaluations. In International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 693–694).
Validity
• Smucker, M. D., & Clarke, C. L. A. (2012). The Fault, Dear Researchers, is Not in Cranfield, But in Our Metrics, that
They Are Unrealistic. In European Workshop on Human-Computer Interaction and Information Retrieval (pp. 11–
12).
• Thom, J. A., & Scholer, F. (2007). A Comparison of Evaluation Measures Given How Users Perform on Search
Tasks. In Australasian Document Computing Symposium (pp. 100–103).
• Turpin, A., & Hersh, W. (2001). Why Batch and User Evaluations Do Not Give the Same Results. In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 225–231).
• Turpin, A., & Hersh, W. (2002). User Interface Effects in Past Batch Versus User Experiments. In International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 431–432).
• Turpin, A., & Scholer, F. (2006). User Performance Versus Precision Measures for Simple Search Tasks. In
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 11–18).
• Urbano, J., Downie, J. S., Mcfee, B., & Schedl, M. (2012). How Significant is Statistically Significant? The case of
Audio Music Similarity and Retrieval. In International Society for Music Information Retrieval Conference (pp.
181–186).
Reliability
• Allan, J., Aslam, J. A., Carterette, B., Pavlu, V., & Kanoulas, E. (2008). Million Query Track 2008 Overview. In Text
REtrieval Conference.
• Allan, J., Carterette, B., Aslam, J. A., Pavlu, V., Dachev, B., & Kanoulas, E. (2007). Million Query Track 2007
Overview. In Text REtrieval Conference.
• Armstrong, T. G., Moffat, A., Webber, W., & Zobel, J. (2009). Improvements that Don’t Add Up: Ad-Hoc Retrieval
Results since 1998. In ACM International Conference on Information and Knowledge Management (pp. 601–610).
• Banks, D., Over, P., & Zhang, N.-F. (1999). Blind Men and Elephants: Six Approaches to TREC data. Information
Retrieval, 1(1-2), 7–34.
• Bodoff, D. (2008). Test Theory for Evaluating Reliability of IR Test Collections. Information Processing and
Management, 44(3), 1117–1145.
• Bodoff, D., & Li, P. (2007). Test Theory for Assessing IR Test Collections. In International ACM SIGIR Conference on
Research and Development in Information Retrieval (pp. 367–374).
• Brennan, R. L. (2001). Generalizability Theory. Springer.
• Buckley, C., & Voorhees, E. M. (2000). Evaluating Evaluation Measure Stability. In International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 33–34).
• Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009). Million Query Track 2009 Overview. In Text REtrieval
Conference.
• Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2008). Evaluation Over Thousands of Queries. In
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 651–658).
• Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2009). If I Had a Million Queries. In European
Conference on Information Retrieval (pp. 288–300).
• Lin, W.-H., & Hauptmann, A. (2005). Revisiting the Effect of Topic Set Size on Retrieval Error. In International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 637–638).
Reliability
• Cormack, G. V., & Lynam, T. R. (2006). Statistical Precision of Information Retrieval Evaluation. In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 533–540).
• Robertson, S., & Kanoulas, E. (2012). On Per-Topic Variance in IR Evaluation. In International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 891–900).
• Sakai, T. (2007). On the Reliability of Information Retrieval Metrics Based on Graded Relevance. Information
Processing and Management, 43(2), 531–548.
• Sanderson, M., & Zobel, J. (2005). Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability. In
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 162–169).
• Sanderson, M., Turpin, A., Zhang, Y., & Scholer, F. (2012). Differences in Effectiveness Across Sub-collections. In
ACM International Conference on Information and Knowledge Management (pp. 1965–1969).
• Shavelson, R. J., & Webb, N. M. (1991). Generalizability Theory: A Primer. Sage Publications.
• Smucker, M. D., Allan, J., & Carterette, B. (2007). A Comparison of Statistical Significance Tests for Information
Retrieval Evaluation. In ACM International Conference on Information and Knowledge Management (pp. 623–
632).
• Urbano, J., Marrero, M., & Martín, D. (2013). A Comparison of the Optimality of Statistical Significance Tests for
Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in
Information Retrieval (pp. 925–928).
• Urbano, J., Marrero, M., & Martín, D. (2013). On the Measurement of Test Collection Reliability. In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 393–402).
• Voorhees, E. M. (2000). Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness.
Information Processing and Management, 36(5), 697–716.
• Voorhees, E. M. (2009). Topic Set Size Redux. In International ACM SIGIR Conference on Research and
Development in Information Retrieval (pp. 806–807).
Reliability
• Voorhees, E. M., & Buckley, C. (2002). The Effect of Topic Set Size on Retrieval Experiment Error. In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 316–323).
• Webber, W., Moffat, A., & Zobel, J. (2008). Statistical Power in Retrieval Experimentation. In ACM International
Conference on Information and Knowledge Management (pp. 571–580).
• Yilmaz, E., Aslam, J. A., & Robertson, S. (2008). A New Rank Correlation Coefficient for Information Retrieval. In
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 587–594).
• Zobel, J. (1998). How Reliable are the Results of Large-Scale Information Retrieval Experiments? In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 307–314).
More Related Content

Similar to Validity and Reliability of Cranfield-like Evaluation in Information Retrieval

Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityJulián Urbano
 
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...Alejandro Bellogin
 
Analytic emperical Mehods
Analytic emperical MehodsAnalytic emperical Mehods
Analytic emperical MehodsM Surendar
 
e3-chap-09.ppt
e3-chap-09.ppte3-chap-09.ppt
e3-chap-09.pptKingSh2
 
ICIS2021 Making the Crowd Wiser: (Re)combination through teaming
ICIS2021 Making the Crowd Wiser: (Re)combination through teamingICIS2021 Making the Crowd Wiser: (Re)combination through teaming
ICIS2021 Making the Crowd Wiser: (Re)combination through teamingssuserb4c6711
 
Metrics in usability testing and user experiences
Metrics in usability testing and user experiencesMetrics in usability testing and user experiences
Metrics in usability testing and user experiencesHim Chitchat
 
Design, Create, Evaluate Process (1).pptx
Design, Create, Evaluate Process (1).pptxDesign, Create, Evaluate Process (1).pptx
Design, Create, Evaluate Process (1).pptxLe Hung
 
Data and Information Details and Differences
Data and Information Details and DifferencesData and Information Details and Differences
Data and Information Details and DifferencesSaurabh846965
 
'A critique of testing' UK TMF forum January 2015
'A critique of testing' UK TMF forum January 2015 'A critique of testing' UK TMF forum January 2015
'A critique of testing' UK TMF forum January 2015 Georgina Tilby
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmVaibhav Varshney
 
Simulation Models as a Research Method.ppt
Simulation Models as a Research Method.pptSimulation Models as a Research Method.ppt
Simulation Models as a Research Method.pptQidiwQidiwQidiw
 
Usability Evaluation
Usability EvaluationUsability Evaluation
Usability EvaluationSaqib Shehzad
 
evaluation technique uni 2
evaluation technique uni 2evaluation technique uni 2
evaluation technique uni 2vrgokila
 
User Experiments in Human-Computer Interaction
User Experiments in Human-Computer InteractionUser Experiments in Human-Computer Interaction
User Experiments in Human-Computer InteractionDr. Arindam Dey
 

Similar to Validity and Reliability of Cranfield-like Evaluation in Information Retrieval (20)

Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music Similarity
 
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...
HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of E...
 
Analytic emperical Mehods
Analytic emperical MehodsAnalytic emperical Mehods
Analytic emperical Mehods
 
Paper prototype evaluation
Paper prototype evaluationPaper prototype evaluation
Paper prototype evaluation
 
Evaluation techniques
Evaluation techniquesEvaluation techniques
Evaluation techniques
 
e3-chap-09.ppt
e3-chap-09.ppte3-chap-09.ppt
e3-chap-09.ppt
 
E3 chap-09
E3 chap-09E3 chap-09
E3 chap-09
 
E3 chap-09
E3 chap-09E3 chap-09
E3 chap-09
 
Human Computer Interaction Evaluation
Human Computer Interaction EvaluationHuman Computer Interaction Evaluation
Human Computer Interaction Evaluation
 
ICIS2021 Making the Crowd Wiser: (Re)combination through teaming
ICIS2021 Making the Crowd Wiser: (Re)combination through teamingICIS2021 Making the Crowd Wiser: (Re)combination through teaming
ICIS2021 Making the Crowd Wiser: (Re)combination through teaming
 
Metrics in usability testing and user experiences
Metrics in usability testing and user experiencesMetrics in usability testing and user experiences
Metrics in usability testing and user experiences
 
Design, Create, Evaluate Process (1).pptx
Design, Create, Evaluate Process (1).pptxDesign, Create, Evaluate Process (1).pptx
Design, Create, Evaluate Process (1).pptx
 
Data and Information Details and Differences
Data and Information Details and DifferencesData and Information Details and Differences
Data and Information Details and Differences
 
'A critique of testing' UK TMF forum January 2015
'A critique of testing' UK TMF forum January 2015 'A critique of testing' UK TMF forum January 2015
'A critique of testing' UK TMF forum January 2015
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic Algorithm
 
Simulation Models as a Research Method.ppt
Simulation Models as a Research Method.pptSimulation Models as a Research Method.ppt
Simulation Models as a Research Method.ppt
 
Usability Evaluation
Usability EvaluationUsability Evaluation
Usability Evaluation
 
evaluation technique uni 2
evaluation technique uni 2evaluation technique uni 2
evaluation technique uni 2
 
User Experiments in Human-Computer Interaction
User Experiments in Human-Computer InteractionUser Experiments in Human-Computer Interaction
User Experiments in Human-Computer Interaction
 
MIS Unit-2.pptx
MIS Unit-2.pptxMIS Unit-2.pptx
MIS Unit-2.pptx
 

More from Julián Urbano

Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Julián Urbano
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowJulián Urbano
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationJulián Urbano
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationJulián Urbano
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured DocumentsJulián Urbano
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...Julián Urbano
 
A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...Julián Urbano
 
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...Julián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackJulián Urbano
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...Julián Urbano
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Julián Urbano
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityJulián Urbano
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...Julián Urbano
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Julián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...Julián Urbano
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Julián Urbano
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityJulián Urbano
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Julián Urbano
 
Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsJulián Urbano
 

More from Julián Urbano (20)

Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
 
Your PhD and You
Your PhD and YouYour PhD and You
Your PhD and You
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP Correlation
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR Evaluation
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
 
A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...
 
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
 
Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered Lists
 

Recently uploaded

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
How to convert PDF to text with Nanonets

Validity and Reliability of Cranfield-like Evaluation in Information Retrieval

  • 21. Assumption • Well, at least we assume the correlation – Are they correlated? How well? • Test collections: estimators of user distributions – What we want to measure: user satisfaction – What we do measure: system effectiveness
  • 22. Validity and Reliability • Validity: are we measuring what we want to? – External validity: Are topics, documents and assessors representative? – Construct validity: Do system-measures correspond to user-measures? – Conclusion validity: Is system A really better than system B? • Reliability: how repeatable are the results? – How large do collections have to be to ensure repeatability with a different sample?
  • 24. Assumption • Systems with better effectiveness are perceived by users as more useful, more satisfactory • Tricky: different effectiveness measures and relevance scales give different results – Which one is better to predict satisfaction? • The goal is user satisfaction, not system effectiveness
  • 25. Mapping • Try to map system effectiveness onto user satisfaction, experimentally • If P@10 = 0.2, how likely is it that the user will find the results satisfactory? • What if DCG@20 = 0.467? • What if ERR = 0.9?
  • 26. User-oriented System-measures • Effectiveness measures are generally not formulated to correlate with user-satisfaction • If effectiveness is 0, we expect 0% probability of user satisfaction • If effectiveness is 1, we expect 100% probability • If effectiveness is 𝜆, we expect 100𝜆% • But this is not what we have
  • 27. Unbounded measures $DCG@k = \sum_{i=1}^{k} \frac{gain(r_i)}{discount(i)}$ • Upper bound depends on cutoff, gain function and relevance scale – Normalize effectiveness between 0 and 1 – What is the best we can do with $k$ documents? $DCG@k = \frac{\sum_{i=1}^{k} gain(r_i)/discount(i)}{\sum_{i=1}^{k} gain(r_i^*)/discount(i)}$
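To make the normalization concrete, here is a minimal sketch in Python, assuming a log2 rank discount and a linear gain; the function names, the discount choice and the example relevance vector are illustrative rather than the exact definitions used in the deck.

```python
import math

def dcg_at_k(gains, k):
    """DCG@k with a log2 rank discount: sum_i gain_i / log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def user_oriented_dcg_at_k(gains, k, max_gain):
    """Normalize by the best score achievable with k documents:
    a ranking of k documents all at the maximum relevance level."""
    return dcg_at_k(gains, k) / dcg_at_k([max_gain] * k, k)

# Example: Broad scale (levels 0, 1, 2) with linear gain at cutoff 5
print(user_oriented_dcg_at_k([2, 1, 0, 2, 1], k=5, max_gain=2))
```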
  • 28. Recall-oriented measures $AP@k = \frac{1}{|\mathcal{R}_1|} \sum_{i=1}^{k} r_i \cdot P@i$ • $AP@k = 1$ only possible if $k \geq |\mathcal{R}_1|$ • Reformulate towards users – What is the best we can do with $k$ documents, regardless of the judgments in the ground truth? $AP@k = \frac{1}{k} \sum_{i=1}^{k} r_i \cdot P@i$
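A similarly minimal sketch of the user-oriented AP@k, assuming binary relevance labels; names and the example are only illustrative.

```python
def precision_at_i(rels, i):
    """Fraction of relevant documents among the top i results."""
    return sum(rels[:i]) / i

def user_oriented_ap_at_k(rels, k):
    """AP@k normalized by the cutoff k instead of |R|, so a ranking of k
    relevant documents scores 1 regardless of the size of the ground truth."""
    rels = rels[:k]
    return sum(r * precision_at_i(rels, i) for i, r in enumerate(rels, start=1)) / k

# Example: top 5 results with binary relevance 1, 1, 0, 1, 0
print(user_oriented_ap_at_k([1, 1, 0, 1, 0], k=5))   # 0.55
```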
  • 29. Ideal ranking $nDCG@k = \frac{\sum_{i=1}^{k} gain(r_i)/discount(i)}{\sum_{i=1}^{k} gain(ideal_i)/discount(i)}$ • If there is only one relevant, $nDCG@10 = 1$ even if we retrieve nine nonrelevants • Assume the ideal ranking has only excellent documents, with maximum relevance $nDCG@k = \frac{\sum_{i=1}^{k} gain(r_i)/discount(i)}{\sum_{i=1}^{k} gain(r_i^*)/discount(i)}$ • This is basically user-oriented $DCG@k$
  • 30. Audio Music Similarity • Song as input to system, audio signal • Retrieve songs musically similar to it, by content • Resembles traditional Ad Hoc retrieval in Text IR • (most?) Important task in Music IR – Music recommendation – Playlist generation – Plagiarism detection
  • 31. Measures • All reformulated, user-oriented – What is the best we can do under the user model? • Binary – P, AP, RR • Graded – CG, DCG, Q, RBP, ERR, GAP, ADR, EDCG – Linear and exponential gains
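For reference, one common way to write the two gain families; this is a sketch, and the exact gain functions used in the study may differ slightly.

```python
def linear_gain(rel, max_rel):
    """Linear gain: relevance levels mapped proportionally onto [0, 1]."""
    return rel / max_rel

def exponential_gain(rel, max_rel, base=2):
    """Exponential gain: emphasizes the top relevance levels."""
    return (base ** rel - 1) / (base ** max_rel - 1)

# On the Broad scale (0, 1, 2) the middle level gets 0.5 linearly, but only ~0.33 exponentially
print(linear_gain(1, 2), exponential_gain(1, 2))
```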
  • 32. Relevance scales • Originally used – Broad: 3 levels – Fine: 101 levels • Artificially made from the Fine scale – Graded with 3, 4 and 5 levels, evenly spaced – Binary, with thresholds equal to 20, 40, 60 and 80
  • 33. Measures and Scales
    Scale columns: Original (Broad, Fine) · Artificial Graded (n_ℒ = 3, n_ℒ = 4, n_ℒ = 5) · Artificial Binary (ℓ_min = 20, 40, 60, 80)
    P@5:      x x x x
    AP@5:     x x x x
    RR@5:     x x x x
    CG_l@5:   x x x x x  P@5 P@5 P@5 P@5
    CG_e@5:   x x x x  P@5 P@5 P@5 P@5
    DCG_l@5:  x x x x x x x x x
    DCG_e@5:  x x x x  DCG_l@5 DCG_l@5 DCG_l@5 DCG_l@5
    EDCG_l@5: x x x x x x x x x
    EDCG_e@5: x x x x  EDCG_l@5 EDCG_l@5 EDCG_l@5 EDCG_l@5
    Q_l@5:    x x x x x  AP@5 AP@5 AP@5 AP@5
    Q_e@5:    x x x x  AP@5 AP@5 AP@5 AP@5
    RBP_l@5:  x x x x x x x x x
    RBP_e@5:  x x x x  RBP_l@5 RBP_l@5 RBP_l@5 RBP_l@5
    ERR_l@5:  x x x x x x x x x
    ERR_e@5:  x x x x  ERR_l@5 ERR_l@5 ERR_l@5 ERR_l@5
    GAP@5:    x x x x x  AP@5 AP@5 AP@5 AP@5
    ADR@5:    x x x x x x x x
    (an x marks a measure/scale combination that is evaluated; a measure name in a cell means the combination reduces to that measure)
  • 35. Experimental Design: user preference (agrees or disagrees with effectiveness)
  • 37. What can we infer? • Preference: difference noticed by user – Positive: user agrees with evaluation – Negative: user disagrees with evaluation • Non-preference: difference not noticed by user – Good: both systems are satisfactory – Bad: both systems are not satisfactory
  • 38. Data • Queries, documents and judgments from MIREX – MIREX: TREC-like evaluation forum in Music IR • 4,115 unique and artificial examples – Covering full range of effectiveness • In 10 bins: [0, 0.1), [0.1, 0.2), …, [0.9, 1] – At least 200 examples per measure/scale/bin • 432 unique queries, 5,636 unique documents
  • 39. Collecting User Preferences • Crowdsourcing – Quality control through trap examples • Total: 547 unique subjects, 11,042 preferences • Accepted: 175 subjects, 9,373 preferences • After trap questions: 113 subjects
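A sketch of this kind of trap-based quality control; the data structures, field names and the 80% accuracy threshold are hypothetical, chosen only to illustrate the filtering step.

```python
def accept_workers(responses, traps, min_trap_accuracy=0.8):
    """responses: {worker_id: {example_id: answer}}; traps: {example_id: expected_answer}.
    Keep only workers who saw some trap examples and answered them well enough."""
    accepted = {}
    for worker, answers in responses.items():
        seen = [eid for eid in answers if eid in traps]
        if not seen:
            continue  # cannot judge this worker's quality
        accuracy = sum(answers[eid] == traps[eid] for eid in seen) / len(seen)
        if accuracy >= min_trap_accuracy:
            accepted[worker] = answers
    return accepted

# Illustrative use: worker "w2" fails both trap examples and is discarded
traps = {"t1": "A", "t2": "B"}
responses = {"w1": {"t1": "A", "t2": "B", "x9": "A"},
             "w2": {"t1": "B", "t2": "A", "x9": "B"}}
print(accept_workers(responses, traps).keys())
```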
  • 40. Single system: how good is it? • 2,045 non-preferences (49%) – 1,056 satisfactory – 969 non-satisfactory What do we expect?
  • 41. Single system: how good is it? • 2,045 non-preferences (49%) – 1,056 satisfactory – 969 non-satisfactory Linear mapping
  • 42. Single system: how good is it? Large thresholds underestimate satisfaction
  • 43. Single system: how good is it? Ranking does not affect satisfaction?
  • 44. Single system: how good is it? Exponential gain underestimates satisfaction
  • 45. Single system: how good is it? • Best adhere to the diagonal – CG_l@5, DCG_l@5 and RBP_l@5 – Not necessarily better: just easier to interpret • About 20% bias at endpoints – Room for improvement with personalization • Less sensitive to subjectivity in relevance – Minimize P(Sat | 0) and maximize P(Sat | 1) – ℓ_min = 40 and Broad behave better – CG@5, DCG@5, RBP@5 and GAP@5
  • 46. Two systems: which one is better? • 2,090 preferences (51%) – 1,019 for system A – 1,071 for system B What do we expect?
  • 47. Two systems: which one is better? • 2,090 preferences (51%) – 1,019 for system A – 1,071 for system B Users always notice the difference… …regardless of how large it is
  • 48. Two systems: which one is better? Need quite large differences!
  • 49. Two systems: which one is better? More relevance levels better to discriminate
  • 50. Two systems: which one is better? Bad correlation?
  • 51. Two systems: which one is better? • Users prefer the (supposedly) worse system
  • 52. User Agrees with Evaluation • Closer to the ideal $P(Agg = 1 \mid \Delta\lambda) = 1$ – ℓ_min = 80 better among binaries – Fine better for linear gain – n_ℒ = 5 better for exponential gain – CG@5, DCG@5, RBP@5 and GAP@5
  • 53. User Disagrees with Evaluation • Closer to the ideal $P(Agg = -1 \mid \Delta\lambda) = 0$ – ℓ_min = 40 better among binaries – Fine better for linear gain – Broad better with exponential gain – CG@5, GAP@5, DCG@5 and RBP@5
  • 54. Summary • Linear gain better than exponential gain – Except, slightly, in terms of disagreements • Measures oriented to a single document are not appropriate for a music recommendation setting • Gain is independent of other documents • 𝐵𝑟𝑜𝑎𝑑 better to predict satisfaction • 𝐹𝑖𝑛𝑒 better to predict user agreement • Binary scales worst overall
  • 55. Summary • We can map system effectiveness onto probability of user satisfaction • ~20% of users disagree with effectiveness – Practical upper (and lower) bound in evaluation – Need to incorporate user profiles • Somehow included in MSD Challenge • Δ𝜆 ≈ 0.4 needed for users to agree – Historically observed only 20% of times in MIREX – Be careful with statistical significance!
  • 57. User Satisfaction • So far only for a query and a user (Bernoulli) – $P(Sat \mid \lambda_q)$ • Easily for $n$ users (Binomial) – $P(Sat_n = k \mid \lambda_q)$ • Example: $Q_l@5 = 0.61$ – $P(Sat) \approx 0.7$ – $P(Sat_{15} = 10) \approx 0.21$ • What about a sample of queries $\mathcal{Q}$?
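The Bernoulli/Binomial arithmetic can be checked directly; below is a sketch with scipy, taking the slide's P(Sat) ≈ 0.7 as given (the effectiveness-to-satisfaction mapping itself comes from the user study).

```python
from scipy.stats import binom

p_sat = 0.7   # P(Sat | lambda_q) for this query, as read off the mapping

# Probability that exactly 10 out of 15 users are satisfied
print(binom.pmf(10, n=15, p=p_sat))   # ~0.21, as on the slide

# Probability that more than 80% of 15 users (13 or more) are satisfied
print(binom.sf(12, n=15, p=p_sat))
```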
  • 58. User Satisfaction over a Sample $E[P(Sat)] = \frac{1}{n_\mathcal{Q}} \sum_{q \in \mathcal{Q}} P(Sat \mid \lambda_q)$ • Example: satisfaction is underestimated
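A sketch of why the two summaries differ: averaging the per-query satisfaction probabilities is not the same as applying the mapping to the average effectiveness score, unless the mapping is linear. The mapping and the scores below are made up purely for illustration.

```python
def sat_given_lambda(l):
    """Hypothetical non-linear effectiveness-to-satisfaction mapping."""
    return l ** 0.5

scores = [0.1, 0.4, 0.9, 0.7]   # per-query effectiveness (illustrative)

per_query = [sat_given_lambda(l) for l in scores]
print(sum(per_query) / len(per_query))              # E[P(Sat)] over the query sample
print(sat_given_lambda(sum(scores) / len(scores)))  # mapping applied to mean effectiveness
```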
  • 59. System Success • If $P(Sat)$ reaches a threshold, the system is successful • If we want the majority of users to be satisfied – $P(Succ) = 1 - F_{P(Sat)}(0.5)$ • Intuition: improving bad queries is more worthwhile than further improving good ones
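A sketch of P(Succ) as one minus the empirical distribution function of the per-query satisfaction probabilities evaluated at 0.5; the per-query values are illustrative, and whether the 0.5 boundary itself counts as success is a minor convention choice.

```python
def success_probability(sat_probs, threshold=0.5):
    """P(Succ) = 1 - F_{P(Sat)}(threshold), estimated as the fraction of
    queries whose satisfaction probability exceeds the threshold."""
    return sum(p > threshold for p in sat_probs) / len(sat_probs)

# Illustrative per-query P(Sat | lambda_q) values
print(success_probability([0.2, 0.55, 0.8, 0.45, 0.9, 0.65]))   # 4/6
```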
  • 60. System Success • Example: – $E[\Delta\lambda] = -0.0021$
  • 61. System Success • Example: – $E[\Delta\lambda] = -0.0021$ – $E[\Delta P(Sat)] = 0.0011$
  • 62. System Success • Example: – $E[\Delta\lambda] = -0.0021$ – $E[\Delta P(Sat)] = 0.0011$ – $E[\Delta P(Succ)] = 0.07$
  • 63. Summary • Need to consider full distributions – Always average or good on average? • Modeling full distribution – Normal for small query sets, Empirical for large – Beta always better for 𝐹𝑖𝑛𝑒 scale
  • 64. Summary • Intuitive interpretations of effectiveness fail – Contradictory results in terms of user satisfaction
  • 66. Samples • Test collections are samples from larger, possibly infinite, populations – Documents, queries and users • Δ𝜆 is just an estimate of the population mean 𝜇Δ𝜆 • How reliable is our conclusion?
  • 67. Reliability vs Cost • Building reliable collections is easy • Just use more documents, queries and assessors • But it is prohibitively expensive • Best option is to increase query set size – Largest source of variability • How many queries? – First we need to measure reliability
  • 68. Data-based approach 1. Randomly split query set 2. Compute indicators of reliability based on these two query subsets 3. Extrapolate to larger query sets …with some variations
  • 69. Data-based reliability indicators • Compare results with two collections – Kendall tau correlation – AP correlation – Absolute sensitivity – Relative sensitivity – Power ratio – Minor conflict ratio – Major conflict ratio – RMSE
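A sketch of one trial of the data-based approach, with Kendall's tau as the indicator and a systems-by-queries matrix of scores; the toy numbers are illustrative, and the deck's procedure repeats such splits many times and over several indicators before extrapolating.

```python
import random
from scipy.stats import kendalltau

def mean_over(vals, queries):
    """Mean effectiveness over a subset of query indices."""
    return sum(vals[q] for q in queries) / len(queries)

def split_half_tau(scores, seed=0):
    """scores: dict system -> list of per-query effectiveness (same query order).
    Randomly split the queries into two halves and correlate the system
    rankings produced by each half with Kendall's tau."""
    n_q = len(next(iter(scores.values())))
    queries = list(range(n_q))
    random.Random(seed).shuffle(queries)
    half_a, half_b = queries[:n_q // 2], queries[n_q // 2:]
    rank_a = [mean_over(vals, half_a) for vals in scores.values()]
    rank_b = [mean_over(vals, half_b) for vals in scores.values()]
    tau, _ = kendalltau(rank_a, rank_b)
    return tau

# Toy data: 3 systems evaluated on 8 queries (numbers purely illustrative)
scores = {"sysA": [0.2, 0.4, 0.5, 0.3, 0.6, 0.1, 0.4, 0.5],
          "sysB": [0.3, 0.5, 0.4, 0.4, 0.7, 0.2, 0.5, 0.6],
          "sysC": [0.1, 0.2, 0.3, 0.2, 0.4, 0.1, 0.2, 0.3]}
print(split_half_tau(scores))
```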
  • 70. Generalizability Theory approach • Address variability of scores, not just means • G-study – Estimate variance components from previous, representative data – Usually previous test collections • D-study – Estimate reliability based on estimated variance components from G-study
  • 71. G-study $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$ • Estimated with Analysis of Variance
  • 72. G-study $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$ • Estimated with Analysis of Variance – system differences, our goal!
  • 73. G-study $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$ • Estimated with Analysis of Variance – system differences, our goal! – query difficulty
  • 74. G-study $\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$ • Estimated with Analysis of Variance – system differences, our goal! – query difficulty – some systems better for some queries
  • 75. D-study • Relative stability: $E\rho^2 = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_{s:q}^2 / n_q'}$ • Absolute stability: $\Phi = \frac{\sigma_s^2}{\sigma_s^2 + (\sigma_q^2 + \sigma_{s:q}^2) / n_q'}$ • Easy to estimate how many queries we need to reach a certain stability level (1MQ track) – ≈80 queries sufficient for stable rankings – ≈130 queries for stable absolute scores
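A compact sketch of the G-study and D-study computations, assuming a fully crossed systems × queries design with one observation per cell and the usual ANOVA estimators of the variance components (negative estimates clipped to zero); the toy matrix is illustrative, and a real analysis would use something like the GT4IReval package mentioned later in the deck.

```python
import numpy as np

def g_study(scores):
    """scores: n_systems x n_queries matrix (fully crossed, one observation per cell).
    Returns estimated variance components (var_s, var_q, var_sq); the
    system:query interaction is confounded with residual error."""
    n_s, n_q = scores.shape
    grand = scores.mean()
    sys_means = scores.mean(axis=1)
    query_means = scores.mean(axis=0)
    ms_s = n_q * np.sum((sys_means - grand) ** 2) / (n_s - 1)
    ms_q = n_s * np.sum((query_means - grand) ** 2) / (n_q - 1)
    resid = scores - sys_means[:, None] - query_means[None, :] + grand
    ms_sq = np.sum(resid ** 2) / ((n_s - 1) * (n_q - 1))
    var_sq = ms_sq
    var_s = max((ms_s - ms_sq) / n_q, 0.0)
    var_q = max((ms_q - ms_sq) / n_s, 0.0)
    return var_s, var_q, var_sq

def d_study(var_s, var_q, var_sq, n_q_prime):
    """Relative (E rho^2) and absolute (Phi) stability for n_q_prime queries."""
    erho2 = var_s / (var_s + var_sq / n_q_prime)
    phi = var_s / (var_s + (var_q + var_sq) / n_q_prime)
    return erho2, phi

def queries_for_target(var_s, var_q, var_sq, target=0.95, absolute=False):
    """Smallest n_q' at which the chosen stability indicator reaches the target."""
    noise = (var_q + var_sq) if absolute else var_sq
    return int(np.ceil(target / (1 - target) * noise / var_s))

# Toy example: 3 systems x 4 queries (numbers purely illustrative)
scores = np.array([[0.30, 0.55, 0.20, 0.45],
                   [0.35, 0.60, 0.30, 0.50],
                   [0.20, 0.40, 0.15, 0.35]])
var_s, var_q, var_sq = g_study(scores)
print(d_study(var_s, var_q, var_sq, n_q_prime=4))
print(queries_for_target(var_s, var_q, var_sq),                  # queries for E rho^2 = 0.95
      queries_for_target(var_s, var_q, var_sq, absolute=True))   # queries for Phi = 0.95
```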
  • 76. G-Theory approach • How sensitive is the D-study to the initial data used in the G-study? • How should we interpret G-Theory indicators in practice? What does 𝐸𝜌2 = 0.95 mean? • From the above, review reliability of over 40 TREC test collections
  • 77. Data • 43 TREC collections – From TREC 3 to TREC 2011 • 12 tasks across 10 tracks – Ad hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million Query, Medical and Microblog
  • 78. Sensitivity: experiment • Vary number of queries in G-study – From 𝑛 𝑞 = 5 to full set – Use all runs available • Run D-study – Compute 𝐸𝜌2 and Φ – Compute 𝑛 𝑞 ′ to reach 0.95 stability • 200 random trials
  • 80. Variability due to queries: we may get Eρ² = 0.9 or Eρ² = 0.3, depending on which 10 queries we use
  • 81. Sensitivity: experiment • Do the same, but vary number of systems – From 𝑛 𝑠 = 5 to full set – Use all queries available • 200 random trials
  • 83. Variability due to systems: we may get Eρ² = 0.9 or Eρ² = 0.5, depending on which 20 systems we use
  • 84. Results • G-Theory is very sensitive to initial data – Need about 50 queries and 50 systems for differences in Eρ² and Φ below 0.1 • The number of queries needed for Eρ² = 0.95 may vary by orders of magnitude – Microblog2011 (all 184 systems and 30 queries) • Need 63 to 133 queries – Medical2011 (all 34 queries and 40 systems) • Need 109 to 566 queries
  • 87. Compute confidence intervals: account for variability in the initial data
  • 88. Compute confidence intervals: required number of queries to reach the lower end of the interval
  • 89. Summary in TREC • 𝐸𝜌2 : mean=0.88 sd=0.1 – 95% conf. intervals are 0.1 long • Φ: mean=0.74 sd=0.2 – 95% conf. intervals are 0.19 long
  • 90. Interpretation: experiment • Split query set in 2 subsets – From 𝑛 𝑞 = 10 to full set / 2 – Use all runs available • Run D-study – Compute 𝐸𝜌2 and Φ and map onto 𝜏, sensitivity, power, conflicts, etc. • 50 random trials – Over 28,000 datapoints
  • 91. Example: Eρ² → τ (*all mappings in the paper)
  • 92. Example: Eρ² → τ – Eρ² = 0.95 → τ ≈ 0.85 (*all mappings in the paper)
  • 93. Example: Eρ² → τ – τ = 0.9 → Eρ² ≈ 0.97 (*all mappings in the paper)
  • 94. Example: Eρ² → τ – Million Query 2007 vs. Million Query 2008 (*all mappings in the paper)
  • 95. Future predictions • This allows us to make more informed decisions within a collection • What about a new collection? – Fit a single model for each mapping with 90% and 95% prediction intervals • Assess whether a larger collection is really worth the effort
  • 96. Example: Eρ² → τ (*all mappings in the paper)
  • 97. Example: Eρ² → τ – current collection (*all mappings in the paper)
  • 98. Example: Eρ² → τ – current collection and target (*all mappings in the paper)
  • 99. Example: Φ → rel. sensitivity
  • 100. Example: Φ → rel. sensitivity
  • 101. Summary • G-Theory is regarded as more appropriate, easier to use and more powerful for assessing reliability than the traditional data-based approaches • But it is quite sensitive to the initial data used to estimate variance components – Data-based approaches are too! • …and almost impossible to interpret in practice
  • 102. Summary • Need about 50 queries and 50 systems to have robust estimates of reliability – That is a whole collection already! – Need to use confidence intervals • Previous interpretation overestimated reliability – 𝜏 = 0.9 → 𝐸𝜌2 ≈ 0.97 – 𝐸𝜌2 = 0.95 → 𝜏 ≈ 0.85
  • 104. Outline • Estimate 𝐸𝜌2 and Φ, with 95% confidence intervals, and full query set • Map onto 𝜏, sensitivity, power, conflicts, etc. • Results within tasks offer a historical perspective on reliability since 1994
  • 105. Example: Ad hoc 3-8 (*all collections and mappings in the paper) • Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.81] • minor conflicts ∈ [0.6, 8.2]% • major conflicts ∈ [0.02, 1.38]% • Queries to get Eρ² = 0.95: [37, 233] • Queries to get Φ = 0.95: [116, 999] • 50 queries were used
  • 106. Example: Web ad hoc • TREC-8 to TREC-2001: WT2g and WT10g – Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.81] – Queries to get Eρ² = 0.95: [40, 220] • TREC-2009 to TREC-2011: ClueWeb09 – Eρ² ∈ [0.80, 0.83] → τ ∈ [0.53, 0.59] – Queries to get Eρ² = 0.95: [107, 438] • 50 queries were used
  • 107. Historical trend • Decreasing within and across tracks?
  • 108. Historical trend • Systems getting better for specific problems?
  • 109. Historical trend • Increasing task-specificity in queries?
  • 110. Historical reliability in TREC • On average, 𝐸𝜌2 = 0.88 → 𝜏 ≈ 0.7 • Some collections clearly unreliable – Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011 • 50 queries not enough for stable rankings, about 200 are needed in most cases
  • 111. Implications • Fixing a minimum number of queries across tracks is unrealistic – Not even across editions of the same task • Need to analyze on a case-by-case basis, while building the collections – GT4IReval, R package online
  • 113. Validity • Similar studies in Text IR to map effectiveness onto user satisfaction • Particularly interesting because there are several query types, and users behave differently – Single measure to use in all cases? – Use different measures and average them all? • Further user studies to figure out what makes users say good and better • How should test collections be extended to incorporate more user information?
  • 114. Reliability • Study assessor effect • Study document collection effect • Better models to map G-theory indicators onto understandable data-based indicators • Methods to reliably measure reliability while building the collection
  • 116. General • Cleverdon, C. W. (1991). The Significance of the Cranfield Tests on Index Languages. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 3–12). • Sanderson, M. (2010). Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval, 4(4), 247–375. • Robertson, S. (2008). On the History of Evaluation in IR. Journal of Information Science, 34(4), 439–456. • Harman, D. K. (2011). Information Retrieval Evaluation. Synthesis Lectures on Information Concepts, Retrieval, and Services, 3(2), 1–119. • Voorhees, E. M. (2002). The Philosophy of Information Retrieval Evaluation. In Workshop of the Cross-Language Evaluation Forum (pp. 355–370). • Tague-Sutcliffe, J. (1992). The Pragmatics of Information Retrieval Experimentation, Revisited. Information Processing and Management, 28(4), 467–490. • Gull, C. D. (1956). Seven Years of Work on the Organisation of Materials in a Special Library. American Documentation, 7(4), 320–329. • Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in Music Information Retrieval. Journal of Intelligent Information Systems. • Urbano, J. (2013). Evaluation in Audio Music Similarity. PhD dissertation, University Carlos III of Madrid. • Trochim, W. M. K., & Donnelly, J. P. (2007). The Research Methods Knowledge Base (3rd ed.). Atomic Dog Publishing. • Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton-Mifflin. • Zobel, J., Webber, W., Sanderson, M., & Moffat, A. (2011). Principles for Robust Evaluation Infrastructure. In ACM CIKM Workshop on Data infrastructures for Supporting Information Retrieval Evaluation.
  • 117. Validity • Allan, J., Carterette, B., & Lewis, J. (2005). When Will Information Retrieval Be “Good Enough”? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 433–440). • Al-Maskari, A., Sanderson, M., & Clough, P. (2007). The Relationship between IR Effectiveness Measures and User Satisfaction. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 773–774). • Al-Maskari, A., Sanderson, M., Clough, P., & Airio, E. (2008). The Good and the Bad System: Does the Test Collection Predict User’s Effectiveness. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 59–66). • Bailey, P., Craswell, N., Soboroff, I., Thomas, P., Vries, A. P. de, & Yilmaz, E. (2008). Relevance Assessment: Are Judges Exchangeable and Does it Matter? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 667–674). • Bennett, P. N., Carterette, B., Chapelle, O., & Joachims, T. (2008). Beyond Binary Relevance: Preferences, Diversity and Set-Level Judgments. ACM SIGIR Forum, 42(2), 53–58. • Carterette, B. (2011). System Effectiveness, User Models, and User Utility: A General Framework for Investigation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 903–912). • Carterette, B., Bennett, P. N., Chickering, D. M., & Dumais, S. T. (2008). Here or There: Preference Judgments for Relevance. In European Conference on Information Retrieval (pp. 16–27). • Carterette, B., & Soboroff, I. (2010). The Effect of Assessor Error on IR System Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 539–546). • Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., & Olson, D. (2000). Do Batch and User Evaluations Give the Same Results? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 17–24).
  • 118. Validity • Hersh, W., Turpin, A., Sacherek, L., Olson, D., Price, S., Chan, B., & Kraemer, D. (2000). Further Analysis of Whether Batch and User Evaluations Give the Same Results With a Question-Answering Task. In Text REtrieval Conference. • Hu, X., & Kando, N. (2012). User-Centered Measures vs. System Effectiveness in Finding Similar Songs. In International Society for Music Information Retrieval Conference (pp. 331–336). • Huffman, S. B., & Hochster, M. (2007). How Well does Result Relevance Predict Session Satisfaction? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 567–573). • Ingwersen, P., & Järvelin, K. (2005). The Turn: Integration of Information Seeking and Retrieval in Context. Springer. • Järvelin, K. (2011). IR Research: Systems, Interaction, Evaluation and Theories. ACM SIGIR Forum, 45(2), 17–31. • Mizzaro, S. (1997). Relevance: The Whole History. Journal of the American Society for Information Science, 48(9), 810–832. • Sanderson, M., Paramita, M. L., Clough, P., & Kanoulas, E. (2010). Do User Preferences and Evaluation Measures Line Up? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 555– 562). • Schedl, M., Flexer, A., & Urbano, J. (2013). The Neglected User in Music Information Retrieval Research. Journal of Intelligent Information Systems. • Schedl, M., Stober, S., Gómez, E., Orio, N., & Liem, C. C. S. (2012). User-Aware Music Retrieval. In M. Müller, M. Goto, & M. Schedl (Eds.), Multimodal Music Processing (pp. 135–156). Dagstuhl Publishing. • Scholer, F., & Turpin, A. (2008). Relevance Thresholds in System Evaluations. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 693–694).
  • 119. Validity • Smucker, M. D., & Clarke, C. L. A. (2012). The Fault, Dear Researchers, is Not in Cranfield, But in Our Metrics, that They Are Unrealistic. In European Workshop on Human-Computer Interaction and Information Retrieval (pp. 11– 12). • Thom, J. A., & Scholer, F. (2007). A Comparison of Evaluation Measures Given How Users Perform on Search Tasks. In Australasian Document Computing Symposium (pp. 100–103). • Turpin, A., & Hersh, W. (2001). Why Batch and User Evaluations Do Not Give the Same Results. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 225–231). • Turpin, A., & Hersh, W. (2002). User Interface Effects in Past Batch Versus User Experiments. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 431–432). • Turpin, A., & Scholer, F. (2006). User Performance Versus Precision Measures for Simple Search Tasks. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 11–18). • Urbano, J., Downie, J. S., Mcfee, B., & Schedl, M. (2012). How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval. In International Society for Music Information Retrieval Conference (pp. 181–186).
  • 120. Reliability • Allan, J., Aslam, J. A., Carterette, B., Pavlu, V., & Kanoulas, E. (2008). Million Query Track 2008 Overview. In Text REtrieval Conference. • Allan, J., Carterette, B., Aslam, J. A., Pavlu, V., Dachev, B., & Kanoulas, E. (2007). Million Query Track 2007 Overview. In Text REtrieval Conference. • Armstrong, T. G., Moffat, A., Webber, W., & Zobel, J. (2009). Improvements that Don’t Add Up: Ad-Hoc Retrieval Results since 1998. In ACM International Conference on Information and Knowledge Management (pp. 601–610). • Banks, D., Over, P., & Zhang, N.-F. (1999). Blind Men and Elephants: Six Approaches to TREC data. Information Retrieval, 1(1-2), 7–34. • Bodoff, D. (2008). Test Theory for Evaluating Reliability of IR Test Collections. Information Processing and Management, 44(3), 1117–1145. • Bodoff, D., & Li, P. (2007). Test Theory for Assessing IR Test Collections. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 367–374). • Brennan, R. L. (2001). Generalizability Theory. Springer. • Buckley, C., & Voorhees, E. M. (2000). Evaluating Evaluation Measure Stability. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 33–34). • Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009). Million Query Track 2009 Overview. In Text REtrieval Conference. • Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2008). Evaluation Over Thousands of Queries. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 651–658). • Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2009). If I Had a Million Queries. In European Conference on Information Retrieval (pp. 288–300). • Lin, W.-H., & Hauptmann, A. (2005). Revisiting the Effect of Topic Set Size on Retrieval Error. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 637–638).
  • 121. Reliability • Cormack, G. V., & Lynam, T. R. (2006). Statistical Precision of Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 533–540). • Robertson, S., & Kanoulas, E. (2012). On Per-Topic Variance in IR Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 891–900). • Sakai, T. (2007). On the Reliability of Information Retrieval Metrics Based on Graded Relevance. Information Processing and Management, 43(2), 531–548. • Sanderson, M., & Zobel, J. (2005). Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 162–169). • Sanderson, M., Turpin, A., Zhang, Y., & Scholer, F. (2012). Differences in Effectiveness Across Sub-collections. In ACM International Conference on Information and Knowledge Management (pp. 1965–1969). • Shavelson, R. J., & Webb, N. M. (1991). Generalizability Theory: A Primer. Sage Publications. • Smucker, M. D., Allan, J., & Carterette, B. (2007). A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In ACM International Conference on Information and Knowledge Management (pp. 623– 632). • Urbano, J., Marrero, M., & Martín, D. (2013). A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 925–928). • Urbano, J., Marrero, M., & Martín, D. (2013). On the Measurement of Test Collection Reliability. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 393–402). • Voorhees, E. M. (2000). Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. Information Processing and Management, 36(5), 697–716. • Voorhees, E. M. (2009). Topic Set Size Redux. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 806–807).
  • 122. Reliability • Voorhees, E. M., & Buckley, C. (2002). The Effect of Topic Set Size on Retrieval Experiment Error. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 316–323). • Webber, W., Moffat, A., & Zobel, J. (2008). Statistical Power in Retrieval Experimentation. In ACM International Conference on Information and Knowledge Management (pp. 571–580). • Yilmaz, E., Aslam, J. A., & Robertson, S. (2008). A New Rank Correlation Coefficient for Information Retrieval. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 587–594). • Zobel, J. (1998). How Reliable are the Results of Large-Scale Information Retrieval Experiments? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 307–314).