The two questions
• How good is my system?
– What does good mean?
– What is good enough?
• Is system A better than system B?
– What does better mean?
– How much better?
• Efficiency? Effectiveness? Ease?
Measure user experience
• Time to complete task
• Idle time
• Success rate
• Failure rate
• Frustration
• Ease of learning
• Ease of use
…and a long etcetera
We want to know some distributions
• For an arbitrary user, need and document
collection, what is the distribution of:
• These distributions describe the user experience fully
[Figures: distributions of time to complete task and of frustration (none / some / much)]
The big(ger) picture
• Different user-measures attempting to assess the
same thing: user satisfaction
– How likely is it that an arbitrary user, with an arbitrary
need (and with an arbitrary document collection) will
be satisfied by the system?
• This is the ultimate goal: the good, the better
The big(ger) question
• User satisfaction…as Bernoulli trial
• Probability of satisfaction?
• Probability that k in n users are satisfied?
• Probability of >80% users satisfied?
[Figure: distribution of satisfaction as a yes/no (Bernoulli) outcome]
Sources of variability
user-measure = f(documents, need, user, system)
• Try to estimate the user-measure distribution
– Sample documents, needs and users
– Problematic
• Representativeness
• Cost
• Ethics
– Hard to replicate and repeat results
Fix samples
• Get a (hopefully) good sample and fix it
– Document collection
– Topic set
– A step towards reproducibility
• Still have to sample users, but can’t fix them!
– Very large source of variability
– Hard to replicate and repeat experiments
– Complex, costly, ethical issues
– Example: ASTIA-Uniterm studies
Simulate users…and fix them
• Cleverdon’s idea: remove users, but include a
static user component, fixed across experiments
– The judgments in the ground truth
• Remove all sources of variability, except systems
user-measure = f(documents, need, user, system)
user-measure = f(system)
Test collections
user-measure = f(system)
• Test collections are tools to estimate
distributions of user-measures
– Reproducibility becomes possible and easy
– Experiments are inexpensive (collections are not)
– Research becomes systematic
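To make the reduction to user-measure = f(system) concrete, here is a minimal sketch (not part of the original slides) of evaluation against a frozen test collection: the qrels and the cutoff are illustrative, and only the system's ranking varies.

```python
# Minimal sketch: with documents, topics and judgments fixed by the test
# collection, effectiveness depends only on the ranking a system produces.

# Frozen ground truth: topic -> {doc_id: relevance} (illustrative values).
QRELS = {
    "q1": {"d1": 1, "d3": 1, "d7": 0},
}

def precision_at_k(ranking, topic, k=5):
    """P@k for one topic, computed purely from the fixed judgments."""
    judged = QRELS[topic]
    relevant_retrieved = sum(1 for d in ranking[:k] if judged.get(d, 0) > 0)
    return relevant_retrieved / k

# Two hypothetical systems differ only in the ranking they return.
print(precision_at_k(["d1", "d3", "d2", "d7", "d9"], "q1"))  # 0.4
print(precision_at_k(["d2", "d9", "d7", "d8", "d1"], "q1"))  # 0.2
```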
Wait a minute
• Are we estimating distributions about users or
distributions about systems?
system-effectiveness = f(system, measure)
• We come up with different distributions of
system-effectiveness, one per measure
• Each measure has its own assumptions
Assumption
• System-measures correspond to user-measures
– Users: time to complete task, idle time, success rate, failure rate, frustration, ease of learning, ease of use, satisfaction, …
– Systems: P, AP, RR, DCG, nDCG, ERR, GAP, Q, …
Assumption
• Well, at least we assume the correlation
– Are they correlated? How well?
• Test collections: estimators of user distributions
– What we want to measure: user satisfaction
– What we do measure: system effectiveness
Validity and Reliability
• Validity: are we measuring what we want to?
– External validity:
Are topics, documents and assessors representative?
– Construct validity:
Do system-measures correspond to user-measures?
– Conclusion validity:
Is system A really better than system B?
• Reliability: how repeatable are the results?
– How large do collections have to be to ensure
repeatability with a different sample?
Assumption
• Systems with better effectiveness are perceived
by users as more useful, more satisfactory
• Tricky: different effectiveness measures and
relevance scales give different results
– Which one is better to predict satisfaction?
• The goal is user satisfaction, not system
effectiveness
Mapping
• Try to map system effectiveness onto user
satisfaction, experimentally
• If P@10 = 0.2, how likely is it that the user will
find the results satisfactory?
• What if DCG@20 = 0.467?
• What if ERR = 0.9?
User-oriented System-measures
• Effectiveness measures are generally not
formulated to correlate with user-satisfaction
• If effectiveness is 0, we expect 0% probability of
user satisfaction
• If effectiveness is 1, we expect 100% probability
• If effectiveness is 𝜆, we expect 100𝜆%
• But this is not what we have
Unbounded measures
$DCG@k = \sum_{i=1}^{k} \frac{gain(r_i)}{discount(i)}$
• Upper bound depends on cutoff, gain function
and relevance scale
– Normalize effectiveness between 0 and 1
– What is the best we can do with 𝑘 documents?
$DCG@k = \frac{\sum_{i=1}^{k} gain(r_i)/discount(i)}{\sum_{i=1}^{k} gain(r^*)/discount(i)}$
(where $r^*$ is the maximum relevance level)
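A minimal sketch of the normalization described above, assuming the usual linear gain and log2 discount (both illustrative): the observed DCG@k is divided by the best DCG@k achievable with k documents of maximum relevance.

```python
import math

def user_oriented_dcg_at_k(rels, k, max_rel,
                           gain=lambda r: r,
                           discount=lambda i: math.log2(i + 1)):
    """Observed DCG@k divided by the best DCG@k achievable with k documents
    of maximum relevance (illustrative linear gain and log2 discount)."""
    observed = sum(gain(r) / discount(i) for i, r in enumerate(rels[:k], start=1))
    best = sum(gain(max_rel) / discount(i) for i in range(1, k + 1))
    return observed / best

# Relevance of the top 5 results on an illustrative 0-2 scale.
print(user_oriented_dcg_at_k([2, 0, 1, 2, 0], k=5, max_rel=2))  # ~0.57
```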
Recall-oriented measures
$AP@k = \frac{1}{\mathcal{R}_1} \sum_{i=1}^{k} r_i \cdot P@i$
• $AP@k = 1$ is only possible if $k \ge \mathcal{R}_1$, the number of relevant documents in the ground truth
• Reformulate towards users
– What is the best we can do with 𝑘 documents,
regardless of the judgments in the ground truth?
$AP@k = \frac{1}{k} \sum_{i=1}^{k} r_i \cdot P@i$
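A sketch of the two normalizations, assuming binary relevance: the standard AP@k divides by the number of relevant documents in the ground truth, while the user-oriented version divides by k, the best achievable within the cutoff. The function and data are illustrative.

```python
def ap_at_k(rels, k, n_relevant=None):
    """AP@k over binary relevance labels. With n_relevant given, this is the
    recall-oriented normalization (number of relevant documents in the
    ground truth); without it, the user-oriented normalization by k."""
    score, hits = 0.0, 0
    for i, r in enumerate(rels[:k], start=1):
        if r:
            hits += 1
            score += hits / i      # r_i * P@i
    return score / (n_relevant if n_relevant is not None else k)

rels = [1, 1, 0, 1, 0]                     # illustrative top-5 judgments
print(ap_at_k(rels, k=5, n_relevant=10))   # 0.275: penalized by unseen relevants
print(ap_at_k(rels, k=5))                  # 0.55: normalized by what k allows
```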
Ideal ranking
$nDCG@k = \frac{\sum_{i=1}^{k} gain(r_i)/discount(i)}{\sum_{i=1}^{k} gain(ideal_i)/discount(i)}$
• If there is only one relevant, 𝑛𝐷𝐶𝐺@10 = 1
even if we retrieve nine nonrelevants
• Assume the ideal ranking has only excellent
documents, with maximum relevance
$nDCG@k = \frac{\sum_{i=1}^{k} gain(r_i)/discount(i)}{\sum_{i=1}^{k} gain(r^*)/discount(i)}$
• This is basically user-oriented 𝐷𝐶𝐺@𝑘
Audio Music Similarity
• A song (an audio signal) is given as input to the system
• Retrieve songs musically similar to it, by content
• Resembles traditional Ad Hoc retrieval in Text IR
• (most?) Important task in Music IR
– Music recommendation
– Playlist generation
– Plagiarism detection
Measures
• All reformulated, user-oriented
– What is the best we can do under the user model?
• Binary
– P, AP, RR
• Graded
– CG, DCG, Q, RBP, ERR, GAP, ADR, EDCG
– Linear and exponential gains
Relevance scales
• Originally used
– Broad: 3 levels
– Fine: 101 levels
• Artificially made from the Fine scale
– Graded with 3, 4 and 5 levels, evenly spaced
– Binary, with thresholds of 20, 40, 60 and 80 (see the sketch below)
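A minimal sketch of how such artificial scales could be derived from the 101-level Fine scale; the exact binning rules of the original study are not given here, so the helpers below are assumptions about one reasonable implementation.

```python
def to_graded(fine, n_levels):
    """Map a Fine score in [0, 100] to n_levels evenly spaced levels 0..n_levels-1.
    Assumed binning rule; the original study's exact rule may differ."""
    return min(int(fine // (100 / n_levels)), n_levels - 1)

def to_binary(fine, threshold):
    """Binary relevance: relevant iff the Fine score reaches the threshold."""
    return 1 if fine >= threshold else 0

print(to_graded(67, n_levels=5))    # 3 on a 0-4 scale
print(to_binary(67, threshold=80))  # 0 (not relevant at l_min = 80)
```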
Measures and Scales
• Rows: measures at cutoff 5. Columns: relevance scales
– Original: Broad, Fine
– Artificial graded: nℒ = 3, 4, 5
– Artificial binary: ℓmin = 20, 40, 60, 80
• An “x” marks a measure/scale combination that was evaluated; a cell naming another measure means the combination is equivalent to it (e.g., CGl@5 on a binary scale is equivalent to P@5)
Measure │ Broad │ Fine │ nℒ=3 │ nℒ=4 │ nℒ=5 │ ℓmin=20 │ ℓmin=40 │ ℓmin=60 │ ℓmin=80
𝑃@5 x x x x
𝐴𝑃@5 x x x x
𝑅𝑅@5 x x x x
𝐶𝐺𝑙@5 x x x x x 𝑃@5 𝑃@5 𝑃@5 𝑃@5
𝐶𝐺𝑒@5 x x x x 𝑃@5 𝑃@5 𝑃@5 𝑃@5
𝐷𝐶𝐺𝑙@5 x x x x x x x x x
𝐷𝐶𝐺𝑒@5 x x x x 𝐷𝐶𝐺𝑙@5 𝐷𝐶𝐺𝑙@5 𝐷𝐶𝐺𝑙@5 𝐷𝐶𝐺𝑙@5
𝐸𝐷𝐶𝐺𝑙@5 x x x x x x x x x
𝐸𝐷𝐶𝐺𝑒@5 x x x x 𝐸𝐷𝐶𝐺𝑙@5 𝐸𝐷𝐶𝐺𝑙@5 𝐸𝐷𝐶𝐺𝑙@5 𝐸𝐷𝐶𝐺𝑙@5
𝑄𝑙@5 x x x x x 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5
𝑄 𝑒@5 x x x x 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5
𝑅𝐵𝑃𝑙@5 x x x x x x x x x
𝑅𝐵𝑃𝑒@5 x x x x 𝑅𝐵𝑃𝑙@5 𝑅𝐵𝑃𝑙@5 𝑅𝐵𝑃𝑙@5 𝑅𝐵𝑃𝑙@5
𝐸𝑅𝑅𝑙@5 x x x x x x x x x
𝐸𝑅𝑅 𝑒@5 x x x x 𝐸𝑅𝑅𝑙@5 𝐸𝑅𝑅𝑙@5 𝐸𝑅𝑅𝑙@5 𝐸𝑅𝑅𝑙@5
𝐺𝐴𝑃@5 x x x x x 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5
𝐴𝐷𝑅@5 x x x x x x x x
What can we infer?
• Preference: difference noticed by user
– Positive: user agrees with evaluation
– Negative: user disagrees with evaluation
• Non-preference: difference not noticed by user
– Good: both systems are satisfactory
– Bad: both systems are not satisfactory
Data
• Queries, documents and judgments from MIREX
– MIREX: TREC-like evaluation forum in Music IR
• 4,115 unique and artificial examples
– Covering full range of effectiveness
• In 10 bins: [0, 0.1), [0.1, 0.2), …, [0.9, 1]
– At least 200 examples per measure/scale/bin
• 432 unique queries, 5,636 unique documents
Collecting User Preferences
• Crowdsourcing
– Quality control through trap examples
• Total: 547 unique subjects, 11,042 preferences
• Accepted: 175 subjects, 9,373 preferences
• After trap questions: 113 subjects
Single system: how good is it?
• 2,045 non-preferences (49%)
– 1,056 satisfactory
– 969 non-satisfactory
What do we expect?
[Figure: P(Sat | λ) for each measure; a linear mapping would follow the diagonal]
Single system: how good is it?
• Best adhere to the diagonal
– 𝐶𝐺𝑙@5, 𝐷𝐶𝐺𝑙@5 and 𝑅𝐵𝑃𝑙@5
– Not necessarily better: just easier to interpret
• About 20% bias at endpoints
– Room for improvement with personalization
• Less sensitive to subjectivity in relevance
– Minimize 𝑃(𝑆𝑎𝑡│0) and maximize 𝑃(𝑆𝑎𝑡│1)
– ℓ 𝑚𝑖𝑛 = 40 and 𝐵𝑟𝑜𝑎𝑑 behave better
– 𝐶𝐺@5, 𝐷𝐶𝐺@5, 𝑅𝐵𝑃@5 and 𝐺𝐴𝑃@5
Two systems: which one is better?
• 2,090 preferences (51%)
– 1,019 for system A
– 1,071 for system B
What do we expect?
• Expectation: users always notice the difference, regardless of how large it is
Two systems: which one is better?
• Users prefer the (supposedly) worse system
User Agrees with Evaluation
• Closer to the ideal P(Agg = 1 | Δλ) = 1
– ℓ 𝑚𝑖𝑛 = 80 better among binaries
– 𝐹𝑖𝑛𝑒 better for linear gain
– 𝑛ℒ = 5 better for exponential gain
– 𝐶𝐺@5, 𝐷𝐶𝐺@5, 𝑅𝐵𝑃@5 and 𝐺𝐴𝑃@5
User Disagrees with Evaluation
• Closer to the ideal P(Agg = −1 | Δλ) = 0
– ℓ 𝑚𝑖𝑛 = 40 better among binaries
– 𝐹𝑖𝑛𝑒 better for linear gain
– 𝐵𝑟𝑜𝑎𝑑 better with exponential gain
– 𝐶𝐺@5, 𝐺𝐴𝑃@5, 𝐷𝐶𝐺@5 and 𝑅𝐵𝑃@5
Summary
• Linear gain better than exponential gain
– Except, slightly, in terms of disagreements
• Measures oriented to a single document are not
appropriate for a music recommendation setting
• Gain is independent of other documents
• 𝐵𝑟𝑜𝑎𝑑 better to predict satisfaction
• 𝐹𝑖𝑛𝑒 better to predict user agreement
• Binary scales worst overall
Summary
• We can map system effectiveness onto
probability of user satisfaction
• ~20% of users disagree with effectiveness
– Practical upper (and lower) bound in evaluation
– Need to incorporate user profiles
• Somehow included in MSD Challenge
• Δ𝜆 ≈ 0.4 needed for users to agree
– Historically observed only 20% of the time in MIREX
– Be careful with statistical significance!
User Satisfaction
• So far only for a query and a user (Bernoulli)
– P(Sat | λ_q)
• Easily extended to n users (Binomial)
– P(Sat_n = k | λ_q)
• Example (see the sketch below): Q_l@5 = 0.61
– P(Sat) ≈ 0.7
– P(Sat_15 = 10) ≈ 0.21
• What about a sample of queries 𝒬?
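The example above can be reproduced as a Bernoulli/Binomial computation; in this sketch the value P(Sat) ≈ 0.7 for Q_l@5 = 0.61 is simply hard-coded, since it comes from the experimentally fitted mapping.

```python
from scipy.stats import binom

p_sat = 0.7   # P(Sat | Q_l@5 = 0.61), read off the experimentally fitted mapping

print(binom.pmf(10, 15, p_sat))      # P(Sat_15 = 10) ~ 0.21
print(1 - binom.cdf(12, 15, p_sat))  # P(more than 80% of 15 users satisfied)
```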
User Satisfaction over a Sample
$E[P(Sat)] = \frac{1}{n_{\mathcal{Q}}} \sum_{q \in \mathcal{Q}} P(Sat \mid \lambda_q)$
• Example: satisfaction is underestimated
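A sketch of the computation over a query sample: map each query's effectiveness onto P(Sat | λ_q) and average, instead of mapping the average effectiveness. The mapping function below is a made-up placeholder for the experimental curve; with a nonlinear mapping the two computations differ, which is the kind of under- or over-estimation the example refers to.

```python
def p_sat(effectiveness):
    """Made-up placeholder for the experimentally fitted P(Sat | lambda) curve."""
    return min(1.0, 0.2 + 0.8 * effectiveness ** 2)

scores = [0.2, 0.5, 0.9]                                 # per-query effectiveness lambda_q
expected = sum(p_sat(x) for x in scores) / len(scores)   # E[P(Sat)] over the sample
naive = p_sat(sum(scores) / len(scores))                 # mapping the average instead
print(expected, naive)   # the two differ because the mapping is nonlinear
```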
System Success
• If P(Sat) ≥ threshold, the system is successful
• If we want the majority of users to be satisfied
– P(Succ) = 1 − F_{P(Sat)}(0.5), where F_{P(Sat)} is the distribution of P(Sat) over queries
• Intuition: improving bad queries is worth more than further improving good ones
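Continuing the sketch: with per-query satisfaction probabilities in hand, P(Succ) can be estimated from their empirical distribution as the fraction of queries whose P(Sat) exceeds the 0.5 threshold. The data below are illustrative.

```python
# P(Sat | lambda_q) for a sample of queries (illustrative values).
per_query_sat = [0.35, 0.62, 0.71, 0.48, 0.90]

# P(Succ) = 1 - F_{P(Sat)}(0.5): fraction of queries whose satisfaction
# probability exceeds the 0.5 threshold, using the empirical distribution.
p_succ = sum(p > 0.5 for p in per_query_sat) / len(per_query_sat)
print(p_succ)  # 0.6
```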
Summary
• Need to consider full distributions
– Always average or good on average?
• Modeling full distribution
– Normal for small query sets, Empirical for large
– Beta always better for 𝐹𝑖𝑛𝑒 scale
Samples
• Test collections are samples from larger, possibly
infinite, populations
– Documents, queries and users
• Δ𝜆 is just an estimate of the population mean 𝜇Δ𝜆
• How reliable is our conclusion?
Reliability vs Cost
• Building reliable collections is easy
• Just use more documents, queries and assessors
• But it is prohibitively expensive
• Best option is to increase query set size
– Largest source of variability
• How many queries?
– First we need to measure reliability
Data-based approach
1. Randomly split query set
2. Compute indicators of reliability based on
these two query subsets
3. Extrapolate to larger query sets
…with some variations
Data-based reliability indicators
• Compare results with two collections
– Kendall tau correlation
– AP correlation
– Absolute sensitivity
– Relative sensitivity
– Power ratio
– Minor conflict ratio
– Major conflict ratio
– RMSE
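A minimal sketch of the first indicator in the list: score the systems on two random halves of the query set and correlate the two rankings with Kendall's tau (kendalltau from SciPy); the score matrix is randomly generated for illustration.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
scores = rng.random((20, 50))        # illustrative: 20 systems x 50 queries

# Randomly split the query set into two halves and rank systems on each half.
perm = rng.permutation(scores.shape[1])
mean_a = scores[:, perm[:25]].mean(axis=1)
mean_b = scores[:, perm[25:]].mean(axis=1)

tau, _ = kendalltau(mean_a, mean_b)  # agreement between the two system rankings
print(tau)
```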
Generalizability Theory approach
• Address variability of scores, not just means
• G-study
– Estimate variance components from previous,
representative data
– Usually previous test collections
• D-study
– Estimate reliability based on estimated variance
components from G-study
G-study
$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$
• Estimated with Analysis of Variance
– $\sigma_s^2$: system differences, our goal!
– $\sigma_q^2$: query difficulty
– $\sigma_{s:q}^2$: some systems better for some queries
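A sketch of the G-study for a fully crossed system × query design with one observation per cell: the three variance components are estimated from the ANOVA mean squares. The score matrix is illustrative; this is one standard way to estimate the components, not necessarily the exact procedure used in the original studies.

```python
import numpy as np

def g_study(scores):
    """Variance components for a crossed system x query design.
    scores: (n_systems, n_queries) matrix of effectiveness values."""
    n_s, n_q = scores.shape
    grand = scores.mean()
    ms_s = n_q * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_s - 1)
    ms_q = n_s * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_q - 1)
    resid = scores - scores.mean(axis=1, keepdims=True) \
                   - scores.mean(axis=0, keepdims=True) + grand
    ms_sq = (resid ** 2).sum() / ((n_s - 1) * (n_q - 1))
    var_sq = ms_sq                        # system:query interaction (plus error)
    var_s = max((ms_s - ms_sq) / n_q, 0)  # system differences, our goal
    var_q = max((ms_q - ms_sq) / n_s, 0)  # query difficulty
    return var_s, var_q, var_sq

scores = np.random.default_rng(1).random((50, 50))  # illustrative score matrix
print(g_study(scores))
```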
D-study
• Relative stability: $E\rho^2 = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_{s:q}^2 / n_q'}$
• Absolute stability: $\Phi = \frac{\sigma_s^2}{\sigma_s^2 + (\sigma_q^2 + \sigma_{s:q}^2) / n_q'}$
• Easy to estimate how many queries we need to
reach a certain stability level (1MQ track)
– ≈80 queries sufficient for stable rankings
– ≈130 queries for stable absolute scores
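And a sketch of the corresponding D-study: plug the estimated variance components into Eρ² and Φ for a hypothetical number of queries n_q′, and invert the formulas to find the n_q′ that reaches a target stability. The variance components below are illustrative.

```python
import math

def d_study(var_s, var_q, var_sq, n_q):
    """Relative (E rho^2) and absolute (Phi) stability for n_q queries."""
    e_rho2 = var_s / (var_s + var_sq / n_q)
    phi = var_s / (var_s + (var_q + var_sq) / n_q)
    return e_rho2, phi

def queries_needed(var_s, var_q, var_sq, target=0.95):
    """Smallest n_q' reaching the target stability, for each indicator."""
    ratio = target / (1 - target)
    n_relative = math.ceil(ratio * var_sq / var_s)
    n_absolute = math.ceil(ratio * (var_q + var_sq) / var_s)
    return n_relative, n_absolute

# Illustrative variance components (e.g., as estimated by g_study above).
print(d_study(0.02, 0.05, 0.08, n_q=50))
print(queries_needed(0.02, 0.05, 0.08))   # ~76 and ~124 queries
```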
G-Theory approach
• How sensitive is the D-study to the initial data
used in the G-study?
• How should we interpret G-Theory indicators in practice? What does Eρ² = 0.95 mean?
• From the above, review reliability of over 40
TREC test collections
Data
• 43 TREC collections
– From TREC 3 to TREC 2011
• 12 tasks across 10 tracks
– Ad hoc, Web, Novelty, Genomics, Robust, Terabyte,
Enterprise, Million Query, Medical and Microblog
Sensitivity: experiment
• Vary number of queries in G-study
– From n_q = 5 to the full set
– Use all runs available
• Run D-study
– Compute Eρ² and Φ
– Compute n_q′ needed to reach 0.95 stability
• 200 random trials
• [Figure: variability due to systems: we may get Eρ² = 0.9 or Eρ² = 0.5, depending on which 20 systems we use]
Results
• G-Theory is very sensitive to initial data
– Need about 50 queries and 50 systems for differences in Eρ² and Φ below 0.1
• The number of queries needed for Eρ² = 0.95 may change by orders of magnitude
– Microblog2011 (all 184 systems and 30 queries)
• Need 63 to 133 queries
– Medical2011 (all 34 queries and 40 systems)
• Need 109 to 566 queries
Summary in TREC
• Eρ²: mean = 0.88, sd = 0.1
– 95% conf. intervals are 0.1 long
• Φ: mean = 0.74, sd = 0.2
– 95% conf. intervals are 0.19 long
Interpretation: experiment
• Split the query set into 2 subsets
– From n_q = 10 to half the full set
– Use all runs available
• Run D-study
– Compute Eρ² and Φ and map onto τ, sensitivity, power, conflicts, etc.
• 50 random trials
– Over 28,000 datapoints
Future predictions
• This allows us to make more informed decisions
within a collection
• What about a new collection?
– Fit a single model for each mapping with 90% and
95% prediction intervals
• Assess whether a larger collection is really worth
the effort
Summary
• G-Theory is regarded as more appropriate, easier to use and more powerful than the traditional data-based approaches to assess reliability
• But it is quite sensitive to initial data used to
estimate variance components
– Data-based approaches are too!
• …and almost impossible to interpret in practice
Summary
• Need about 50 queries and 50 systems to have
robust estimates of reliability
– That is a whole collection already!
– Need to use confidence intervals
• Previous interpretation overestimated reliability
– τ = 0.9 → Eρ² ≈ 0.97
– Eρ² = 0.95 → τ ≈ 0.85
Outline
• Estimate Eρ² and Φ, with 95% confidence intervals, using the full query set
• Map onto 𝜏, sensitivity, power, conflicts, etc.
• Results within tasks offer a historical perspective
on reliability since 1994
*All collections and mappings in the paper
Example: Ad hoc 3-8
• Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.81]
• Minor conflicts ∈ [0.6, 8.2]%
• Major conflicts ∈ [0.02, 1.38]%
• Queries to get Eρ² = 0.95: [37, 233]
• Queries to get Φ = 0.95: [116, 999]
• 50 queries were used
Example: Web ad hoc
• TREC-8 to TREC-2001: WT2g and WT10g
– Eρ² ∈ [0.86, 0.93] → τ ∈ [0.65, 0.81]
– Queries to get Eρ² = 0.95: [40, 220]
• TREC-2009 to TREC-2011: ClueWeb09
– Eρ² ∈ [0.80, 0.83] → τ ∈ [0.53, 0.59]
– Queries to get Eρ² = 0.95: [107, 438]
• 50 queries were used
Historical reliability in TREC
• On average, Eρ² = 0.88 → τ ≈ 0.7
• Some collections clearly unreliable
– Web Distillation 2003, Genomics 2005, Terabyte 2006,
Enterprise 2008, Medical 2011 and Web Ad Hoc 2011
• 50 queries not enough for stable rankings, about
200 are needed in most cases
Implications
• Fixing a minimum number of queries across
tracks is unrealistic
– Not even across editions of the same task
• Need to analyze on a case-by-case basis, while
building the collections
– GT4IReval, R package online
Validity
• Similar studies in Text IR to map effectiveness
onto user satisfaction
• Particularly interesting because there are several
query types, and users behave differently
– Single measure to use in all cases?
– Use different measures and average them all?
• Further user studies to figure out what makes
users say good and better
• How should test collections be extended to
incorporate more user information?
Reliability
• Study assessor effect
• Study document collection effect
• Better models to map G-theory indicators onto
understandable data-based indicators
• Methods to reliably measure reliability while
building the collection
General
• Cleverdon, C. W. (1991). The Significance of the Cranfield Tests on Index Languages. In International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 3–12).
• Sanderson, M. (2010). Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends
in Information Retrieval, 4(4), 247–375.
• Robertson, S. (2008). On the History of Evaluation in IR. Journal of Information Science, 34(4), 439–456.
• Harman, D. K. (2011). Information Retrieval Evaluation. Synthesis Lectures on Information Concepts, Retrieval,
and Services, 3(2), 1–119.
• Voorhees, E. M. (2002). The Philosophy of Information Retrieval Evaluation. In Workshop of the Cross-Language
Evaluation Forum (pp. 355–370).
• Tague-Sutcliffe, J. (1992). The Pragmatics of Information Retrieval Experimentation, Revisited. Information
Processing and Management, 28(4), 467–490.
• Gull, C. D. (1956). Seven Years of Work on the Organisation of Materials in a Special Library. American
Documentation, 7(4), 320–329.
• Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in Music Information Retrieval. Journal of Intelligent
Information Systems.
• Urbano, J. (2013). Evaluation in Audio Music Similarity. PhD dissertation, University Carlos III of Madrid.
• Trochim, W. M. K., & Donnelly, J. P. (2007). The Research Methods Knowledge Base (3rd ed.). Atomic Dog
Publishing.
• Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for
Generalized Causal Inference. Houghton-Mifflin.
• Zobel, J., Webber, W., Sanderson, M., & Moffat, A. (2011). Principles for Robust Evaluation Infrastructure. In ACM
CIKM Workshop on Data infrastructures for Supporting Information Retrieval Evaluation.
Validity
• Allan, J., Carterette, B., & Lewis, J. (2005). When Will Information Retrieval Be “Good Enough”? In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 433–440).
• Al-Maskari, A., Sanderson, M., & Clough, P. (2007). The Relationship between IR Effectiveness Measures and User
Satisfaction. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp.
773–774).
• Al-Maskari, A., Sanderson, M., Clough, P., & Airio, E. (2008). The Good and the Bad System: Does the Test
Collection Predict User’s Effectiveness. In International ACM SIGIR Conference on Research and Development in
Information Retrieval (pp. 59–66).
• Bailey, P., Craswell, N., Soboroff, I., Thomas, P., Vries, A. P. de, & Yilmaz, E. (2008). Relevance Assessment: Are
Judges Exchangeable and Does it Matter? In International ACM SIGIR Conference on Research and Development
in Information Retrieval (pp. 667–674).
• Bennett, P. N., Carterette, B., Chapelle, O., & Joachims, T. (2008). Beyond Binary Relevance: Preferences, Diversity
and Set-Level Judgments. ACM SIGIR Forum, 42(2), 53–58.
• Carterette, B. (2011). System Effectiveness, User Models, and User Utility: A General Framework for
Investigation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp.
903–912).
• Carterette, B., Bennett, P. N., Chickering, D. M., & Dumais, S. T. (2008). Here or There: Preference Judgments for
Relevance. In European Conference on Information Retrieval (pp. 16–27).
• Carterette, B., & Soboroff, I. (2010). The Effect of Assessor Error on IR System Evaluation. In International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 539–546).
• Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., & Olson, D. (2000). Do Batch and User
Evaluations Give the Same Results? In International ACM SIGIR Conference on Research and Development in
Information Retrieval (pp. 17–24).
Validity
• Hersh, W., Turpin, A., Sacherek, L., Olson, D., Price, S., Chan, B., & Kraemer, D. (2000). Further Analysis of
Whether Batch and User Evaluations Give the Same Results With a Question-Answering Task. In Text REtrieval
Conference.
• Hu, X., & Kando, N. (2012). User-Centered Measures vs. System Effectiveness in Finding Similar Songs. In
International Society for Music Information Retrieval Conference (pp. 331–336).
• Huffman, S. B., & Hochster, M. (2007). How Well does Result Relevance Predict Session Satisfaction? In
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 567–573).
• Ingwersen, P., & Järvelin, K. (2005). The Turn: Integration of Information Seeking and Retrieval in Context.
Springer.
• Järvelin, K. (2011). IR Research: Systems, Interaction, Evaluation and Theories. ACM SIGIR Forum, 45(2), 17–31.
• Mizzaro, S. (1997). Relevance: The Whole History. Journal of the American Society for Information Science, 48(9),
810–832.
• Sanderson, M., Paramita, M. L., Clough, P., & Kanoulas, E. (2010). Do User Preferences and Evaluation Measures
Line Up? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 555–
562).
• Schedl, M., Flexer, A., & Urbano, J. (2013). The Neglected User in Music Information Retrieval Research. Journal
of Intelligent Information Systems.
• Schedl, M., Stober, S., Gómez, E., Orio, N., & Liem, C. C. S. (2012). User-Aware Music Retrieval. In M. Müller, M.
Goto, & M. Schedl (Eds.), Multimodal Music Processing (pp. 135–156). Dagstuhl Publishing.
• Scholer, F., & Turpin, A. (2008). Relevance Thresholds in System Evaluations. In International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 693–694).
Validity
• Smucker, M. D., & Clarke, C. L. A. (2012). The Fault, Dear Researchers, is Not in Cranfield, But in Our Metrics, that
They Are Unrealistic. In European Workshop on Human-Computer Interaction and Information Retrieval (pp. 11–
12).
• Thom, J. A., & Scholer, F. (2007). A Comparison of Evaluation Measures Given How Users Perform on Search
Tasks. In Australasian Document Computing Symposium (pp. 100–103).
• Turpin, A., & Hersh, W. (2001). Why Batch and User Evaluations Do Not Give the Same Results. In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 225–231).
• Turpin, A., & Hersh, W. (2002). User Interface Effects in Past Batch Versus User Experiments. In International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 431–432).
• Turpin, A., & Scholer, F. (2006). User Performance Versus Precision Measures for Simple Search Tasks. In
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 11–18).
• Urbano, J., Downie, J. S., Mcfee, B., & Schedl, M. (2012). How Significant is Statistically Significant? The case of
Audio Music Similarity and Retrieval. In International Society for Music Information Retrieval Conference (pp.
181–186).
Reliability
• Allan, J., Aslam, J. A., Carterette, B., Pavlu, V., & Kanoulas, E. (2008). Million Query Track 2008 Overview. In Text
REtrieval Conference.
• Allan, J., Carterette, B., Aslam, J. A., Pavlu, V., Dachev, B., & Kanoulas, E. (2007). Million Query Track 2007
Overview. In Text REtrieval Conference.
• Armstrong, T. G., Moffat, A., Webber, W., & Zobel, J. (2009). Improvements that Don’t Add Up: Ad-Hoc Retrieval
Results since 1998. In ACM International Conference on Information and Knowledge Management (pp. 601–610).
• Banks, D., Over, P., & Zhang, N.-F. (1999). Blind Men and Elephants: Six Approaches to TREC data. Information
Retrieval, 1(1-2), 7–34.
• Bodoff, D. (2008). Test Theory for Evaluating Reliability of IR Test Collections. Information Processing and
Management, 44(3), 1117–1145.
• Bodoff, D., & Li, P. (2007). Test Theory for Assessing IR Test Collections. In International ACM SIGIR Conference on
Research and Development in Information Retrieval (pp. 367–374).
• Brennan, R. L. (2001). Generalizability Theory. Springer.
• Buckley, C., & Voorhees, E. M. (2000). Evaluating Evaluation Measure Stability. In International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 33–34).
• Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009). Million Query Track 2009 Overview. In Text REtrieval
Conference.
• Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2008). Evaluation Over Thousands of Queries. In
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 651–658).
• Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2009). If I Had a Million Queries. In European
Conference on Information Retrieval (pp. 288–300).
• Lin, W.-H., & Hauptmann, A. (2005). Revisiting the Effect of Topic Set Size on Retrieval Error. In International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 637–638).
Reliability
• Cormack, G. V., & Lynam, T. R. (2006). Statistical Precision of Information Retrieval Evaluation. In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 533–540).
• Robertson, S., & Kanoulas, E. (2012). On Per-Topic Variance in IR Evaluation. In International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 891–900).
• Sakai, T. (2007). On the Reliability of Information Retrieval Metrics Based on Graded Relevance. Information
Processing and Management, 43(2), 531–548.
• Sanderson, M., & Zobel, J. (2005). Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability. In
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 162–169).
• Sanderson, M., Turpin, A., Zhang, Y., & Scholer, F. (2012). Differences in Effectiveness Across Sub-collections. In
ACM International Conference on Information and Knowledge Management (pp. 1965–1969).
• Shavelson, R. J., & Webb, N. M. (1991). Generalizability Theory: A Primer. Sage Publications.
• Smucker, M. D., Allan, J., & Carterette, B. (2007). A Comparison of Statistical Significance Tests for Information
Retrieval Evaluation. In ACM International Conference on Information and Knowledge Management (pp. 623–
632).
• Urbano, J., Marrero, M., & Martín, D. (2013). A Comparison of the Optimality of Statistical Significance Tests for
Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in
Information Retrieval (pp. 925–928).
• Urbano, J., Marrero, M., & Martín, D. (2013). On the Measurement of Test Collection Reliability. In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 393–402).
• Voorhees, E. M. (2000). Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness.
Information Processing and Management, 36(5), 697–716.
• Voorhees, E. M. (2009). Topic Set Size Redux. In International ACM SIGIR Conference on Research and
Development in Information Retrieval (pp. 806–807).
Reliability
• Voorhees, E. M., & Buckley, C. (2002). The Effect of Topic Set Size on Retrieval Experiment Error. In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 316–323).
• Webber, W., Moffat, A., & Zobel, J. (2008). Statistical Power in Retrieval Experimentation. In ACM International
Conference on Information and Knowledge Management (pp. 571–580).
• Yilmaz, E., Aslam, J. A., & Robertson, S. (2008). A New Rank Correlation Coefficient for Information Retrieval. In
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 587–594).
• Zobel, J. (1998). How Reliable are the Results of Large-Scale Information Retrieval Experiments? In International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 307–314).