Designing Test Collections That Provide Tight Confidence Intervals
1. RD‐003
Designing Test Collections That
Provide Tight Confidence Intervals
@tetsuyasakai
Waseda University
September 5 @ FIT 2014, Tsukuba University
2. Acknowledgement
This research is a part of Waseda University’s project
“Taxonomising and Evaluating Web Search Engine User Behaviours,”
supported by Microsoft Research.
THANK YOU!
3. Takeaways
• It is possible to determine the topic set size n based on statistical
requirements: our approach requires a tight CI for every pairwise
system comparison.
• CIs depend on variances, and variances depend on the choice of
evaluation measures. Therefore test collections should be designed
with evaluation measures in mind.
• Our analysis can save a lot of relevance assessment cost: it provides a
set of statistically equally reliable designs (n, pd), where n is the topic
set size and pd is the pool depth, with substantially different costs.
4. TALK OUTLINE
1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work
5. Test collections =
standard data sets for evaluation
[Figure: systems are run against Test collection A and Test collection B; each run produces evaluation measure values]
6. An Information Retrieval (IR) test collection
Topic set (“Qrels” = query relevance sets): each topic comes with relevance assessments (relevant/nonrelevant documents), drawn from a document collection.
Example topic: FIT 2014 home page
Relevance assessments:
www.ipsj.or.jp/event/fit/fit2014/: highly relevant
www.ipsj.or.jp/event/fit/fit2014/exhibit.html: partially relevant
www.honda.co.jp/Fit/: nonrelevant
7. How IR people build test collections (1)
Okay, let’s build a test
collection…
Organiser
8. How IR people build test collections (2)
…with maybe n=50
topics (search
requests)…
[Figure: a stack of topic cards, Topic 1, Topic 2, …]
Well n>25 sounds good for statistical significance testing,
but why 50? Why not 100? Why not 30?
9. How IR people build test collections (3)
[Figure: a stack of 50 topic cards, Topic 1, Topic 2, …]
Okay folks, give me your
runs (search results)!
run run run
Participants
10. How IR people build test collections (4)
[Figure: a stack of 50 topic cards]
Pool depth pd=100 looks
affordable…
Top pd=100 documents
from each run
run run run
Pool for Topic 1
Document collection too large to do
exhaustive relevance assessments so
judge pooled documents only
11. How IR people build test collections (5)
[Figure: a stack of 50 topic cards]
Top pd=100 documents
from each run
Pool for Topic 1
Relevance assessments
Highly relevant
Partially relevant
Nonrelevant
12. An Information Retrieval (IR) test collection
Topic set (“Qrels” = query relevance sets): each topic comes with relevance assessments (relevant/nonrelevant documents), drawn from a document collection.
Example topic: FIT 2014 home page
Relevance assessments:
www.ipsj.or.jp/event/fit/fit2014/: highly relevant
www.ipsj.or.jp/event/fit/fit2014/exhibit.html: partially relevant
www.honda.co.jp/Fit/: nonrelevant
n=50 topics… why?
Pool depth pd=100 (not exhaustive)
13. TALK OUTLINE
1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work
14. NHST = null hypothesis significance testing (1)
EXAMPLE: paired t‐test for comparing systems X and Y with n topics
Assumptions: the per-topic score differences $d_j = x_j - y_j$ ($j = 1, \dots, n$) are i.i.d. normal: $d_j \sim N(\mu, \sigma^2)$.
Null hypothesis: $H_0: \mu = 0$ (the population means of X and Y are the same).
Test statistic: $t_0 = \bar{d} / \sqrt{V/n}$, where $\bar{d} = \frac{1}{n}\sum_j d_j$ and $V = \frac{1}{n-1}\sum_j (d_j - \bar{d})^2$.
15. NHST = null hypothesis significance testing (2)
EXAMPLE: paired t‐test for comparing systems X and Y with n topics
Null hypothesis: $H_0: \mu = 0$
Test statistic: $t_0 = \bar{d} / \sqrt{V/n}$
Under H0, t0 obeys a t distribution with n-1 degrees of freedom.
16. NHST = null hypothesis significance testing (3)
EXAMPLE: paired t‐test for comparing systems X and Y with n topics
Null hypothesis: $H_0: \mu = 0$
Under H0, t0 obeys a t distribution with n‐1 degrees of freedom.
Given a significance criterion α(=0.05),
reject H0 if |t0| >= t(n‐1; α).
[Figure: t distribution with n-1 degrees of freedom (n=50); the rejection regions lie beyond -t(n-1; α) and +t(n-1; α)]
“H0 is probably not true, because the chance of observing t0 under H0 is very small.”
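For concreteness, here is a minimal sketch of this paired t-test in Python, assuming numpy and scipy are available; the function name and the sample scores are illustrative, not from the talk.

```python
import numpy as np
from scipy import stats

def paired_t_test(x, y, alpha=0.05):
    """Two-sided paired t-test on per-topic scores of systems X and Y."""
    d = np.asarray(x) - np.asarray(y)              # per-topic differences d_j
    n = len(d)
    d_bar = d.mean()                               # sample mean of the differences
    V = d.var(ddof=1)                              # unbiased sample variance
    t0 = d_bar / np.sqrt(V / n)                    # test statistic
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # two-sided critical value t(n-1; alpha)
    return t0, t_crit, abs(t0) >= t_crit           # reject H0 if |t0| >= t(n-1; alpha)

# Hypothetical per-topic scores of two systems over n=5 topics:
x = [0.42, 0.55, 0.38, 0.61, 0.50]
y = [0.40, 0.48, 0.35, 0.58, 0.47]
print(paired_t_test(x, y))
```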
17. NHST = null hypothesis significance testing (4)
EXAMPLE: paired t‐test for comparing systems X and Y with n topics
Null hypothesis: $H_0: \mu = 0$
Given a significance criterion α(=0.05), reject H0 if |t0| >= t(n‐1; α).
[Figure: two t distributions with n-1 degrees of freedom (n=50). Left: t0 falls beyond ±t(n-1; α), so the conclusion is X ≠ Y. Right: t0 falls between -t(n-1; α) and t(n-1; α), so H0 is not rejected and we don't know.]
18. NHST is not good enough [Cumming12]
• Dichotomous thinking (“different or not different?”). A more important
question is “what is the magnitude of the difference?” Another is
“how accurate is my estimate?”
• p‐values are a little more informative than “significant at α=0.05”, but…
[Figure: t distribution with n-1 degrees of freedom (n=50); the shaded area beyond t0 is the p‐value, i.e. the probability of observing t0 or something more extreme under H0]
19. The p‐value is not good enough either
[Nagata03]
Reject H0 if $|t_0| \geq t(n-1; \alpha)$, where $t_0 = \bar{d}/\sqrt{V/n} = \sqrt{n} \cdot \bar{d}/\sqrt{V}$.
But a large $|t_0|$ could mean two things:
(1) the sample effect size (ES) $\bar{d}/\sqrt{V}$ (the difference between X and Y
measured in standard deviation units) is large;
(2) the topic set size n is large.
If you increase the sample size n, you can always achieve statistical
significance!
20. Statistical reform – effect sizes
[Cumming12,Okubo12]
• ES: “how much difference is there?”
• ES for the paired t-test measures the difference in standard deviation units:
Population ES = $\mu / \sigma$
Sample ES (as an estimate of the above) = $\bar{d} / \sqrt{V}$
In several research disciplines, such as psychology and medicine, reporting
ESs is required! But ESs are rarely discussed in IR, NLP, etc…
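As a minimal sketch (names are illustrative, not from the talk), the sample ES is one line on top of the same per-topic differences:

```python
import numpy as np

def sample_effect_size(x, y):
    """Sample ES for the paired t-test: mean difference in standard deviation units."""
    d = np.asarray(x) - np.asarray(y)
    return d.mean() / d.std(ddof=1)  # d_bar / sqrt(V); note that t0 = sqrt(n) * ES
```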
21. Statistical reform – confidence intervals
• CIs are much more informative than NHST (point estimate + uncertainty/accuracy)
• Estimation thinking, not dichotomous thinking [Cumming12] [Sakai14forum]
In several research disciplines, such as psychology and medicine, reporting
CIs is required! But CIs are rarely discussed in IR, NLP, etc…
22. TALK OUTLINE
1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work
23. CI basics (1)
$t = (\bar{d} - \mu) / \sqrt{V/n}$ obeys a t distribution with n-1 degrees of freedom.
Hence, for a given α, $\Pr(-t(n-1; \alpha) \leq t \leq t(n-1; \alpha)) = 1 - \alpha$, with probability α/2 left in each tail.
24. CI basics (2)
$t = (\bar{d} - \mu) / \sqrt{V/n}$ obeys a t distribution with n-1 degrees of freedom.
Hence, for a given α, $\Pr(-t(n-1; \alpha) \leq t \leq t(n-1; \alpha)) = 1 - \alpha$
⇒ $\Pr(\bar{d} - \mathrm{MOE} \leq \mu \leq \bar{d} + \mathrm{MOE}) = 1 - \alpha$,
where $\mathrm{MOE} = t(n-1; \alpha)\sqrt{V/n}$ (the margin of error).
That is, the 95% CI of the difference between X and Y is given by $[\bar{d} - \mathrm{MOE}, \bar{d} + \mathrm{MOE}]$.
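A minimal sketch of this CI computation, again assuming numpy and scipy (names illustrative):

```python
import numpy as np
from scipy import stats

def paired_ci(x, y, alpha=0.05):
    """(1 - alpha) CI for the population mean difference between systems X and Y."""
    d = np.asarray(x) - np.asarray(y)
    n = len(d)
    # margin of error: t(n-1; alpha) * sqrt(V/n)
    moe = stats.t.ppf(1 - alpha / 2, df=n - 1) * np.sqrt(d.var(ddof=1) / n)
    return d.mean() - moe, d.mean() + moe
```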
25. Sample size design based on a tight CI (1)
[Nagata03]
• To set the topic set size n, require that the CI width (2·MOE) be no larger
than a constant δ.
• Since $\mathrm{MOE} = t(n-1; \alpha)\sqrt{V/n}$ contains a random variable V,
impose the above on the expectation of the CI width. That is,
require: $E[2\, t(n-1; \alpha)\sqrt{V/n}] \leq \delta$.
26. Sample size design based on a tight CI (2)
[Nagata03]
• Require: $E[2\, t(n-1; \alpha)\sqrt{V/n}] \leq \delta$.
• It is known that $E[\sqrt{V}] = \sqrt{2/(n-1)} \cdot \Gamma(n/2)/\Gamma((n-1)/2) \cdot \sigma$
(cf. the expectation of the sample standard deviation under normality).
• So what we want is the smallest n that satisfies:
$2\, t(n-1; \alpha) \sqrt{2/(n(n-1))} \cdot \Gamma(n/2)/\Gamma((n-1)/2) \cdot \sigma \leq \delta$.
No closed form for n.
27. Sample size design based on a tight CI (3)
[Nagata03]
• So what we want is the smallest n that satisfies:
$2\, t(n-1; \alpha) \sqrt{2/(n(n-1))} \cdot \Gamma(n/2)/\Gamma((n-1)/2) \cdot \sigma \leq \delta$ (no closed form for n).
• To find the n, start with the “easy” case where the population
variance is known.
Variance unknown: CI width $2\, t(n-1; \alpha)\sqrt{V/n}$.
Variance known: CI width $2\, z(\alpha)\, \sigma/\sqrt{n}$, where $z(\alpha)$ is the two-sided normal critical value.
28. Sample size design based on a tight CI (4)
[Nagata03]
• So what we want is the smallest n that satisfies:
$2\, t(n-1; \alpha) \sqrt{2/(n(n-1))} \cdot \Gamma(n/2)/\Gamma((n-1)/2) \cdot \sigma \leq \delta$ (no closed form for n).
• To find the n, start with the “easy” case where the population
variance is known.
• Require: $2\, z(\alpha)\, \sigma/\sqrt{n} \leq \delta$, i.e. $n \geq (2\, z(\alpha)\, \sigma/\delta)^2$.
• Obtain the smallest n' s.t. $n' \geq (2\, z(\alpha)\, \sigma/\delta)^2$ and increment it until
the original requirement is met!
But we need an estimate of $\sigma^2$.
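A minimal sketch of this search for n in Python, assuming scipy is available; it evaluates the expected CI width via the Gamma-function identity above (log-gamma is used for numerical stability) and increments n from the variance-known starting point. Function names are illustrative, not the author's code.

```python
import math
from scipy import stats

def expected_ci_width(n, sigma2, alpha=0.05):
    """E[2 t(n-1; alpha) sqrt(V/n)] given population variance sigma2,
    using E[sqrt(V)] = sqrt(2/(n-1)) * Gamma(n/2)/Gamma((n-1)/2) * sigma."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    c = math.sqrt(2.0 / (n - 1)) * math.exp(math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0))
    return 2.0 * t_crit * c * math.sqrt(sigma2 / n)

def topic_set_size(sigma2, delta, alpha=0.05):
    """Smallest n whose expected CI width is no larger than delta [Nagata03]."""
    z = stats.norm.ppf(1 - alpha / 2)                                  # variance-known case
    n = max(2, math.ceil((2.0 * z * math.sqrt(sigma2) / delta) ** 2))  # starting point n'
    while expected_ci_width(n, sigma2, alpha) > delta:                 # increment until met
        n += 1
    return n
```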
29. Estimating $\hat{\sigma}^2$ (1)
Task               | Data      | #topics | #runs | pd    | #docs
Adhoc news IR      | TREC03new | 50      | 78    | 125   | 528,155 news articles
Adhoc news IR      | TREC04new | 49      | 78    | 100   | ditto
Adhoc web IR       | TREC11w   | 50      | 37    | 25    | one billion web pages
Adhoc web IR       | TREC12w   | 50      | 28    | 20/30 | ditto
Diversified web IR | TREC11wD  | 50      | 25    | 25    | ditto
Diversified web IR | TREC12wD  | 50      | 20    | 20/30 | ditto
Compute V for every system pair (e.g. 78*77/2 = 3,003 pairs for TREC03new);
then take the 95th percentile [Webber08].
Pool the variance estimates from the two data sets of each task.
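A minimal sketch of this variance-estimation step, assuming a (#runs × #topics) numpy array of evaluation-measure scores (names illustrative):

```python
import itertools
import numpy as np

def estimate_sigma2(scores, pct=95):
    """Variance of the per-topic differences for every run pair;
    return the 95th percentile as a conservative estimate [Webber08]."""
    variances = [np.var(scores[i] - scores[j], ddof=1)
                 for i, j in itertools.combinations(range(len(scores)), 2)]
    return np.percentile(variances, pct)
```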
30. Estimating $\hat{\sigma}^2$ (2)
[Table: variance estimates for each evaluation measure, one set for measures evaluating the top 1,000 documents and one for measures evaluating the top 10 documents; see [Sakai14PROMISE] for definitions of the measures]
31. Demo
Just enter your requirements (α, δ) and a variance estimate $\hat{\sigma}^2$, and you will get your n!
http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx
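The same computation can be reproduced with the topic_set_size sketch shown earlier; the inputs below are hypothetical, not taken from the talk.

```python
# e.g. alpha=0.05, desired CI width delta=0.10, variance estimate sigma2=0.05:
print(topic_set_size(sigma2=0.05, delta=0.10, alpha=0.05))
```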
32. TALK OUTLINE
1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work
33. Results
• Among the adhoc IR measures, Q requires the fewest topics.
• Among the diversified IR measures, D‐nDCG requires the fewest topics.
• The required n depends heavily on the stability of the evaluation measure!
34. What if we reduce the pool depth pd?
[Figure: the pooling diagram again, for adhoc news with l=1,000 (pd=100) only: n=50 topics; the top pd=100 documents from each run form the pool for Topic 1; relevance assessments label documents as highly relevant, partially relevant, or nonrelevant]
35. Pool depth vs. variance
• pd reduced from 100 to 10
• #relevance assessments per topic also reduced
• Variance increases in general, except for nERR
36. Statistically equivalent test collection designs for TREC adhoc news (l=1,000)
For Q, the pd=10 design is only 18% as costly as the pd=100 design!
37. TALK OUTLINE
1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work
38. Takeaways
• It is possible to determine the topic set size n based on statistical
requirements: our approach requires a tight CI for every pairwise
system comparison.
• CIs depend on variances, and variances depend on the choice of
evaluation measures. Therefore test collections should be designed
with evaluation measures in mind.
• Our analysis can save a lot of relevance assessment cost: it provides a
set of statistically equally reliable designs (n, pd), where n is the topic
set size and pd is the pool depth, with substantially different costs.
39. Future work
• Alternative approach: determining n from a minimum detectable ES
instead of a maximum allowable CI: DONE [Sakai14CIKM]
• Using variance estimates based on ANOVA statistics: DONE
• Estimating n for various tasks (not just IR) – the method is applicable
to any paired‐data evaluation tasks
• Given a set of statistically equally reliable designs (n, pd), choose the
best one based on reusability (can we evaluate new systems fairly?)
and assessment cost
40. References
[Cumming12] Cumming, G.: Understanding The New Statistics: Effect Sizes, Confidence
Intervals, and Meta‐Analysis. Routledge, 2012.
[Nagata03] Nagata, Y.: How to Design the Sample Size. Asakura Shoten, 2003.
[Okubo12] Okubo, M. and Okada, K.: Psychological Statistics to Tell Your Story: Effect Size,
Confidence Interval (in Japanese). Keiso Shobo, 2012.
[Sakai14PROMISE] Sakai, T.: Metrics, Statistics, Tests. PROMISE Winter School 2013:
Bridging between Information Retrieval and Databases (LNCS 8173), pp.116‐163, Springer,
2014.
[Sakai14forum] Sakai, T.: Statistical Reform in Information Retrieval?, SIGIR Forum, 48(1),
2014.
[Sakai14CIKM] Sakai, T.: Designing Test Collections for Comparing Many Systems, ACM
CIKM 2014, to appear, 2014.
[Webber08] Webber, W., Moffat, A. and Zobel, J.: Statistical Power in Retrieval
Experimentation. ACM CIKM 2008, pp.571–580, 2008.