RD‐003 
Designing Test Collections That 
Provide Tight Confidence Intervals 
@tetsuyasakai 
Waseda University 
September 5 @ FIT 2014, Tsukuba University
Acknowledgement 
This research is a part of Waseda University’s project 
“Taxonomising and Evaluating Web Search Engine User Behaviours,” 
supported by Microsoft Research. 
THANK YOU!
Takeaways 
• It is possible to determine the topic set size n based on statistical 
requirements. Our approach requires a tight CI for any pairwise 
system comparison. 
• CIs depend on variances, and variances depend on the choice of 
evaluation measures. Therefore test collections should be designed 
with evaluation measures in mind. 
• Our analysis can save a lot of relevance assessment cost: it provides a 
set of statistically equally reliable designs (n, pd), where n is the topic 
set size and pd is the pool depth, with substantially different costs.
TALK OUTLINE 
1. How Information Retrieval (IR) test collections are constructed 
2. Statistical reform 
3. How test collections SHOULD be constructed 
4. Experimental results 
5. Conclusions and future work
Test collections = 
standard data sets for evaluation 
[Diagram: runs are evaluated against Test collection A and Test collection B, 
each yielding evaluation measure values]
An Information Retrieval (IR) test collection 
Topic set + “Qrels” (= query relevance sets) + Document collection 
Each topic comes with relevance assessments (relevant/nonrelevant documents). 
Example topic: “FIT 2014 home page” 
  www.ipsj.or.jp/event/fit/fit2014/: highly relevant 
  www.ipsj.or.jp/event/fit/fit2014/exhibit.html: partially relevant 
  www.honda.co.jp/Fit/: nonrelevant
How IR people build test collections (1) 
Okay, let’s build a test 
collection… 
Organiser
How IR people build test collections (2) 
…with maybe n=50 
topics (search 
requests)… 
Topic 1, Topic 2, … (a stack of topics)
Well n>25 sounds good for statistical significance testing, 
but why 50? Why not 100? Why not 30?
How IR people build test collections (3) 
Topic 1, Topic 2, … (a stack of topics)
50 topics 
Okay folks, give me your 
runs (search results)! 
run run run 
Participants
How IR people build test collections (4) 
Topic 1, Topic 2, … (a stack of topics)
50 topics 
Pool depth pd=100 looks 
affordable… 
Top pd=100 documents 
from each run 
run run run 
Pool 
for 
Topic 1 
The document collection is too large for 
exhaustive relevance assessments, so 
judge the pooled documents only
How IR people build test collections (5) 
Topic 1, Topic 2, … (a stack of topics)
50 topics 
Top pd=100 documents 
from each run 
Pool 
for 
Topic 1 
Relevance assessments 
Highly relevant 
Partially relevant 
Nonrelevant
An Information Retrieval (IR) test collection 
Topic set + “Qrels” (= query relevance sets) + Document collection 
Each topic comes with relevance assessments (relevant/nonrelevant documents). 
Example topic: “FIT 2014 home page” 
  www.ipsj.or.jp/event/fit/fit2014/: highly relevant 
  www.ipsj.or.jp/event/fit/fit2014/exhibit.html: partially relevant 
  www.honda.co.jp/Fit/: nonrelevant 
n=50 topics… why? 
Pool depth pd=100 (not exhaustive)
TALK OUTLINE 
1. How Information Retrieval (IR) test collections are constructed 
2. Statistical reform 
3. How test collections SHOULD be constructed 
4. Experimental results 
5. Conclusions and future work
NHST = null hypothesis significance testing (1) 
EXAMPLE: paired t‐test for comparing systems X and Y with n topics 
Assumptions: the per‐topic score differences dj = xj − yj are i.i.d. and normally distributed. 
Null hypothesis H0: μX = μY (the population means are the same). 
Test statistic: t0 = d̄ / √(V/n), where d̄ is the mean of the dj’s and V is their sample variance.
NHST = null hypothesis significance testing (2) 
EXAMPLE: paired t‐test for comparing systems X and Y with n topics 
Null hypothesis H0: μX = μY. 
Test statistic: t0 = d̄ / √(V/n). 
Under H0, t0 obeys a t distribution with n‐1 degrees of freedom.
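The computation above can be sketched in Python. The per‐topic scores below are hypothetical examples, and paired_t is an illustrative helper rather than code from the talk:

```python
import math

def paired_t(x, y):
    """Paired t statistic t0 for per-topic scores of systems X and Y.

    x[j], y[j]: an evaluation measure value (e.g. average precision)
    for topic j.  A sketch for illustration only.
    """
    n = len(x)
    d = [xj - yj for xj, yj in zip(x, y)]            # per-topic differences
    dbar = sum(d) / n                                # mean difference
    v = sum((dj - dbar) ** 2 for dj in d) / (n - 1)  # sample variance V
    return dbar / math.sqrt(v / n)                   # t0, with n-1 df
```

With n=50 topics, |t0| would then be compared against the two‐sided critical value t(49; 0.05) ≈ 2.01.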
NHST = null hypothesis significance testing (3) 
EXAMPLE: paired t‐test for comparing systems X and Y with n topics 
Null hypothesis H0: μX = μY. 
Under H0, t0 obeys a t distribution with n‐1 degrees of freedom. 
Given a significance criterion α (=0.05), 
reject H0 if |t0| >= t(n‐1; α). 
[Figure: t distribution with n‐1 = 49 degrees of freedom (n=50); 
rejection regions lie beyond ±t(n‐1; α)] 
“H0 is probably not true because 
the chance of observing t0 under H0 
is very small”
NHST = null hypothesis significance testing (4) 
EXAMPLE: paired t‐test for comparing systems X and Y with n topics 
Null hypothesis H0: μX = μY. 
Given a significance criterion α (=0.05), reject H0 if |t0| >= t(n‐1; α). 
[Left figure: |t0| >= t(n‐1; α) ⇒ Conclusion: X ≠ Y!] 
[Right figure: |t0| < t(n‐1; α) ⇒ Conclusion: H0 not rejected, so we don’t know]
NHST is not good enough [Cumming12] 
• Dichotomous thinking (“different or not different?”) 
A more important question is “what is the magnitude of the 
difference?” Another is “how accurate is my estimate?” 
• p‐values are a little more informative than “significant at α=0.05”, but… 
[Figure: t distribution (n=50); the p‐value is the probability of 
observing t0 or something more extreme under H0]
The p‐value is not good enough either 
[Nagata03] 
Reject H0 if |t0| >= t(n‐1; α), where t0 = √n · d̄/√V = √n × (sample effect size). 
But a large |t0| could mean two things: 
(1) The sample effect size (ES), i.e. the difference between X and Y 
measured in standard deviation units, is large; 
(2) The topic set size n is large. 
If you increase the sample size n, you can always achieve statistical 
significance!
Statistical reform – effect sizes 
[Cumming12,Okubo12] 
• ES: “how much difference is there?” 
• The ES for the paired t test measures the difference in standard deviation units: 
Population ES = (μX − μY)/σ, where σ is the population standard deviation 
of the per‐topic differences 
Sample ES (as an estimate of the above) = d̄/√V 
In several research disciplines such as psychology and medicine, researchers 
are required to report ESs! But ESs are rarely discussed in IR, NLP, etc…
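For the paired design, the sample ES is simply the mean difference divided by the sample standard deviation of the differences. A sketch with hypothetical scores (sample_effect_size is an illustrative helper):

```python
import math

def sample_effect_size(x, y):
    """Sample effect size d-bar / sqrt(V) for paired per-topic scores:
    the mean difference in standard deviation units."""
    n = len(x)
    d = [xj - yj for xj, yj in zip(x, y)]
    dbar = sum(d) / n
    v = sum((dj - dbar) ** 2 for dj in d) / (n - 1)
    return dbar / math.sqrt(v)
```

Note that t0 = √n × (sample ES), which is exactly why a large |t0| alone cannot tell you whether the effect is large or n is large.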
Statistical reform – confidence intervals 
• CIs are much more 
informative than NHST 
(point estimate + 
uncertainty/accuracy) 
• Estimation thinking, not 
dichotomous thinking 
[Cumming12] 
[Sakai14forum] 
In several research disciplines such as psychology and medicine, researchers 
are required to report CIs! But CIs are rarely discussed in IR, NLP, etc…
TALK OUTLINE 
1. How Information Retrieval (IR) test collections are constructed 
2. Statistical reform 
3. How test collections SHOULD be constructed 
4. Experimental results 
5. Conclusions and future work
CI basics (1) 
t = (d̄ − μ)/√(V/n) obeys a t distribution with n‐1 degrees of freedom, 
where μ is the population mean of the per‐topic differences. 
Hence, for a given α: 
Pr( −t(n‐1; α) <= (d̄ − μ)/√(V/n) <= t(n‐1; α) ) = 1 − α (probability α/2 in each tail)
CI basics (2) 
(d̄ − μ)/√(V/n) obeys a t distribution with n‐1 degrees of freedom. 
Hence, for a given α: 
Pr( −t(n‐1; α) <= (d̄ − μ)/√(V/n) <= t(n‐1; α) ) = 1 − α 
⇒ Pr( d̄ − MOE <= μ <= d̄ + MOE ) = 1 − α, 
where MOE = t(n‐1; α) √(V/n) (the margin of error). 
That is, the 95% CI of the difference between X and Y is given by [d̄ − MOE, d̄ + MOE].
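The CI computation can be sketched as follows. Here t_crit is the two‐sided critical value t(n‐1; α), supplied by the caller (e.g. from a t table), and the scores are hypothetical:

```python
import math

def mean_diff_ci(x, y, t_crit):
    """CI [dbar - MOE, dbar + MOE] for the population mean difference,
    where MOE = t(n-1; alpha) * sqrt(V/n).  A sketch for illustration."""
    n = len(x)
    d = [xj - yj for xj, yj in zip(x, y)]
    dbar = sum(d) / n
    v = sum((dj - dbar) ** 2 for dj in d) / (n - 1)
    moe = t_crit * math.sqrt(v / n)
    return dbar - moe, dbar + moe
```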
Sample size design based on a tight CI (1) 
[Nagata03] 
• To set the topic set size n, require that the width of the CI (2*MOE) be no larger 
than a constant δ. 
• Since MOE = t(n‐1; α) √(V/n) contains a random variable V, 
impose the above on the expectation of the CI width. That is, 
require: E[2*MOE] = 2 t(n‐1; α) E[√V]/√n <= δ
Sample size design based on a tight CI (2) 
[Nagata03] 
• Require: 2 t(n‐1; α) E[√V]/√n <= δ 
• It is known that E[√V] = σ √(2/(n‐1)) Γ(n/2)/Γ((n‐1)/2) 
cf. E[V] = σ² 
• So what we want is the 
smallest n that satisfies: 
2 t(n‐1; α) √(2/(n(n‐1))) (Γ(n/2)/Γ((n‐1)/2)) σ <= δ 
No closed form for n
Sample size design based on a tight CI (3) 
[Nagata03] 
• So what we want is the 
smallest n that satisfies: 
2 t(n‐1; α) √(2/(n(n‐1))) (Γ(n/2)/Γ((n‐1)/2)) σ <= δ 
(variance unknown; no closed form for n) 
• To find the n, start with the “easy” case where the population 
variance is known: 
2 z(α) σ/√n <= δ (variance known; closed form available)
Sample size design based on a tight CI (4) 
[Nagata03] 
• So what we want is the 
smallest n that satisfies: 
2 t(n‐1; α) √(2/(n(n‐1))) (Γ(n/2)/Γ((n‐1)/2)) σ <= δ 
(no closed form for n) 
• To find the n, start with the “easy” case where the population 
variance is known. 
• Require: 2 z(α) σ/√n <= δ 
• Obtain the smallest n’ s.t. n’ >= (2 z(α) σ/δ)², 
and increment it until 
the original requirement is met! 
But we need an estimate of the population variance σ²
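The search procedure just described can be sketched as below, under the normality assumption. Here t_crit is a caller‐supplied function returning the two‐sided critical value t(df; α) (e.g. scipy.stats.t.ppf(0.975, df)), sigma2 is the variance estimate, and delta is the maximum allowable CI width; the Γ ratio is computed via math.lgamma. This is an illustrative sketch, not the exact tool from the demo slide:

```python
import math

Z = 1.959964  # two-sided 5% critical value of the standard normal

def expected_ci_width(n, sigma2, t_crit_value):
    """E[2*MOE] = 2 * t(n-1; alpha) * E[sqrt(V)] / sqrt(n), where
    E[sqrt(V)] = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    e_sqrt_v = math.sqrt(sigma2) * math.sqrt(2 / (n - 1)) * math.exp(log_ratio)
    return 2 * t_crit_value * e_sqrt_v / math.sqrt(n)

def topic_set_size(sigma2, delta, t_crit):
    """Smallest n whose expected CI width is <= delta (iterative search)."""
    # Known-variance starting point: 2 * z * sigma / sqrt(n) <= delta.
    n = max(2, math.ceil((2 * Z * math.sqrt(sigma2) / delta) ** 2))
    # Increment until the variance-unknown requirement is met.
    while expected_ci_width(n, sigma2, t_crit(n - 1)) > delta:
        n += 1
    return n
```

With the true t quantile the loop typically adds a few topics on top of the normal‐based starting point.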
Estimating σ² (1) 

Data        #topics  runs  pd      #docs                    Task 
TREC03new   50       78    125     528,155 news articles    Adhoc news IR 
TREC04new   49       78    100     ditto                    Adhoc news IR 
TREC11w     50       37    25      One billion web pages    Adhoc web IR 
TREC12w     50       28    20/30   ditto                    Adhoc web IR 
TREC11wD    50       25    25      ditto                    Diversified web IR 
TREC12wD    50       20    20/30   ditto                    Diversified web IR 

Compute V for every system pair (e.g. 78*77/2 = 3,003 pairs), 
then take the 95th percentile [Webber08]. 
Pool the variance estimates from the two data sets of each task.
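The [Webber08]-style conservative choice above can be sketched as: compute V for every system pair, then take the 95th percentile (nearest-rank definition here; the inputs are hypothetical):

```python
import math

def conservative_variance(pair_variances):
    """95th percentile (nearest-rank) of per-pair variance estimates,
    used as a conservative sigma^2 for topic set size design (sketch)."""
    s = sorted(pair_variances)
    idx = max(0, math.ceil(0.95 * len(s)) - 1)
    return s[idx]
```

Taking a high percentile rather than the mean guards against underestimating the variance for unusually unstable system pairs.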
Estimating σ² (2) 
[Table of variance estimates omitted. See [Sakai14PROMISE] for definitions of 
the measures; some evaluate the top 1,000 documents, others the top 10 documents.]
Demo 
Just enter your requirements and you will get your n! 
http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx
TALK OUTLINE 
1. How Information Retrieval (IR) test collections are constructed 
2. Statistical reform 
3. How test collections SHOULD be constructed 
4. Experimental results 
5. Conclusions and future work
Results 
[Tables of required topic set sizes omitted] 
Among the adhoc measures, Q requires the fewest topics; 
among the diversity measures, D‐nDCG requires the fewest. 
The required n depends heavily 
on the stability of the evaluation 
measure!
What if we reduce the pool depth pd? 
Topic 1, Topic 2, … (a stack of topics)
For adhoc/news l=1000 (pd=100) only 
n=50 topics 
Top pd=100 documents 
from each run 
Pool 
for 
Topic 1 
Relevance assessments 
Highly relevant 
Partially relevant 
Nonrelevant
Pool depth vs. variance 
pd reduced from 100 to 10 ⇒ #relevance assessments per topic also reduced 
Variance increases in general, except for nERR
Statistically equivalent test collection designs 
for TREC adhoc news (l=1,000) 
[Table of (n, pd) designs omitted] 
For Q, the pd=10 design is only 18% as costly as the pd=100 design!
TALK OUTLINE 
1. How Information Retrieval (IR) test collections are constructed 
2. Statistical reform 
3. How test collections SHOULD be constructed 
4. Experimental results 
5. Conclusions and future work
Takeaways 
• It is possible to determine the topic set size n based on statistical 
requirements. Our approach requires a tight CI for any pairwise 
system comparison. 
• CIs depend on variances, and variances depend on the choice of 
evaluation measures. Therefore test collections should be designed 
with evaluation measures in mind. 
• Our analysis can save a lot of relevance assessment cost: it provides a 
set of statistically equally reliable designs (n, pd), where n is the topic 
set size and pd is the pool depth, with substantially different costs.
Future work 
• Alternative approach: determining n from a minimum detectable ES 
instead of a maximum allowable CI: DONE [Sakai14CIKM] 
• Using variance estimates based on ANOVA statistics: DONE 
• Estimating n for various tasks (not just IR) – the method is applicable 
to any paired‐data evaluation tasks 
• Given a set of statistically equally reliable designs (n,pd), choose the 
best one based on reusability and assessment cost 
Can we evaluate new systems fairly?
References 
[Cumming12] Cumming, G.: Understanding The New Statistics: Effect Sizes, Confidence 
Intervals, and Meta‐Analysis. Routledge, 2012. 
[Nagata03] Nagata, Y.: How to Design the Sample Size. Asakura Shoten, 2003. 
[Okubo12] Okubo, M. and Okada, K.: Psychological Statistics to Tell Your Story: Effect Size, 
Confidence Interval (in Japanese). Keiso Shobo, 2012. 
[Sakai14PROMISE] Sakai, T.: Metrics, Statistics, Tests. PROMISE Winter School 2013: 
Bridging between Information Retrieval and Databases (LNCS 8173), pp.116‐163, Springer, 
2014. 
[Sakai14forum] Sakai, T.: Statistical Reform in Information Retrieval?, SIGIR Forum, 48(1), 
2014. 
[Sakai14CIKM] Sakai, T.: Designing Test Collections for Comparing Many Systems, ACM 
CIKM 2014, to appear, 2014. 
[Webber08] Webber, W., Moffat, A. and Zobel, J.: Statistical Power in Retrieval 
Experimentation. ACM CIKM 2008, pp.571–580, 2008.
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 

Designing Test Collections That Provide Tight Confidence Intervals

  • 1. RD‐003 Designing Test Collections That Provide Tight Confidence Intervals @tetsuyasakai Waseda University September 5@FIT 2014, Tsukuba University
  • 2. Acknowledgement This research is a part of Waseda University’s project “Taxonomising and Evaluating Web Search Engine User Behaviours,” supported by Microsoft Research. THANK YOU!
  • 3. Takeaways • It is possible to determine the topic set size n based on statistical requirements. Our approach requires a tight CI for any pairwise system comparisons. • CIs depend on variances, and variances depend on the choice of evaluation measures. Therefore test collections should be designed with evaluation measures in mind. • Our analysis can save a lot of relevance assessment cost – provides a set of statistically equally reliable designs (n, pd) with substantially different costs. Topic set size Pool depth
  • 4. TALK OUTLINE 1. How Information Retrieval (IR) test collections are constructed 2. Statistical reform 3. How test collections SHOULD be constructed 4. Experimental results 5. Conclusions and future work
  • 5. Test collections = standard data sets for evaluation Test collection A Test collection B Evaluation measure values Evaluation measure values
  • 6. An Information Retrieval (IR) test collection: a topic set, "qrels" (query relevance sets), and a document collection. Each topic comes with relevance assessments (relevant/nonrelevant documents). Example topic "FIT 2014 home page": www.ipsj.or.jp/event/fit/fit2014/: highly relevant; www.ipsj.or.jp/event/fit/fit2014/exhibit.html: partially relevant; www.honda.co.jp/Fit/: nonrelevant.
  • 7. How IR people build test collections (1) Okay, let’s build a test collection… Organiser
  • 8. How IR people build test collections (2) …with maybe n=50 topics (search requests)… Topic 1, Topic 2, …, Topic n. Well, n>25 sounds good for statistical significance testing, but why 50? Why not 100? Why not 30?
  • 9. How IR people build test collections (3) Topic 1, …, Topic 50 (50 topics). Okay folks, give me your runs (search results)! Participants submit their runs.
  • 10. How IR people build test collections (4) Topic 1, …, Topic 50 (50 topics). Pool depth pd=100 looks affordable… The pool for Topic 1 is the top pd=100 documents from each run. The document collection is too large for exhaustive relevance assessments, so judge pooled documents only.
  • 11. How IR people build test collections (5) Topic 1, …, Topic 50 (50 topics). The top pd=100 documents from each run form the pool for Topic 1; relevance assessments label each pooled document as highly relevant, partially relevant, or nonrelevant.
  • 12. An Information Retrieval (IR) test collection: a topic set, qrels (query relevance sets), and a document collection. Example topic "FIT 2014 home page": www.ipsj.or.jp/event/fit/fit2014/: highly relevant; www.ipsj.or.jp/event/fit/fit2014/exhibit.html: partially relevant; www.honda.co.jp/Fit/: nonrelevant. n=50 topics… why? Pool depth pd=100 (not exhaustive).
  • 13. TALK OUTLINE 1. How Information Retrieval (IR) test collections are constructed 2. Statistical reform 3. How test collections SHOULD be constructed 4. Experimental results 5. Conclusions and future work
  • 14. NHST = null hypothesis significance testing (1) EXAMPLE: paired t‐test for comparing systems X and Y with n topics. Assumptions: the per‐topic score differences are independent and normally distributed. Null hypothesis H0: the population means of X and Y are the same. Test statistic: t0 = mean(d)/√(V/n), where d is the vector of per‐topic score differences and V is its sample variance.
  • 15. NHST = null hypothesis significance testing (2) EXAMPLE: paired t‐test for comparing systems X and Y with n topics. Null hypothesis H0: the population means are the same. Test statistic: t0 = mean(d)/√(V/n). Under H0, t0 obeys a t distribution with n‐1 degrees of freedom.
  • 16. NHST = null hypothesis significance testing (3) EXAMPLE: paired t‐test for comparing systems X and Y with n topics. Null hypothesis H0: the population means are the same. Under H0, t0 obeys a t distribution with n‐1 degrees of freedom. Given a significance criterion α (=0.05), reject H0 if |t0| >= t(n‐1; α). [Figure: t distribution density for n=50, with rejection regions beyond ±t(n‐1; α)] "H0 is probably not true because the chance of observing t0 under H0 is very small"
  • 17. NHST = null hypothesis significance testing (4) EXAMPLE: paired t‐test for comparing systems X and Y with n topics. Null hypothesis H0: the population means are the same. Given a significance criterion α (=0.05), reject H0 if |t0| >= t(n‐1; α). [Figure: two t distribution densities for n=50, with t0 inside and outside the rejection regions] If t0 falls in a rejection region, conclusion: X ≠ Y! Otherwise, conclusion: H0 not rejected, so we don't know.
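The paired t‐test on the slides above can be sketched in a few lines of Python. The per‐topic scores below are purely illustrative (they are not data from the talk); the comparison against t(n‐1; α) would use a t table or a statistics library.

```python
import math
import statistics

def paired_t_statistic(x, y):
    """t0 = mean(d) / sqrt(V/n), where d are the per-topic score
    differences and V is their unbiased sample variance."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    d_bar = statistics.mean(d)
    v = statistics.variance(d)  # unbiased sample variance of d
    return d_bar / math.sqrt(v / n)

# Hypothetical per-topic evaluation scores for systems X and Y:
x = [0.42, 0.55, 0.38, 0.61, 0.47, 0.50, 0.44, 0.58]
y = [0.35, 0.50, 0.40, 0.52, 0.41, 0.45, 0.39, 0.51]
t0 = paired_t_statistic(x, y)
# Reject H0 at criterion alpha if |t0| >= t(n-1; alpha) from a t table.
```

With these made‐up scores t0 is about 4.6, far beyond the two‐sided critical value t(7; 0.05) ≈ 2.365, so H0 would be rejected.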
  • 18. NHST is not good enough [Cumming12] • Dichotomous thinking ("different or not different?"). A more important question is "what is the magnitude of the difference?" Another is "how accurate is my estimate?" • p‐values are a little more informative than "significant at α=0.05," but… [Figure: t distribution for n=50; the p‐value is the probability of observing t0 or something more extreme under H0]
  • 19. The p‐value is not good enough either [Nagata03] Reject H0 if |t0| >= t(n‐1; α), where t0 = √n × mean(d)/√V, i.e. √n times the difference between X and Y measured in standard deviation units. But a large |t0| could mean two things: (1) the sample effect size (ES) is large; (2) the topic set size n is large. If you increase the sample size n, you can always achieve statistical significance!
  • 20. Statistical reform – effect sizes [Cumming12,Okubo12] • ES: "how much difference is there?" • The ES for the paired t‐test measures the difference in standard deviation units. Population ES = (difference of the population means)/σ; the sample ES, an estimate of the above, = mean(d)/√V. In several research disciplines such as psychology and medicine, it is required to report ESs! But ESs are rarely discussed in IR, NLP, etc…
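The sample effect size for paired data is just the mean per‐topic difference divided by its standard deviation. A minimal sketch, again with illustrative scores (note how t0 = √n × ES, which is why a large |t0| may simply reflect a large n):

```python
import math
import statistics

def sample_effect_size(x, y):
    """Sample ES for paired data: the mean per-topic difference
    measured in standard deviation units, mean(d)/sqrt(V)."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return statistics.mean(d) / math.sqrt(statistics.variance(d))

# Hypothetical per-topic scores (illustrative only):
x = [0.42, 0.55, 0.38, 0.61, 0.47, 0.50, 0.44, 0.58]
y = [0.35, 0.50, 0.40, 0.52, 0.41, 0.45, 0.39, 0.51]
es = sample_effect_size(x, y)
# t0 = sqrt(len(x)) * es reproduces the paired t statistic.
```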
  • 21. Statistical reform – confidence intervals • CIs are much more informative than NHST (point estimate + uncertainty/accuracy) • Estimation thinking, not dichotomous thinking [Cumming12] [Sakai14forum] In several research disciplines such as psychology and medicine, it is required to report CIs! But CIs are rarely discussed in IR, NLP, etc…
  • 22. TALK OUTLINE 1. How Information Retrieval (IR) test collections are constructed 2. Statistical reform 3. How test collections SHOULD be constructed 4. Experimental results 5. Conclusions and future work
  • 23. CI basics (1) t = (mean(d) − μ)/√(V/n), where μ is the population mean difference, obeys a t distribution with n‐1 degrees of freedom. Hence, for a given α, Pr( −t(n‐1; α) <= t <= t(n‐1; α) ) = 1 − α, with probability α/2 in each tail.
  • 24. CI basics (2) t = (mean(d) − μ)/√(V/n) obeys a t distribution with n‐1 degrees of freedom. Hence, for a given α, Pr( mean(d) − MOE <= μ <= mean(d) + MOE ) = 1 − α, where MOE = t(n‐1; α)√(V/n) is the margin of error. That is, the 95% CI of the difference between X and Y is given by mean(d) ± MOE.
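The CI computation is mechanical once the critical value is known. In this sketch the per‐topic differences are illustrative, and the two‐sided critical value t(7; 0.05) ≈ 2.365 is taken from a standard t table:

```python
import math
import statistics

def ci_for_mean_difference(d, t_crit):
    """CI for the mean per-topic difference:
    mean(d) +/- t(n-1; alpha) * sqrt(V/n)."""
    n = len(d)
    d_bar = statistics.mean(d)
    moe = t_crit * math.sqrt(statistics.variance(d) / n)
    return d_bar - moe, d_bar + moe

# Hypothetical per-topic differences; t(7; 0.05) ~= 2.365 (t table):
d = [0.07, 0.05, -0.02, 0.09, 0.06, 0.05, 0.05, 0.07]
lo, hi = ci_for_mean_difference(d, t_crit=2.365)
# The interval [lo, hi] excludes 0 here, consistent with rejecting H0.
```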
  • 25. Sample size design based on a tight CI (1) [Nagata03] • To set the topic set size n, require that the CI width (2*MOE) be no larger than a constant δ. • Since MOE = t(n‐1; α)√(V/n) contains a random variable V, impose the above on the expectation of the CI width. That is, require: 2 t(n‐1; α) E[√(V/n)] <= δ.
  • 26. Sample size design based on a tight CI (2) [Nagata03] • Require: 2 t(n‐1; α) E[√V]/√n <= δ. • It is known that E[√V] = √(2/(n‐1)) × (Γ(n/2)/Γ((n‐1)/2)) × σ. • So what we want is the smallest n that satisfies: 2 t(n‐1; α) √(2/(n‐1)) (Γ(n/2)/Γ((n‐1)/2)) σ/√n <= δ. There is no closed form for n.
  • 27. Sample size design based on a tight CI (3) [Nagata03] • So what we want is the smallest n that satisfies the above inequality (no closed form for n; in practice the population variance is unknown). • To find the n, start with the "easy" case where the population variance is known.
  • 28. Sample size design based on a tight CI (4) [Nagata03] • So what we want is the smallest n that satisfies the above inequality (no closed form for n). • To find the n, start with the "easy" case where the population variance is known. • Require: 2 z(α/2) σ/√n <= δ, where z(α/2) is the two‐sided normal critical value. • Obtain the smallest n' s.t. n' >= (2 z(α/2) σ/δ)², and increment it until the original requirement is met! But we need an estimate of the population variance σ².
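The known‐variance starting point can be computed directly with the normal quantile, which is in Python's standard library (the t quantile is not, so the subsequent increment‐and‐check loop is only described in a comment here). The σ and δ values are illustrative assumptions:

```python
import math
from statistics import NormalDist

def initial_topic_set_size(sigma_hat, delta, alpha=0.05):
    """Known-variance starting point for CI-based sample size design:
    smallest n' with 2 * z(alpha/2) * sigma / sqrt(n') <= delta,
    i.e. n' = ceil((2 * z(alpha/2) * sigma / delta) ** 2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal critical value
    return math.ceil((2 * z * sigma_hat / delta) ** 2)

# Hypothetical variance estimate and CI-width cap (illustrative values):
n0 = initial_topic_set_size(sigma_hat=0.15, delta=0.10, alpha=0.05)
# The exact design then increments n0 until
# 2 * t(n-1; alpha) * E[sqrt(V/n)] <= delta holds.
```

With σ̂ = 0.15 and δ = 0.10 this gives n0 = 35; the t‐based correction then nudges n upward slightly.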
  • 29. Estimating the variance (1) Data (#topics / runs / pd / #docs): TREC03new: 50 / 78 / 125 / 528,155 news articles (adhoc news IR); TREC04new: 49 / 78 / 100 / ditto (adhoc news IR); TREC11w: 50 / 37 / 25 / one billion web pages (adhoc web IR); TREC12w: 50 / 28 / 20/30 / ditto (adhoc web IR); TREC11wD: 50 / 25 / 25 / ditto (diversified web IR); TREC12wD: 50 / 20 / 20/30 / ditto (diversified web IR). Compute V for every system pair (78*77/2 = 3,003 pairs); then take the 95th percentile [Webber08]. Pool variance estimates from two data sets.
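The pairwise‐variance procedure above can be sketched as follows. The three systems and their scores are hypothetical, and the nearest‐rank percentile used here is a simple stand‐in for whatever percentile implementation the original study used:

```python
import statistics
from itertools import combinations

def conservative_variance_estimate(per_topic_scores, pct=0.95):
    """per_topic_scores: {system_name: [score per topic]}.
    Compute the sample variance of score differences for every system
    pair, then take a high percentile (nearest-rank) as a conservative
    variance estimate, following the idea on the slide [Webber08]."""
    variances = []
    for (_, x), (_, y) in combinations(per_topic_scores.items(), 2):
        d = [xi - yi for xi, yi in zip(x, y)]
        variances.append(statistics.variance(d))
    variances.sort()
    idx = min(len(variances) - 1, round(pct * (len(variances) - 1)))
    return variances[idx]

# Tiny illustrative example: 3 hypothetical systems over 4 topics.
scores = {
    "A": [0.5, 0.6, 0.4, 0.7],
    "B": [0.4, 0.5, 0.5, 0.6],
    "C": [0.2, 0.3, 0.1, 0.4],
}
v95 = conservative_variance_estimate(scores)
```

The resulting estimate feeds into the sample size design as σ̂².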
  • 30. Estimating the variance (2) See [Sakai14PROMISE] for definitions of the measures. [Tables: variance estimates per evaluation measure, for measures evaluating the top 1,000 documents and for measures evaluating the top 10 documents]
  • 31. Demo: just enter the required values and you will get your n! http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx
  • 32. TALK OUTLINE 1. How Information Retrieval (IR) test collections are constructed 2. Statistical reform 3. How test collections SHOULD be constructed 4. Experimental results 5. Conclusions and future work
  • 33. Results: Q requires the fewest topics in one setting; D‐nDCG requires the fewest topics in another. The required n depends heavily on the stability of the evaluation measure!
  • 34. What if we reduce the pool depth pd? Topic 1, …, Topic 50 (50 topics). For adhoc/news, l=1000 (pd=100) only. The top pd=100 documents from each run form the pool for Topic 1; relevance assessments: highly relevant / partially relevant / nonrelevant.
  • 35. Pool depth vs. variance: when pd is reduced from 100 to 10, the number of relevance assessments per topic is also reduced, and the variance increases in general, except for nERR.
  • 36. Statistically equivalent test collection designs for TREC adhoc news (l=1,000) For Q, the pd=10 design is only 18% as costly as the pd=100 design!
  • 37. TALK OUTLINE 1. How Information Retrieval (IR) test collections are constructed 2. Statistical reform 3. How test collections SHOULD be constructed 4. Experimental results 5. Conclusions and future work
  • 38. Takeaways • It is possible to determine the topic set size n based on statistical requirements. Our approach requires a tight CI for any pairwise system comparisons. • CIs depend on variances, and variances depend on the choice of evaluation measures. Therefore test collections should be designed with evaluation measures in mind. • Our analysis can save a lot of relevance assessment cost – provides a set of statistically equally reliable designs (n, pd) with substantially different costs. Topic set size Pool depth
  • 39. Future work • Alternative approach: determining n from a minimum detectable ES instead of a maximum allowable CI: DONE [Sakai14CIKM] • Using variance estimates based on ANOVA statistics: DONE • Estimating n for various tasks (not just IR) – the method is applicable to any paired‐data evaluation tasks • Given a set of statistically equally reliable designs (n,pd), choose the best one based on reusability and assessment cost Can we evaluate new systems fairly?
  • 40. References [Cumming12] Cumming, G.: Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta‐Analysis. Routledge, 2012. [Nagata03] Nagata, Y.: How to Design the Sample Size. Asakura Shoten, 2003. [Okubo12] Okubo, M. and Okada, K.: Psychological Statistics to Tell Your Story: Effect Size, Confidence Interval (in Japanese). Keiso Shobo, 2012. [Sakai14PROMISE] Sakai, T.: Metrics, Statistics, Tests. PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), pp.116‐163, Springer, 2014. [Sakai14forum] Sakai, T.: Statistical Reform in Information Retrieval?, SIGIR Forum, 48(1), 2014. [Sakai14CIKM] Sakai, T.: Designing Test Collections for Comparing Many Systems, ACM CIKM 2014, to appear, 2014. [Webber08] Webber, W., Moffat, A. and Zobel, J.: Statistical Power in Retrieval Experimentation. ACM CIKM 2008, pp.571–580, 2008.