Topic Set Size Design with Variance Estimates from Two‐Way ANOVA
Tetsuya Sakai 
Waseda University 
@tetsuyasakai 
http://www.f.waseda.jp/tetsuya/ 
December 9, EVIA 2014@NTCIR‐11, Tokyo.
One‐page takeaways
• The topic set size n for a new test collection can be determined systematically by
(a) ensuring high power (1‐β) whenever the between‐system difference (or the difference between the best and worst systems) is above a threshold; OR
(b) ensuring that the confidence interval for any pairwise system difference is below a threshold.
• The above methods require a variance estimate for a particular evaluation measure.
• Of the three variance estimation methods, the new Two‐Way ANOVA‐based method is the “safest” to use.
• The right balance between n and pd (pool depth) can reduce the assessment cost to (say) 18%.
TALK OUTLINE 
1. How test collections have been constructed 
2. How test collections should be constructed 
3. Obtaining system variance estimates 
4. Topic set size design results 
5. Conclusions and future work
How test collections have been constructed (1)
Organiser: “Okay, let’s build a test collection…”
How test collections have been constructed (2)
Organiser: “…with maybe n=50 topics (search requests)…”
[Figure: a stack of topic cards, Topic 1 onwards.]
Well, n>25 sounds good for statistical significance testing, but why 50? Why not 100? Why not 30?
How test collections have been constructed (3)
[Figure: the 50 topics.]
Organiser: “Okay folks, give me your runs (search results)!”
[Figure: participants submitting their runs.]
How test collections have been constructed (4)
Organiser: “Pool depth pd=100 looks affordable…”
[Figure: for each of the 50 topics, the top pd=100 documents from each run are merged into a pool, e.g. the pool for Topic 1.]
The document collection is too large for exhaustive relevance assessments, so only the pooled documents are judged.
How test collections have been constructed (5)
[Figure: the top pd=100 documents from each of the runs form the pool for Topic 1; relevance assessments label each pooled document as highly relevant, partially relevant, or nonrelevant.]
An Information Retrieval (IR) test collection
Three components: a topic set (n=50 topics… why?), the “qrels” (query relevance sets) obtained with pool depth pd=100 (not exhaustive), and the document collection.
Each topic comes with relevance assessments (relevant/nonrelevant documents). For example, for an “EVIA 2014 home page” topic:
research.nii.ac.jp/ntcir/evia2014/ : highly relevant
research.nii.ac.jp/ntcir/ntcir‐11/ : partially relevant
www.aroundevia.com : nonrelevant
TALK OUTLINE 
1. How test collections have been constructed 
2. How test collections should be constructed 
3. Obtaining system variance estimates 
4. Topic set size design results 
5. Conclusions and future work
How test collections should be constructed
• If p‐values / confidence intervals (CIs) [Sakai14SIGIRForum] are going to be computed, then the topic set size n should be determined starting from a set of statistical requirements [Nagata03].
• Two approaches:
(a) Power‐based [Sakai14CIKM]: ensure high power (1‐β) whenever the between‐system difference (or the difference between the best and worst systems) is above a threshold;
  (a1) t‐test (m=2 systems); (a2) one‐way ANOVA (m>=2 systems)
OR
(b) CI‐based [Sakai14FIT]: ensure that the CI for any pairwise system difference is below a threshold.
(a1) t‐test‐based topic set size design [Sakai14CIKM]
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeTTEST.xlsx
INPUT:
α (Type I error probability: detecting a nonexistent difference)
β (Type II error probability: missing a real difference)
minDt (minimum detectable difference between two systems)
σ̂² (variance of the between‐system difference)
OUTPUT: required topic set size n
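To make the power-based recipe concrete, here is a minimal Python sketch of the kind of computation such a tool performs, using SciPy’s noncentral t distribution. The function name and the search loop are my own illustration of the standard approach, not the internals of the spreadsheet above.

```python
from scipy import stats

def topic_set_size_ttest(alpha: float, beta: float, min_dt: float,
                         var_d: float, n_max: int = 10000) -> int:
    """Smallest n such that a two-sided paired t-test at level alpha
    has power >= 1 - beta when the true mean per-topic difference is
    min_dt and the per-topic difference variance is var_d."""
    for n in range(2, n_max + 1):
        df = n - 1
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        # Noncentrality parameter of the paired t statistic under the
        # alternative: min_dt / sqrt(var_d / n).
        ncp = min_dt * (n / var_d) ** 0.5
        power = (1 - stats.nct.cdf(t_crit, df, ncp)
                 + stats.nct.cdf(-t_crit, df, ncp))
        if power >= 1 - beta:
            return n
    raise ValueError("no n <= n_max achieves the requested power")
```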
(a2) ANOVA‐based topic set size design [Sakai14CIKM]
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx
INPUT:
α (Type I error probability: detecting a nonexistent difference)
β (Type II error probability: missing a real difference)
minD (minimum detectable range: the difference between the best and worst system means)
σ̂² (system variance, common to all systems)
m (number of systems to be compared)
OUTPUT: required topic set size n
[Diagram: system means μi around the grand mean μ, with system effects ai = μi − μ; minD spans from the worst to the best system mean.]
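A corresponding sketch for the one-way ANOVA case, using the least favourable configuration that is standard in sample size design (only the best and worst systems differ, by minD, so the noncentrality parameter is n·minD²/(2σ²)). Again this is an illustration of the technique, not the spreadsheet’s exact procedure.

```python
from scipy import stats

def topic_set_size_anova(alpha: float, beta: float, min_d: float,
                         var: float, m: int, n_max: int = 10000) -> int:
    """Smallest n (topics) such that one-way ANOVA over m systems at
    level alpha has power >= 1 - beta in the least favourable case:
    two systems at +/- min_d / 2 and the rest at the grand mean."""
    for n in range(2, n_max + 1):
        df1, df2 = m - 1, m * (n - 1)
        f_crit = stats.f.ppf(1 - alpha, df1, df2)
        # sum of squared system effects = min_d**2 / 2, hence:
        lam = n * min_d ** 2 / (2 * var)
        power = 1 - stats.ncf.cdf(f_crit, df1, df2, lam)
        if power >= 1 - beta:
            return n
    raise ValueError("no n <= n_max achieves the requested power")
```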
(b) CI‐based topic set size design [Sakai14FIT]
http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx
INPUT:
α (determines the confidence level 1−α of the CIs)
δ (upper bound on the CI width for any system pair)
σ̂² (variance of the between‐system difference)
OUTPUT: required topic set size n
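A sketch of the CI-based criterion: find the smallest n for which the two-sided 100(1−α)% CI width for a paired mean difference, 2·t·sqrt(σ̂²/n), stays below δ. Illustrative only; the tool itself is the spreadsheet above.

```python
from scipy import stats

def topic_set_size_ci(alpha: float, delta: float, var_d: float,
                      n_max: int = 10000) -> int:
    """Smallest n such that the width of the (1 - alpha) CI for a
    paired mean difference, 2 * t_crit * sqrt(var_d / n), is <= delta."""
    for n in range(2, n_max + 1):
        t_crit = stats.t.ppf(1 - alpha / 2, n - 1)
        if 2 * t_crit * (var_d / n) ** 0.5 <= delta:
            return n
    raise ValueError("no n <= n_max achieves the requested CI width")
```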
TALK OUTLINE 
1. How test collections have been constructed 
2. How test collections should be constructed 
3. Obtaining system variance estimates 
4. Topic set size design results 
5. Conclusions and future work
Variance estimation method 1
95th percentile method [Webber08CIKM]
For each of the k = m(m−1)/2 system pairs in the past data, compute the sample variance of the per‐topic between‐system score difference over the n topics. Sort the k variances and take the 95th percentile value as σ̂².
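A compact sketch of this procedure, assuming the past data is given as a NumPy topic-by-run matrix (the function name is mine):

```python
import numpy as np

def percentile_variance(scores: np.ndarray, q: float = 95.0) -> float:
    """scores: n-topics x m-runs matrix of an evaluation measure.
    Returns the q-th percentile of the sample variances of the
    per-topic score differences over all m(m-1)/2 run pairs."""
    n_topics, m = scores.shape
    variances = [np.var(scores[:, i] - scores[:, j], ddof=1)
                 for i in range(m) for j in range(i + 1, m)]
    return float(np.percentile(variances, q))
```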
Variance estimation method 2
One‐way ANOVA‐based method [Sakai14CIKM]
Decompose the total variation of the topic‐by‐run score matrix into between‐system variation and within‐system variation; the corresponding mean squares estimate the population between‐system and within‐system variances. Let σ̂² be the within‐system estimate (probably an overestimate, as the within‐system variation also contains topic effects).
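The formulas lost from this slide in the export are presumably the standard one-way ANOVA sums of squares for an n-topic × m-run matrix x_ij (system i, topic j); taking σ̂² to be the within-system mean square V_E1 follows my reading of [Sakai14CIKM]:

```latex
% Standard one-way ANOVA decomposition; the \hat{\sigma}^2 assignment
% is my reading of [Sakai14CIKM], not recoverable from this export.
S_A = n \sum_{i=1}^{m} (\bar{x}_{i\bullet} - \bar{x})^2 , \qquad
S_{E1} = \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \bar{x}_{i\bullet})^2 ,
\qquad
\hat{\sigma}^2 = V_{E1} = \frac{S_{E1}}{m(n-1)} .
```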
Variance estimation method 3
Two‐way ANOVA‐based method [Okubo12]
Decompose the total variation into between‐system variation, between‐topic variation, and residual (within‐system) variation; the corresponding mean squares estimate the population between‐system, between‐topic, and within‐system variances. Let σ̂² be the resulting within‐system estimate (probably an overestimate).
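For completeness, the standard two-way-ANOVA-without-replication quantities this slide appears to build on; exactly how the slide assembles σ̂² from these mean squares is not recoverable from this export, so only the generic definitions are given:

```latex
% Two-way ANOVA without replication (standard definitions; S_A as on
% the previous slide).
S_B = m \sum_{j=1}^{n} (\bar{x}_{\bullet j} - \bar{x})^2 , \qquad
S_{E2} = \sum_{i=1}^{m} \sum_{j=1}^{n}
         (x_{ij} - \bar{x}_{i\bullet} - \bar{x}_{\bullet j} + \bar{x})^2 ,
\\
V_A = \frac{S_A}{m-1} , \qquad
V_B = \frac{S_B}{n-1} , \qquad
V_{E2} = \frac{S_{E2}}{(m-1)(n-1)} .
```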
Data for estimating σ̂²
We have a topic‐by‐run matrix for each data set and evaluation measure:

Data        #topics  runs  pd     #docs                     Task
TREC03new   50       78    125    528,155 news articles     adhoc news IR
TREC04new   49       78    100    ditto                     adhoc news IR
TREC11w     50       37    25     one billion web pages     adhoc web IR
TREC12w     50       28    20/30  ditto                     adhoc web IR
TREC11wD    50       25    25     ditto                     diversified web IR
TREC12wD    50       20    20/30  ditto                     diversified web IR

For each task, the two per‐data‐set variances are pooled.
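The pooling formula truncated in the export (“pooled using … etc.”) is presumably the usual topic-count-weighted pooled variance; for example, for adhoc news, n1 = 50 (TREC03new) and n2 = 49 (TREC04new):

```latex
\hat{\sigma}^2 =
  \frac{(n_1 - 1)\,\hat{\sigma}_1^2 + (n_2 - 1)\,\hat{\sigma}_2^2}
       {(n_1 - 1) + (n_2 - 1)} .
```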
Evaluation measures (md: measurement depth)
Adhoc news IR: AP, Q, nDCG, nERR (md=10, 1000)
Adhoc web IR: AP, Q, nDCG, nERR (md=10)
Diversified web IR: α‐nDCG, nERR‐IA, D‐nDCG, D#‐nDCG (md=10)
Comparison of the variance estimation methods – the two‐way ANOVA method is the most conservative
[Bar chart: variance estimates (0 to about 0.12) from the three methods (95th percentile, one‐way ANOVA, two‐way ANOVA) for AP, Q, nDCG and nERR on (a1) adhoc/news (md=1000), (a2) adhoc/news (md=10) and (b) adhoc/web (md=10), and for α‐nDCG, nERR‐IA, D‐nDCG and D#‐nDCG on (c) diversity/web (md=10).]
Variances obtained with the two‐way ANOVA‐based method
[Table of σ̂² values per measure and data set.] nERR is quite unstable for adhoc; nERR‐IA and α‐nDCG are quite unstable for diversity.
TALK OUTLINE 
1. How test collections have been constructed 
2. How test collections should be constructed 
3. Obtaining system variance estimates 
4. Topic set size design results 
5. Conclusions and future work
ANOVA‐based topic set sizes: (a2) adhoc/news (md=10), (α, β, minD) = (0.05, 0.20, 0.10)
[Line chart: required topic set size n (0–1000) versus number of systems m (0–100) for AP, Q, nDCG and nERR.]
nERR requires MANY topics; Q requires the fewest topics.
ANOVA‐based topic set sizes: (b) adhoc/web (md=10), (α, β, minD) = (0.05, 0.20, 0.10)
[Line chart: required topic set size n (0–1000) versus number of systems m (0–100) for AP, Q, nDCG and nERR.]
nERR and AP require MANY topics; Q requires the fewest topics.
ANOVA‐based topic set sizes: (c) diversity/web (md=10), (α, β, minD) = (0.05, 0.20, 0.10)
[Line chart: required topic set size n (0–1000) versus number of systems m (0–100) for α‐nDCG, nERR‐IA, D‐nDCG and D#‐nDCG.]
nERR‐IA and α‐nDCG require MANY topics; D‐nDCG requires the fewest topics.
CI‐based topic set sizes (α=0.05)
[Table of required topic set sizes per measure and CI width δ.]
For adhoc, nERR and AP require MANY topics while Q requires the fewest; for diversity, nERR‐IA and α‐nDCG require MANY topics while D‐nDCG requires the fewest.
Setting (α, β, minD, m) = (5%, 20%, c, 10) for ANOVA and (α, δ) = (5%, c) for CI, for any c
[Line chart: threshold c (0–0.5) versus required topic set size n (0–400), computed with σ̂² = 0.0690 (the variance for Q‐measure, adhoc/news, md=10); one curve for δ (α = 0.05) and one minD curve each for (α, β, m) = (0.05, 0.20, 2), (0.05, 0.20, 10) and (0.05, 0.20, 100).]
What if we reduce the pool depth pd?
(For adhoc/news, l=1000 (pd=100) only.)
[Figure: n=50 topics; the top pd=100 documents from each run form the pool for Topic 1, whose documents are judged highly relevant, partially relevant, or nonrelevant.]
pd vs #judged/topic vs variance
[Table: as pd decreases, the number of judged documents per topic decreases, and the variances increase.]
(a) Power‐based results with (α, β, minD, m) = (0.05, 0.20, 0.15, 10)
[Line chart: required topic set size n (0–180) versus average #judged documents/topic (0–800) for AP, Q, nDCG and nERR, with points at pd = 10, 30, 50, 70 and 100.]
Total cost for AP: 96 docs/topic × 100 topics = 9,600 docs at the shallow end (pd=10), versus 731 docs/topic × 74 topics = 54,094 docs at pd=100; that is, roughly 18% of the cost, the figure quoted in the takeaways.
(b) CI‐based results with (α, δ) = (0.05, 0.15)
[Line chart: required topic set size n (0–180) versus average #judged documents/topic (0–800) for AP, Q, nDCG and nERR, with points at pd = 10, 30, 50, 70 and 100.]
Total cost for AP: 96 docs/topic × 100 topics = 9,600 docs at pd=10, versus 731 docs/topic × 75 topics = 54,825 docs at pd=100; again roughly 18% of the cost.
TALK OUTLINE 
1. How test collections have been constructed 
2. How test collections should be constructed 
3. Obtaining system variance estimates 
4. Topic set size design results 
5. Conclusions and future work
One‐page takeaways
• The topic set size n for a new test collection can be determined systematically by
(a) ensuring high power (1‐β) whenever the between‐system difference (or the difference between the best and worst systems) is above a threshold; OR
(b) ensuring that the confidence interval for any pairwise system difference is below a threshold.
• The above methods require a variance estimate for a particular evaluation measure.
• Of the three variance estimation methods, the new Two‐Way ANOVA‐based method is the “safest” to use.
• The right balance between n and pd (pool depth) can reduce the assessment cost to (say) 18%.
Future work
• Apply score standardization [Webber08SIGIR] to the topic‐by‐run matrices first.
• Investigate the effect of the run spread in past data on estimating σ̂².
• Collect topic‐by‐run matrices from NTCIR task organisers to recommend the right number of topics n for their new test collections (or force them to use my topic set size design tools!).
• Investigate the relationship between topic set size design and reusability: from a set of statistically equivalent designs, choose the least costly one with “tolerable” reusability.
REFERENCES
[Nagata03] Nagata, Y.: How to Design the Sample Size (in Japanese). Asakura Shoten, 2003.
[Okubo12] Okubo, M. and Okada, K.: Psychological Statistics to Tell Your Story: Effect Size, Confidence Interval (in Japanese). Keiso Shobo, 2012.
[Sakai14SIGIRForum] Sakai, T.: Statistical Reform in Information Retrieval?, SIGIR Forum, 48(1), pp.3-12, 2014. http://sigir.org/files/forum/2014J/2014J_sigirforum_Article_TetsuyaSakai.pdf
[Sakai14FIT] Sakai, T.: Designing Test Collections That Provide Tight Confidence Intervals, Forum on Information Technology 2014, RD-003, 2014. http://www.slideshare.net/TetsuyaSakai/fit2014
[Sakai14CIKM] Sakai, T.: Designing Test Collections for Comparing Many Systems, Proceedings of ACM CIKM 2014, 2014. http://www.f.waseda.jp/tetsuya/CIKM2014/ir0030-sakai.pdf
[Webber08SIGIR] Webber, W., Moffat, A. and Zobel, J.: Score Standardization for Inter-Collection Comparison of Retrieval Systems, ACM SIGIR 2008, pp.51-58, 2008.
[Webber08CIKM] Webber, W., Moffat, A. and Zobel, J.: Statistical Power in Retrieval Experimentation, ACM CIKM 2008, pp.571-580, 2008.
