Stochastic Simulation of Test Collections: Evaluation Scores

Julián Urbano
Julián UrbanoAssistant Professor at TU Delft
SIGIR 2018 · July 11th · Ann ArborPicture by qiyang
MOTIVATION
Experiments in IR
3
Core research
“how well?”
input
IR systems
Evaluation research
“what if?”
output
test
collection
AP
P@10
conditions
input
no. topics
stat. signif. tests
output
AP
P@10
p-values
Kendall τ
conditions
Experiments in IR
4
Core research
“how well?”
input
IR systems
Evaluation research
“what if?”
output
test
collection
AP
P@10
conditions
input
no. topics
stat. signif. tests
output
AP
P@10
p-values
Kendall τ
conditions
Experiments in IR
5
Core research
“how well?”
input
IR systems
Evaluation research
“what if?”
output
test
collection
AP
P@10
conditions
input
no. topics
stat. signif. tests
output
AP
P@10
p-values
Kendall τ
conditions
Current Limitations
1.Finite data
2.Unknown Truth
3.Lack of Control
6
How we Make Do
Current Limitations
1.Finite data
2.Unknown Truth
3.Lack of Control
Artificially create other
collections by resampling
from the existing data
7
How we Make Do
Limited to dozens of
systems and topics from
past evaluations like TREC
s
t
?
?
Current Limitations
1.Finite data
2.Unknown Truth
3.Lack of Control
Split in two topic sets and
consider results with one
subset as the truth
8
How we Make Do
Don’t know true properties
of systems, such as mean or
variance over topics
s
t
𝑿
𝑿
?
𝑿
𝑿
?
Current Limitations
1.Finite data
2.Unknown Truth
3.Lack of Control
Artificial modifications of
effectiveness scores that
lead to invalid data
9
How we Make Do
Can’t control properties of
systems, such as true mean.
Systems are how they are
?
-1 10
-1 10
PROPOSAL
10
Stochastic Simulation
11
Core research
“how well?”
IR systems
Evaluation research
“what if?”
test
collection
AP
P@10
no. topics
stat. signif. tests
AP
P@10
p-values
Kendall τ
Model
AP
P@10
• Build a generative model
of the joint distribution of
system scores
• Simulate scores on new,
random topics
• Lack of data
• Unknown truth
• Lack of control
Stochastic Simulation
12
Core research
“how well?”
IR systems
Evaluation research
“what if?”
test
collection
AP
P@10
no. topics
stat. signif. tests
AP
P@10
p-values
Kendall τ
Model
• Build a generative model
of the joint distribution of
system scores
• Simulate scores on new,
random topics
• Lack of data
• Unknown truth
• Lack of control
• Fit the model to existing
data to make it realistic
• Needs to be flexible to
model real data
Stochastic Simulation
13
Core research
“how well?”
IR systems
Evaluation research
“what if?”
test
collection
AP
P@10
no. topics
stat. signif. tests
AP
P@10
p-values
Kendall τ
Model
Model
• …of the joint distribution of system scores
• We use copula models, which separate:
1.Marginal distributions, of individual systems
2.Dependence structure, among systems
• Easy to customize: plug and play simulate
14
• Fit the model:
Model
15
𝑌1, … , 𝑌𝑛
𝑋1,…,𝑋n
*generalizes
to several
systems
• Fit the model:
1. Fit the margins
Model
15
𝑌1, … , 𝑌𝑛
𝑋1,…,𝑋n
*generalizes
to several
systems
• Fit the model:
1. Fit the margins
2. Turn to pseudo-
observations
Model
15
𝑌1, … , 𝑌𝑛
𝑉𝑖 = 𝐹𝑌 𝑌𝑖
𝑋1,…,𝑋n
𝑈𝑖=𝐹𝑋𝑋𝑖
*generalizes
to several
systems
• Fit the model:
1. Fit the margins
2. Turn to pseudo-
observations
3. Fit the copula
Model
15
𝑌1, … , 𝑌𝑛
𝑉𝑖 = 𝐹𝑌 𝑌𝑖
𝑋1,…,𝑋n
𝑈𝑖=𝐹𝑋𝑋𝑖
*generalizes
to several
systems
• Fit the model:
1. Fit the margins
2. Turn to pseudo-
observations
3. Fit the copula
• Or instantiate at will
Model
16
*generalizes
to several
systems
• Fit the model:
1. Fit the margins
2. Turn to pseudo-
observations
3. Fit the copula
• Or instantiate at will
• Simulate from the
model:
1. Generate pseudo-
observations
Model
16
𝑉
𝑈
*generalizes
to several
systems
• Fit the model:
1. Fit the margins
2. Turn to pseudo-
observations
3. Fit the copula
• Or instantiate at will
• Simulate from the
model:
1. Generate pseudo-
observations
2. Turn into effectiveness
scores
Model
16
𝑌 = 𝐹𝑌
−1
𝑉
𝑉
𝑋=𝐹𝑋
−1
𝑈
𝑈
*generalizes
to several
systems
Modeling Dependences
• Gaussian copulas
– Only correlation
– Only symmetric
• R-Vine copulas
– Allows tail dependence
– Allows asymmetricity
– Built from pair-copulas
(bivariate)
– Eg: F(S1,S2), F(S4,S2|S1),
F(S4,S3|S1,S2), …
– ~40 alternatives based on 12
different families
17
Modeling Margins
• All effectiveness measures have discrete
distributions, but for some we can fairly
assume they’re continuous
–AP, nDCG
• For some others, this assumption is clearly
wrong, so we must preserve the support
–P@10: 1, 0.9, 0.8, …
–RR: 1, 1/2, 1/3, …
18
Modeling Margins
Continuous
• (Truncated) Normal
• Beta
• (Truncated) Normal Kernel
Smoothing
• Beta Kernel Smoothing
Discrete
• Beta-Binomial
• Discrete Kernel Smoothing
• Discrete Kernel Smoothing
w/ controlled smoothness
19
Transform to Predefined Mean
Problem: given 𝐹, transform to 𝐹 such that
𝝁 = 𝝁∗
and preserving the support
Solution: transform with a specific Beta
find 𝛼, 𝛽 > 1
such that 𝜇 = 𝜇∗
where 𝐹 𝑥 = 𝐹𝐵𝑒𝑡𝑎 𝐹 𝑥 ; 𝛼, 𝛽
20
RESULTS
21
Data
• TREC Web Ad hoc runs 2010-2014
– 50 topics and 30-88 systems each
– 12924 total system-topic pairs
• Continuous measures: AP, nDCG@20, ERR@20
• Discrete measures: P@10, P@20, RR
• Points of Interest
1. Margins
2. Copulas
3. Simulated scores
22
• 1572 system-measure pairs
• 5425 models successfully fitted
• Log-Likelihood:
• Kernel Smoothing (esp. discrete)
• Normal & Beta 25% of cases
• AIC and BIC:
• Normal & Beta 67% of cases
• Beta-Binomial 50% of P@k
• Transform all to the mean in the
given data and select again:
• Kernel Smoothing nearly always
1. Margins
23
• 39627 system pairs
• Fit pair-copulas and select
according to Log-Likelihood
• Wide diversity
• Gaussian copulas rarely
selected; correlation is not
enough
• Complex models are preferred
2. Dependence
24
• Simulate 1000 new topics and record deviations from the model
• 𝜇 − 𝑋 and 𝜎2 − 𝑠2
• Repeat 1000 times
• Full knowledge of truth encoded in the model
3. Simulation: Scores
25
• Web 2010, nDCG@20
• Simulate 500 new topics
• Dependence captured in the model
3. Simulation: Dependencies
26
SAMPLE APPLICATIONS
[Voorhees, 2009]
1. Type I Errors?
28
S1
S2
[Voorhees, 2009]
1. Type I Errors?
28
S1
S2
[Voorhees, 2009]
1. Type I Errors?
28
𝑫 𝟐 + p-value𝑫 𝟏 + p-value
conflict?
S1
S2
[Voorhees, 2009]
• Limited data
• Unknown truth (is H0 true?)
• No control over H0
• Cannot measure Type I error
rates directly
• Conflict rates at α=5%:
• AP: 2.8%
• P@10: 10.9%
1. Type I Errors?
28
𝑫 𝟐 + p-value𝑫 𝟏 + p-value
conflict?
S1
S2
[With simulation]
Same margins
1. Type I Errors
29
[With simulation]
Same margins
1. Type I Errors
29
[With simulation]
Same margins
1. Type I Errors
29
[With simulation]
Same margins
1. Type I Errors
29
[With simulation]
Same margins
• Type I errors at α=5% and 1%:
• AP: 4.9% and 0.9%
• P@10: 4.9% and 1%
1. Type I Errors
29
p-value
Type I
error?
[With simulation]
Same margins
• Type I errors at α=5% and 1%:
• AP: 4.9% and 0.9%
• P@10: 4.9% and 1%
Transformed margins
1. Type I Errors
30
[With simulation]
Same margins
• Type I errors at α=5% and 1%:
• AP: 4.9% and 0.9%
• P@10: 4.9% and 1%
Transformed margins
1. Type I Errors
30
[With simulation]
Same margins
• Type I errors at α=5% and 1%:
• AP: 4.9% and 0.9%
• P@10: 4.9% and 1%
Transformed margins
1. Type I Errors
30
[With simulation]
Same margins
• Type I errors at α=5% and 1%:
• AP: 4.9% and 0.9%
• P@10: 4.9% and 1%
Transformed margins
• Type I errors at α=5% and 1%:
• AP: 4.9% and 0.9%
• P@10: 5% and 1%
1. Type I Errors
30
p-value
Type I
error?
2. Sneak Peak: Statistical Power and σ
31
[Webber et al, 2008]
• Show empirical evidence of the problem of sequential testing
• Limited data
• Unknown truth (true σ)
TAKE HOME
Today
• Part of Evaluation Research has
data-related limitations
–Lack of data, no knowledge of truth, no control
–How valid are our results?
• We propose a methodology for stochastic
simulation to eliminate these limitations
–Flexible, realistic, highly customizable
–Allows us to study new problems, directly
33
Tomorrow
• Even more flexibility
• Simulate new systems for given topics
• Add third factors
–Fixed: already possible
–Random: we’ll see
• Simulate full runs (doc scores & relevance)
34
simIReff
• All results fully reproducible
• Developed a full R-package for simulation
https://github.com/julian-urbano/simIReff
effs <- effDiscFitAndSelect(data, support("p20"))
cop <- effcopFit(data, effs)
y <- reffcop(1000, cop)
35
1 of 50

Recommended

Statistical Significance Testing in Information Retrieval: An Empirical Analy... by
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Julián Urbano
758 views54 slides
Your PhD and You by
Your PhD and YouYour PhD and You
Your PhD and YouJulián Urbano
1.2K views96 slides
Statistical Analysis of Results in Music Information Retrieval: Why and How by
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowJulián Urbano
1.1K views227 slides
The Treatment of Ties in AP Correlation by
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationJulián Urbano
262 views1 slide
A Plan for Sustainable MIR Evaluation by
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationJulián Urbano
557 views30 slides
Crawling the Web for Structured Documents by
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured DocumentsJulián Urbano
477 views2 slides

More Related Content

More from Julián Urbano

MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres... by
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...Julián Urbano
286 views1 slide
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track by
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackJulián Urbano
251 views1 slide
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea... by
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...Julián Urbano
355 views1 slide
Evaluation in (Music) Information Retrieval through the Audio Music Similarit... by
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Julián Urbano
2K views132 slides
Symbolic Melodic Similarity (through Shape Similarity) by
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Julián Urbano
1.2K views52 slides
Evaluation in Audio Music Similarity by
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityJulián Urbano
1.1K views131 slides

More from Julián Urbano(17)

MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres... by Julián Urbano
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
Julián Urbano286 views
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track by Julián Urbano
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
Julián Urbano251 views
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea... by Julián Urbano
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
Julián Urbano355 views
Evaluation in (Music) Information Retrieval through the Audio Music Similarit... by Julián Urbano
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Julián Urbano2K views
Symbolic Melodic Similarity (through Shape Similarity) by Julián Urbano
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)
Julián Urbano1.2K views
Evaluation in Audio Music Similarity by Julián Urbano
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music Similarity
Julián Urbano1.1K views
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval by Julián Urbano
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Julián Urbano1.4K views
On the Measurement of Test Collection Reliability by Julián Urbano
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
Julián Urbano635 views
How Significant is Statistically Significant? The case of Audio Music Similar... by Julián Urbano
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...
Julián Urbano923 views
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and... by Julián Urbano
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Julián Urbano737 views
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo... by Julián Urbano
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
Julián Urbano640 views
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu... by Julián Urbano
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Julián Urbano1.1K views
Audio Music Similarity and Retrieval: Evaluation Power and Stability by Julián Urbano
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Julián Urbano646 views
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ... by Julián Urbano
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Julián Urbano451 views
Improving the Generation of Ground Truths based on Partially Ordered Lists by Julián Urbano
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered Lists
Julián Urbano460 views
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks by Julián Urbano
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksCrowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Julián Urbano597 views
Using the Shape of Music to Compute the similarity between Symbolic Musical P... by Julián Urbano
Using the Shape of Music to Compute the similarity between Symbolic Musical P...Using the Shape of Music to Compute the similarity between Symbolic Musical P...
Using the Shape of Music to Compute the similarity between Symbolic Musical P...
Julián Urbano478 views

Recently uploaded

ALGAL PRODUCTS.pptx by
ALGAL PRODUCTS.pptxALGAL PRODUCTS.pptx
ALGAL PRODUCTS.pptxRASHMI M G
7 views17 slides
Indian council for child welfare by
Indian council for child welfareIndian council for child welfare
Indian council for child welfareRenuWaghmare2
11 views21 slides
Best Hybrid Event Platform.pptx by
Best Hybrid Event Platform.pptxBest Hybrid Event Platform.pptx
Best Hybrid Event Platform.pptxHarriet Davis
11 views13 slides
Bacterial Reproduction.pdf by
Bacterial Reproduction.pdfBacterial Reproduction.pdf
Bacterial Reproduction.pdfNandadulalSannigrahi
37 views32 slides
XUE: Molecular Inventory in the Inner Region of an Extremely Irradiated Proto... by
XUE: Molecular Inventory in the Inner Region of an Extremely Irradiated Proto...XUE: Molecular Inventory in the Inner Region of an Extremely Irradiated Proto...
XUE: Molecular Inventory in the Inner Region of an Extremely Irradiated Proto...Sérgio Sacani
787 views14 slides
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F... by
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...SwagatBehera9
6 views36 slides

Recently uploaded(20)

Indian council for child welfare by RenuWaghmare2
Indian council for child welfareIndian council for child welfare
Indian council for child welfare
RenuWaghmare211 views
Best Hybrid Event Platform.pptx by Harriet Davis
Best Hybrid Event Platform.pptxBest Hybrid Event Platform.pptx
Best Hybrid Event Platform.pptx
Harriet Davis11 views
XUE: Molecular Inventory in the Inner Region of an Extremely Irradiated Proto... by Sérgio Sacani
XUE: Molecular Inventory in the Inner Region of an Extremely Irradiated Proto...XUE: Molecular Inventory in the Inner Region of an Extremely Irradiated Proto...
XUE: Molecular Inventory in the Inner Region of an Extremely Irradiated Proto...
Sérgio Sacani787 views
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F... by SwagatBehera9
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...
SwagatBehera96 views
Evaluation and Standardization of the Marketed Polyherbal drug Patanjali Divy... by Anmol Vishnu Gupta
Evaluation and Standardization of the Marketed Polyherbal drug Patanjali Divy...Evaluation and Standardization of the Marketed Polyherbal drug Patanjali Divy...
Evaluation and Standardization of the Marketed Polyherbal drug Patanjali Divy...
DNA manipulation Enzymes 2.pdf by NetHelix
DNA manipulation Enzymes 2.pdfDNA manipulation Enzymes 2.pdf
DNA manipulation Enzymes 2.pdf
NetHelix6 views
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor... by Trustlife
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...
Trustlife184 views
INTRODUCTION TO PLANT SYSTEMATICS.pptx by RASHMI M G
INTRODUCTION TO PLANT SYSTEMATICS.pptxINTRODUCTION TO PLANT SYSTEMATICS.pptx
INTRODUCTION TO PLANT SYSTEMATICS.pptx
RASHMI M G 5 views
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe... by Anmol Vishnu Gupta
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...
ELECTRON TRANSPORT CHAIN by DEEKSHA RANI
ELECTRON TRANSPORT CHAINELECTRON TRANSPORT CHAIN
ELECTRON TRANSPORT CHAIN
DEEKSHA RANI18 views
A giant thin stellar stream in the Coma Galaxy Cluster by Sérgio Sacani
A giant thin stellar stream in the Coma Galaxy ClusterA giant thin stellar stream in the Coma Galaxy Cluster
A giant thin stellar stream in the Coma Galaxy Cluster
Sérgio Sacani23 views
Generative AI to Accelerate Discovery of Materials by Deakin University
Generative AI to Accelerate Discovery of MaterialsGenerative AI to Accelerate Discovery of Materials
Generative AI to Accelerate Discovery of Materials
Note on the Riemann Hypothesis by vegafrank2
Note on the Riemann HypothesisNote on the Riemann Hypothesis
Note on the Riemann Hypothesis
vegafrank29 views
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ... by ILRI
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...
ILRI10 views
2. Natural Sciences and Technology Author Siyavula.pdf by ssuser821efa
2. Natural Sciences and Technology Author Siyavula.pdf2. Natural Sciences and Technology Author Siyavula.pdf
2. Natural Sciences and Technology Author Siyavula.pdf
ssuser821efa13 views

Stochastic Simulation of Test Collections: Evaluation Scores

  • 1. SIGIR 2018 · July 11th · Ann ArborPicture by qiyang
  • 3. Experiments in IR 3 Core research “how well?” input IR systems Evaluation research “what if?” output test collection AP P@10 conditions input no. topics stat. signif. tests output AP P@10 p-values Kendall τ conditions
  • 4. Experiments in IR 4 Core research “how well?” input IR systems Evaluation research “what if?” output test collection AP P@10 conditions input no. topics stat. signif. tests output AP P@10 p-values Kendall τ conditions
  • 5. Experiments in IR 5 Core research “how well?” input IR systems Evaluation research “what if?” output test collection AP P@10 conditions input no. topics stat. signif. tests output AP P@10 p-values Kendall τ conditions
  • 6. Current Limitations 1.Finite data 2.Unknown Truth 3.Lack of Control 6 How we Make Do
  • 7. Current Limitations 1.Finite data 2.Unknown Truth 3.Lack of Control Artificially create other collections by resampling from the existing data 7 How we Make Do Limited to dozens of systems and topics from past evaluations like TREC s t ? ?
  • 8. Current Limitations 1.Finite data 2.Unknown Truth 3.Lack of Control Split in two topic sets and consider results with one subset as the truth 8 How we Make Do Don’t know true properties of systems, such as mean or variance over topics s t 𝑿 𝑿 ? 𝑿 𝑿 ?
  • 9. Current Limitations 1.Finite data 2.Unknown Truth 3.Lack of Control Artificial modifications of effectiveness scores that lead to invalid data 9 How we Make Do Can’t control properties of systems, such as true mean. Systems are how they are ? -1 10 -1 10
  • 11. Stochastic Simulation 11 Core research “how well?” IR systems Evaluation research “what if?” test collection AP P@10 no. topics stat. signif. tests AP P@10 p-values Kendall τ Model AP P@10
  • 12. • Build a generative model of the joint distribution of system scores • Simulate scores on new, random topics • Lack of data • Unknown truth • Lack of control Stochastic Simulation 12 Core research “how well?” IR systems Evaluation research “what if?” test collection AP P@10 no. topics stat. signif. tests AP P@10 p-values Kendall τ Model
  • 13. • Build a generative model of the joint distribution of system scores • Simulate scores on new, random topics • Lack of data • Unknown truth • Lack of control • Fit the model to existing data to make it realistic • Needs to be flexible to model real data Stochastic Simulation 13 Core research “how well?” IR systems Evaluation research “what if?” test collection AP P@10 no. topics stat. signif. tests AP P@10 p-values Kendall τ Model
  • 14. Model • …of the joint distribution of system scores • We use copula models, which separate: 1.Marginal distributions, of individual systems 2.Dependence structure, among systems • Easy to customize: plug and play simulate 14
  • 15. • Fit the model: Model 15 𝑌1, … , 𝑌𝑛 𝑋1,…,𝑋n *generalizes to several systems
  • 16. • Fit the model: 1. Fit the margins Model 15 𝑌1, … , 𝑌𝑛 𝑋1,…,𝑋n *generalizes to several systems
  • 17. • Fit the model: 1. Fit the margins 2. Turn to pseudo- observations Model 15 𝑌1, … , 𝑌𝑛 𝑉𝑖 = 𝐹𝑌 𝑌𝑖 𝑋1,…,𝑋n 𝑈𝑖=𝐹𝑋𝑋𝑖 *generalizes to several systems
  • 18. • Fit the model: 1. Fit the margins 2. Turn to pseudo- observations 3. Fit the copula Model 15 𝑌1, … , 𝑌𝑛 𝑉𝑖 = 𝐹𝑌 𝑌𝑖 𝑋1,…,𝑋n 𝑈𝑖=𝐹𝑋𝑋𝑖 *generalizes to several systems
  • 19. • Fit the model: 1. Fit the margins 2. Turn to pseudo- observations 3. Fit the copula • Or instantiate at will Model 16 *generalizes to several systems
  • 20. • Fit the model: 1. Fit the margins 2. Turn to pseudo- observations 3. Fit the copula • Or instantiate at will • Simulate from the model: 1. Generate pseudo- observations Model 16 𝑉 𝑈 *generalizes to several systems
  • 21. • Fit the model: 1. Fit the margins 2. Turn to pseudo- observations 3. Fit the copula • Or instantiate at will • Simulate from the model: 1. Generate pseudo- observations 2. Turn into effectiveness scores Model 16 𝑌 = 𝐹𝑌 −1 𝑉 𝑉 𝑋=𝐹𝑋 −1 𝑈 𝑈 *generalizes to several systems
  • 22. Modeling Dependences • Gaussian copulas – Only correlation – Only symmetric • R-Vine copulas – Allows tail dependence – Allows asymmetricity – Built from pair-copulas (bivariate) – Eg: F(S1,S2), F(S4,S2|S1), F(S4,S3|S1,S2), … – ~40 alternatives based on 12 different families 17
  • 23. Modeling Margins • All effectiveness measures have discrete distributions, but for some we can fairly assume they’re continuous –AP, nDCG • For some others, this assumption is clearly wrong, so we must preserve the support –P@10: 1, 0.9, 0.8, … –RR: 1, 1/2, 1/3, … 18
  • 24. Modeling Margins Continuous • (Truncated) Normal • Beta • (Truncated) Normal Kernel Smoothing • Beta Kernel Smoothing Discrete • Beta-Binomial • Discrete Kernel Smoothing • Discrete Kernel Smoothing w/ controlled smoothness 19
  • 25. Transform to Predefined Mean Problem: given 𝐹, transform to 𝐹 such that 𝝁 = 𝝁∗ and preserving the support Solution: transform with a specific Beta find 𝛼, 𝛽 > 1 such that 𝜇 = 𝜇∗ where 𝐹 𝑥 = 𝐹𝐵𝑒𝑡𝑎 𝐹 𝑥 ; 𝛼, 𝛽 20
  • 27. Data • TREC Web Ad hoc runs 2010-2014 – 50 topics and 30-88 systems each – 12924 total system-topic pairs • Continuous measures: AP, nDCG@20, ERR@20 • Discrete measures: P@10, P@20, RR • Points of Interest 1. Margins 2. Copulas 3. Simulated scores 22
  • 28. • 1572 system-measure pairs • 5425 models successfully fitted • Log-Likelihood: • Kernel Smoothing (esp. discrete) • Normal & Beta 25% of cases • AIC and BIC: • Normal & Beta 67% of cases • Beta-Binomial 50% of P@k • Transform all to the mean in the given data and select again: • Kernel Smoothing nearly always 1. Margins 23
  • 29. • 39627 system pairs • Fit pair-copulas and select according to Log-Likelihood • Wide diversity • Gaussian copulas rarely selected; correlation is not enough • Complex models are preferred 2. Dependence 24
  • 30. • Simulate 1000 new topics and record deviations from the model • 𝜇 − 𝑋 and 𝜎2 − 𝑠2 • Repeat 1000 times • Full knowledge of truth encoded in the model 3. Simulation: Scores 25
  • 31. • Web 2010, nDCG@20 • Simulate 500 new topics • Dependence captured in the model 3. Simulation: Dependencies 26
  • 33. [Voorhees, 2009] 1. Type I Errors? 28 S1 S2
  • 34. [Voorhees, 2009] 1. Type I Errors? 28 S1 S2
  • 35. [Voorhees, 2009] 1. Type I Errors? 28 𝑫 𝟐 + p-value𝑫 𝟏 + p-value conflict? S1 S2
  • 36. [Voorhees, 2009] • Limited data • Unknown truth (is H0 true?) • No control over H0 • Cannot measure Type I error rates directly • Conflict rates at α=5%: • AP: 2.8% • P@10: 10.9% 1. Type I Errors? 28 𝑫 𝟐 + p-value𝑫 𝟏 + p-value conflict? S1 S2
  • 41. [With simulation] Same margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 4.9% and 1% 1. Type I Errors 29 p-value Type I error?
  • 42. [With simulation] Same margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 4.9% and 1% Transformed margins 1. Type I Errors 30
  • 43. [With simulation] Same margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 4.9% and 1% Transformed margins 1. Type I Errors 30
  • 44. [With simulation] Same margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 4.9% and 1% Transformed margins 1. Type I Errors 30
  • 45. [With simulation] Same margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 4.9% and 1% Transformed margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 5% and 1% 1. Type I Errors 30 p-value Type I error?
  • 46. 2. Sneak Peak: Statistical Power and σ 31 [Webber et al, 2008] • Show empirical evidence of the problem of sequential testing • Limited data • Unknown truth (true σ)
  • 48. Today • Part of Evaluation Research has data-related limitations –Lack of data, no knowledge of truth, no control –How valid are our results? • We propose a methodology for stochastic simulation to eliminate these limitations –Flexible, realistic, highly customizable –Allows us to study new problems, directly 33
  • 49. Tomorrow • Even more flexibility • Simulate new systems for given topics • Add third factors –Fixed: already possible –Random: we’ll see • Simulate full runs (doc scores & relevance) 34
  • 50. simIReff • All results fully reproducible • Developed a full R-package for simulation https://github.com/julian-urbano/simIReff effs <- effDiscFitAndSelect(data, support("p20")) cop <- effcopFit(data, effs) y <- reffcop(1000, cop) 35