Stochastic Simulation of Test Collections: Evaluation Scores
SIGIR 2018 · July 11th · Ann Arbor
MOTIVATION
Experiments in IR
[Diagram] Core research ("how well?"): input = IR systems + test collection + conditions; output = AP, P@10.
Evaluation research ("what if?"): input = no. topics, stat. signif. tests, conditions; output = AP, P@10, p-values, Kendall τ.
Current Limitations & How we Make Do
1. Finite data: limited to dozens of systems and topics from past evaluations like TREC. Make do: artificially create other collections by resampling from the existing data.
2. Unknown truth: we don't know the true properties of systems, such as their mean or variance over topics. Make do: split into two topic sets and consider the results with one subset as the truth.
3. Lack of control: we can't control properties of systems, such as the true mean; systems are how they are. Make do: artificial modifications of effectiveness scores, which lead to invalid data.
PROPOSAL
Stochastic Simulation
[Diagram] In the evaluation-research pipeline, the test collection is replaced by a Model that generates effectiveness scores (AP, P@10).
• Build a generative model of the joint distribution of system scores
• Simulate scores on new, random topics
• Tackles the three limitations: lack of data, unknown truth, lack of control
• Fit the model to existing data to make it realistic
• Needs to be flexible to model real data
Model
• A generative model of the joint distribution of system scores
• We use copula models, which separate:
  1. Marginal distributions, of individual systems
  2. Dependence structure, among systems
• Easy to customize: plug-and-play simulation (see the toy sketch below)
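As a toy illustration of that separation, here is a base-R sketch (not the paper's code): one Gaussian dependence structure is combined with two different, hypothetical margins, so margins and copula can be swapped independently.

# Toy sketch of the copula idea: one dependence structure (a Gaussian copula
# with correlation 0.8) plugged into two different margins.
set.seed(1)
z <- MASS::mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, .8, .8, 1), 2))
u <- pnorm(z)                        # copula sample: dependence only, uniform margins
ap  <- qbeta(u[, 1], 2, 8)           # continuous Beta margin (AP-like scores)
p10 <- qbinom(u[, 2], 10, 0.3) / 10  # discrete margin on {0, 0.1, ..., 1} (P@10-like)
cor(ap, p10, method = "kendall")     # the dependence carries over to both margins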
Model (generalizes to several systems; shown here for two systems with scores $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$)
• Fit the model:
  1. Fit the margins
  2. Turn scores into pseudo-observations: $U_i = F_X(X_i)$, $V_i = F_Y(Y_i)$
  3. Fit the copula to the pseudo-observations
• Or instantiate the model at will
• Simulate from the model:
  1. Generate pseudo-observations $U$, $V$ from the copula
  2. Turn them into effectiveness scores: $X = F_X^{-1}(U)$, $Y = F_Y^{-1}(V)$
(A base-R sketch of this loop follows below.)
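A minimal base-R sketch of this fit-and-simulate loop for two systems, using empirical margins and a Gaussian copula as stand-ins for the richer models described next; this is not the paper's implementation, and x and y are hypothetical per-topic scores.

set.seed(2)
x <- rbeta(50, 2, 8); y <- rbeta(50, 3, 7)   # hypothetical scores of two systems

# 1. Fit the margins (here: empirical) and 2. turn scores into pseudo-observations
u <- rank(x) / (length(x) + 1)
v <- rank(y) / (length(y) + 1)

# 3. Fit the copula (here: a Gaussian copula estimated from the normal scores)
rho <- cor(qnorm(u), qnorm(v))

# Simulate: generate new pseudo-observations from the copula...
z <- MASS::mvrnorm(1000, c(0, 0), matrix(c(1, rho, rho, 1), 2))
u_new <- pnorm(z[, 1]); v_new <- pnorm(z[, 2])

# ...and turn them into effectiveness scores with the inverse margins
x_new <- quantile(x, u_new, type = 8)        # F_X^{-1}(U)
y_new <- quantile(y, v_new, type = 8)        # F_Y^{-1}(V)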
Modeling Dependences
• Gaussian copulas
  – Only correlation
  – Only symmetric
• R-Vine copulas
  – Allow tail dependence
  – Allow asymmetry
  – Built from (bivariate) pair-copulas
  – E.g., F(S1,S2), F(S4,S2|S1), F(S4,S3|S1,S2), …
  – ~40 alternatives based on 12 different families (one way to select among them is sketched below)
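One way to experiment with this selection is sketched below using the VineCopula R package; this is my assumption about suitable tooling, not the paper's code (the paper's own package, simIReff, appears at the end). BiCopSelect picks a bivariate family for two systems' pseudo-observations and BiCopSim draws new ones.

library(VineCopula)

set.seed(3)
# Toy pseudo-observations for two systems (Gaussian dependence, correlation 0.7)
u <- pnorm(MASS::mvrnorm(50, c(0, 0), matrix(c(1, .7, .7, 1), 2)))

fit <- BiCopSelect(u[, 1], u[, 2])                    # select among the available bivariate families
fit                                                   # chosen family and parameters
sim <- BiCopSim(500, fit$family, fit$par, fit$par2)   # simulate from the selected pair-copula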
Modeling Margins
• All effectiveness measures have discrete distributions, but for some we can fairly assume they're continuous
  – AP, nDCG
• For some others, this assumption is clearly wrong, so we must preserve the support
  – P@10: 1, 0.9, 0.8, …
  – RR: 1, 1/2, 1/3, …
Modeling Margins
Continuous:
• (Truncated) Normal
• Beta
• (Truncated) Normal Kernel Smoothing
• Beta Kernel Smoothing
Discrete:
• Beta-Binomial
• Discrete Kernel Smoothing
• Discrete Kernel Smoothing w/ controlled smoothness
(A small margin-fitting sketch follows below.)
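For intuition, here is a minimal sketch of fitting one continuous margin by the method of moments in base R; this is not the selection procedure used in the paper, and 'ap' is a hypothetical vector of per-topic AP scores.

set.seed(4)
ap <- rbeta(50, 2, 8)                 # stand-in for 50 per-topic AP scores

m <- mean(ap); v <- var(ap)
alpha <- m * (m * (1 - m) / v - 1)    # method-of-moments Beta estimates
beta  <- (1 - m) * (m * (1 - m) / v - 1)

F_X    <- function(x) pbeta(x, alpha, beta)   # fitted marginal CDF
F_Xinv <- function(u) qbeta(u, alpha, beta)   # and its inverse

# For a discrete measure like P@10, the support {0, 0.1, ..., 1} must be
# preserved instead, e.g. by fitting a Beta-Binomial to the counts 10 * p10.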
Transform to Predefined Mean
Problem: given $F$, transform it to $\tilde{F}$ such that $\tilde{\mu} = \mu^*$, while preserving the support.
Solution: transform with a specific Beta distribution: find $\alpha, \beta > 1$ such that $\tilde{\mu} = \mu^*$, where $\tilde{F}(x) = F_{\mathrm{Beta}}(F(x); \alpha, \beta)$.
(A numeric sketch follows below.)
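A numeric sketch of this transform, under the simplifying assumption that β is fixed and only α is solved for (the paper may choose α and β differently); Fx is the original marginal CDF on [0, 1] and mu_star is the target mean.

transform_mean <- function(Fx, mu_star, beta = 2) {
  # mean of a [0,1] variable with CDF G is the integral of 1 - G(x) over [0,1]
  gap <- function(alpha)
    integrate(function(x) 1 - pbeta(Fx(x), alpha, beta), 0, 1)$value - mu_star
  alpha <- uniroot(gap, interval = c(1 + 1e-6, 100))$root
  function(x) pbeta(Fx(x), alpha, beta)   # the transformed CDF
}

# Example: push a Beta(2, 8) margin (mean 0.2) towards a target mean of 0.3
F_tilde <- transform_mean(function(x) pbeta(x, 2, 8), mu_star = 0.3)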
RESULTS
Data
• TREC Web Ad hoc runs 2010–2014
  – 50 topics and 30–88 systems each
  – 12,924 total system-topic pairs
• Continuous measures: AP, nDCG@20, ERR@20
• Discrete measures: P@10, P@20, RR
• Points of Interest
  1. Margins
  2. Copulas
  3. Simulated scores
1. Margins
• 1,572 system-measure pairs; 5,425 models successfully fitted
• By Log-Likelihood:
  – Kernel Smoothing (esp. discrete)
  – Normal & Beta in 25% of cases
• By AIC and BIC:
  – Normal & Beta in 67% of cases
  – Beta-Binomial in 50% of P@k cases
• Transform all to the mean in the given data and select again:
  – Kernel Smoothing nearly always
2. Dependence
• 39,627 system pairs
• Fit pair-copulas and select according to Log-Likelihood
• Wide diversity
• Gaussian copulas rarely selected: correlation is not enough
• Complex models are preferred
3. Simulation: Scores
• Simulate 1,000 new topics and record deviations from the model: $\mu - \bar{X}$ and $\sigma^2 - s^2$
• Repeat 1,000 times
• Full knowledge of the truth is encoded in the model
(A small sketch of this check follows below.)
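A small sketch of this check for a single system, assuming the model's true mean and variance are known; a Beta(2, 8) margin is used here as a stand-in for a fitted model.

deviation <- function(sim, mu, sigma2)
  c(mean_dev = mu - mean(sim), var_dev = sigma2 - var(sim))

set.seed(5)
# 1,000 simulated topics per run, repeated 1,000 times
devs <- replicate(1000, deviation(rbeta(1000, 2, 8),
                                  mu = 0.2, sigma2 = 16 / 1100))  # Beta(2, 8) moments
rowMeans(devs)   # average deviations should be close to zero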
3. Simulation: Dependencies
• Web 2010, nDCG@20
• Simulate 500 new topics
• Dependence is captured in the model
SAMPLE APPLICATIONS
1. Type I Errors? [Voorhees, 2009]
[Diagram] Systems S1 and S2 are compared on two disjoint topic sets, D1 and D2, each comparison yielding a p-value: do the two p-values conflict?
• Limited data
• Unknown truth (is H0 true?)
• No control over H0
• Cannot measure Type I error rates directly
• Conflict rates at α=5%:
  – AP: 2.8%
  – P@10: 10.9%
(A sketch of this conflict check follows below.)
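A rough sketch of such a split-half conflict check; this is not Voorhees' exact procedure, and the definition of a conflict used here (both halves significant but with opposite signs) is my assumption. x and y are hypothetical per-topic scores of the two systems over the same 50 topics.

conflict <- function(x, y, alpha = 0.05) {
  half <- sample(length(x), length(x) / 2)          # random split into D1 and D2
  t1 <- t.test(x[half],  y[half],  paired = TRUE)
  t2 <- t.test(x[-half], y[-half], paired = TRUE)
  t1$p.value < alpha && t2$p.value < alpha &&
    sign(t1$estimate) != sign(t2$estimate)          # significant but in opposite directions
}

set.seed(6)
x <- rbeta(50, 2, 8)
y <- pmax(0, pmin(1, x + rnorm(50, 0.01, 0.05)))    # a slightly better, correlated system
mean(replicate(1000, conflict(x, y)))               # estimated conflict rate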
1. Type I Errors [With simulation]
• Same margins: simulate both systems from the same marginal distribution, so H0 (equal means) is true by construction and every rejection is a Type I error.
  – Type I errors at α=5% and 1%: AP 4.9% and 0.9%; P@10 4.9% and 1%
• Transformed margins (different margins transformed to the same mean):
  – Type I errors at α=5% and 1%: AP 4.9% and 0.9%; P@10 5% and 1%
(A sketch of the same-margins experiment follows below.)
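A minimal sketch of the same-margins experiment, not the paper's exact setup: two systems share an identical, hypothetical Beta margin and are coupled by a Gaussian copula, so H0 of the paired t-test is true and the rejection rate estimates the Type I error rate.

set.seed(7)
type1 <- replicate(10000, {
  z <- MASS::mvrnorm(50, mu = c(0, 0),
                     Sigma = matrix(c(1, .8, .8, 1), 2))   # dependence between systems
  u <- pnorm(z)                      # pseudo-observations from the copula
  x <- qbeta(u[, 1], 2, 8)           # identical margins for both systems
  y <- qbeta(u[, 2], 2, 8)
  t.test(x, y, paired = TRUE)$p.value < 0.05
})
mean(type1)                          # should be close to the nominal 5%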
2. Sneak Peek: Statistical Power and σ [Webber et al., 2008]
• Show empirical evidence of the problem of sequential testing
• Limited data
• Unknown truth (the true σ)
TAKE HOME
Today
• Part of Evaluation Research has data-related limitations
  – Lack of data, no knowledge of truth, no control
  – How valid are our results?
• We propose a methodology for stochastic simulation to eliminate these limitations
  – Flexible, realistic, highly customizable
  – Allows us to study new problems, directly
Tomorrow
• Even more flexibility
• Simulate new systems for given topics
• Add third factors
  – Fixed: already possible
  – Random: we'll see
• Simulate full runs (doc scores & relevance)
simIReff
• All results fully reproducible
• Developed a full R package for simulation
https://github.com/julian-urbano/simIReff
library(simIReff)  # load the simIReff package (see the repository above)

# Fit discrete margins to the P@20 scores in 'data' and select the best fits
effs <- effDiscFitAndSelect(data, support("p20"))
# Fit a copula model coupling the selected margins
cop <- effcopFit(data, effs)
# Simulate P@20 scores for 1,000 new random topics
y <- reffcop(1000, cop)
