Stochastic Simulation of Test Collections: Evaluation Scores
SIGIR 2018 · July 11th · Ann Arbor
MOTIVATION
Experiments in IR
[Diagram] Core research ("how well?"): input = IR systems + test collection + conditions; output = AP, P@10.
Evaluation research ("what if?"): input = no. topics, stat. signif. tests, conditions; output = AP, P@10, p-values, Kendall τ.
Current Limitations & How we Make Do
1. Finite data: limited to dozens of systems and topics from past evaluations like TREC. Make do: artificially create other collections by resampling from the existing data.
2. Unknown truth: we don't know the true properties of systems, such as their mean or variance over topics. Make do: split into two topic sets and consider the results with one subset as the truth.
3. Lack of control: we can't control properties of systems, such as the true mean; systems are how they are. Make do: artificial modifications of effectiveness scores, which lead to invalid data.
PROPOSAL
Stochastic Simulation
[Diagram] In the evaluation-research pipeline, the test collection is replaced by a Model that generates effectiveness scores (AP, P@10).
• Build a generative model of the joint distribution of system scores
• Simulate scores on new, random topics
• Tackles the three limitations: lack of data, unknown truth, lack of control
• Fit the model to existing data to make it realistic
• Needs to be flexible to model real data
Model
• A generative model of the joint distribution of system scores
• We use copula models, which separate:
  1. Marginal distributions, of individual systems
  2. Dependence structure, among systems
• Easy to customize: plug-and-play simulation (see the toy sketch below)
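As a toy illustration of that separation, here is a base-R sketch (not the paper's code): one Gaussian dependence structure is combined with two different, hypothetical margins, so margins and copula can be swapped independently.

# Toy sketch of the copula idea: one dependence structure (a Gaussian copula
# with correlation 0.8) plugged into two different margins.
set.seed(1)
z <- MASS::mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, .8, .8, 1), 2))
u <- pnorm(z)                        # copula sample: dependence only, uniform margins
ap  <- qbeta(u[, 1], 2, 8)           # continuous Beta margin (AP-like scores)
p10 <- qbinom(u[, 2], 10, 0.3) / 10  # discrete margin on {0, 0.1, ..., 1} (P@10-like)
cor(ap, p10, method = "kendall")     # the dependence carries over to both margins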
Model (generalizes to several systems; shown here for two systems with scores $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$)
• Fit the model:
  1. Fit the margins
  2. Turn scores into pseudo-observations: $U_i = F_X(X_i)$, $V_i = F_Y(Y_i)$
  3. Fit the copula to the pseudo-observations
• Or instantiate the model at will
• Simulate from the model:
  1. Generate pseudo-observations $U$, $V$ from the copula
  2. Turn them into effectiveness scores: $X = F_X^{-1}(U)$, $Y = F_Y^{-1}(V)$
(A base-R sketch of this loop follows below.)
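A minimal base-R sketch of this fit-and-simulate loop for two systems, using empirical margins and a Gaussian copula as stand-ins for the richer models described next; this is not the paper's implementation, and x and y are hypothetical per-topic scores.

set.seed(2)
x <- rbeta(50, 2, 8); y <- rbeta(50, 3, 7)   # hypothetical scores of two systems

# 1. Fit the margins (here: empirical) and 2. turn scores into pseudo-observations
u <- rank(x) / (length(x) + 1)
v <- rank(y) / (length(y) + 1)

# 3. Fit the copula (here: a Gaussian copula estimated from the normal scores)
rho <- cor(qnorm(u), qnorm(v))

# Simulate: generate new pseudo-observations from the copula...
z <- MASS::mvrnorm(1000, c(0, 0), matrix(c(1, rho, rho, 1), 2))
u_new <- pnorm(z[, 1]); v_new <- pnorm(z[, 2])

# ...and turn them into effectiveness scores with the inverse margins
x_new <- quantile(x, u_new, type = 8)        # F_X^{-1}(U)
y_new <- quantile(y, v_new, type = 8)        # F_Y^{-1}(V)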
Modeling Dependences
• Gaussian copulas
  – Only correlation
  – Only symmetric
• R-Vine copulas
  – Allow tail dependence
  – Allow asymmetry
  – Built from (bivariate) pair-copulas
  – E.g., F(S1,S2), F(S4,S2|S1), F(S4,S3|S1,S2), …
  – ~40 alternatives based on 12 different families (one way to select among them is sketched below)
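One way to experiment with this selection is sketched below using the VineCopula R package; this is my assumption about suitable tooling, not the paper's code (the paper's own package, simIReff, appears at the end). BiCopSelect picks a bivariate family for two systems' pseudo-observations and BiCopSim draws new ones.

library(VineCopula)

set.seed(3)
# Toy pseudo-observations for two systems (Gaussian dependence, correlation 0.7)
u <- pnorm(MASS::mvrnorm(50, c(0, 0), matrix(c(1, .7, .7, 1), 2)))

fit <- BiCopSelect(u[, 1], u[, 2])                    # select among the available bivariate families
fit                                                   # chosen family and parameters
sim <- BiCopSim(500, fit$family, fit$par, fit$par2)   # simulate from the selected pair-copula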
Modeling Margins
• All effectiveness measures have discrete distributions, but for some we can fairly assume they're continuous
  – AP, nDCG
• For some others, this assumption is clearly wrong, so we must preserve the support
  – P@10: 1, 0.9, 0.8, …
  – RR: 1, 1/2, 1/3, …
Modeling Margins
Continuous:
• (Truncated) Normal
• Beta
• (Truncated) Normal Kernel Smoothing
• Beta Kernel Smoothing
Discrete:
• Beta-Binomial
• Discrete Kernel Smoothing
• Discrete Kernel Smoothing w/ controlled smoothness
(A small margin-fitting sketch follows below.)
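For intuition, here is a minimal sketch of fitting one continuous margin by the method of moments in base R; this is not the selection procedure used in the paper, and 'ap' is a hypothetical vector of per-topic AP scores.

set.seed(4)
ap <- rbeta(50, 2, 8)                 # stand-in for 50 per-topic AP scores

m <- mean(ap); v <- var(ap)
alpha <- m * (m * (1 - m) / v - 1)    # method-of-moments Beta estimates
beta  <- (1 - m) * (m * (1 - m) / v - 1)

F_X    <- function(x) pbeta(x, alpha, beta)   # fitted marginal CDF
F_Xinv <- function(u) qbeta(u, alpha, beta)   # and its inverse

# For a discrete measure like P@10, the support {0, 0.1, ..., 1} must be
# preserved instead, e.g. by fitting a Beta-Binomial to the counts 10 * p10.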
Transform to Predefined Mean
Problem: given $F$, transform it to $\tilde{F}$ such that $\tilde{\mu} = \mu^*$, while preserving the support.
Solution: transform with a specific Beta distribution: find $\alpha, \beta > 1$ such that $\tilde{\mu} = \mu^*$, where $\tilde{F}(x) = F_{\mathrm{Beta}}(F(x); \alpha, \beta)$.
(A numeric sketch follows below.)
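A numeric sketch of this transform, under the simplifying assumption that β is fixed and only α is solved for (the paper may choose α and β differently); Fx is the original marginal CDF on [0, 1] and mu_star is the target mean.

transform_mean <- function(Fx, mu_star, beta = 2) {
  # mean of a [0,1] variable with CDF G is the integral of 1 - G(x) over [0,1]
  gap <- function(alpha)
    integrate(function(x) 1 - pbeta(Fx(x), alpha, beta), 0, 1)$value - mu_star
  alpha <- uniroot(gap, interval = c(1 + 1e-6, 100))$root
  function(x) pbeta(Fx(x), alpha, beta)   # the transformed CDF
}

# Example: push a Beta(2, 8) margin (mean 0.2) towards a target mean of 0.3
F_tilde <- transform_mean(function(x) pbeta(x, 2, 8), mu_star = 0.3)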
RESULTS
Data
• TREC Web Ad hoc runs 2010–2014
  – 50 topics and 30–88 systems each
  – 12,924 total system-topic pairs
• Continuous measures: AP, nDCG@20, ERR@20
• Discrete measures: P@10, P@20, RR
• Points of Interest
  1. Margins
  2. Copulas
  3. Simulated scores
1. Margins
• 1,572 system-measure pairs; 5,425 models successfully fitted
• By Log-Likelihood:
  – Kernel Smoothing (esp. discrete)
  – Normal & Beta in 25% of cases
• By AIC and BIC:
  – Normal & Beta in 67% of cases
  – Beta-Binomial in 50% of P@k cases
• Transform all to the mean in the given data and select again:
  – Kernel Smoothing nearly always
2. Dependence
• 39,627 system pairs
• Fit pair-copulas and select according to Log-Likelihood
• Wide diversity
• Gaussian copulas rarely selected: correlation is not enough
• Complex models are preferred
3. Simulation: Scores
• Simulate 1,000 new topics and record deviations from the model: $\mu - \bar{X}$ and $\sigma^2 - s^2$
• Repeat 1,000 times
• Full knowledge of the truth is encoded in the model
(A small sketch of this check follows below.)
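A small sketch of this check for a single system, assuming the model's true mean and variance are known; a Beta(2, 8) margin is used here as a stand-in for a fitted model.

deviation <- function(sim, mu, sigma2)
  c(mean_dev = mu - mean(sim), var_dev = sigma2 - var(sim))

set.seed(5)
# 1,000 simulated topics per run, repeated 1,000 times
devs <- replicate(1000, deviation(rbeta(1000, 2, 8),
                                  mu = 0.2, sigma2 = 16 / 1100))  # Beta(2, 8) moments
rowMeans(devs)   # average deviations should be close to zero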
3. Simulation: Dependencies
• Web 2010, nDCG@20
• Simulate 500 new topics
• Dependence is captured in the model
SAMPLE APPLICATIONS
1. Type I Errors? [Voorhees, 2009]
[Diagram] Systems S1 and S2 are compared on two disjoint topic sets, D1 and D2, each comparison yielding a p-value: do the two p-values conflict?
• Limited data
• Unknown truth (is H0 true?)
• No control over H0
• Cannot measure Type I error rates directly
• Conflict rates at α=5%:
  – AP: 2.8%
  – P@10: 10.9%
(A sketch of this conflict check follows below.)
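A rough sketch of such a split-half conflict check; this is not Voorhees' exact procedure, and the definition of a conflict used here (both halves significant but with opposite signs) is my assumption. x and y are hypothetical per-topic scores of the two systems over the same 50 topics.

conflict <- function(x, y, alpha = 0.05) {
  half <- sample(length(x), length(x) / 2)          # random split into D1 and D2
  t1 <- t.test(x[half],  y[half],  paired = TRUE)
  t2 <- t.test(x[-half], y[-half], paired = TRUE)
  t1$p.value < alpha && t2$p.value < alpha &&
    sign(t1$estimate) != sign(t2$estimate)          # significant but in opposite directions
}

set.seed(6)
x <- rbeta(50, 2, 8)
y <- pmax(0, pmin(1, x + rnorm(50, 0.01, 0.05)))    # a slightly better, correlated system
mean(replicate(1000, conflict(x, y)))               # estimated conflict rate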
1. Type I Errors [With simulation]
• Same margins: simulate both systems from the same marginal distribution, so H0 (equal means) is true by construction and every rejection is a Type I error.
  – Type I errors at α=5% and 1%: AP 4.9% and 0.9%; P@10 4.9% and 1%
• Transformed margins (different margins transformed to the same mean):
  – Type I errors at α=5% and 1%: AP 4.9% and 0.9%; P@10 5% and 1%
(A sketch of the same-margins experiment follows below.)
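A minimal sketch of the same-margins experiment, not the paper's exact setup: two systems share an identical, hypothetical Beta margin and are coupled by a Gaussian copula, so H0 of the paired t-test is true and the rejection rate estimates the Type I error rate.

set.seed(7)
type1 <- replicate(10000, {
  z <- MASS::mvrnorm(50, mu = c(0, 0),
                     Sigma = matrix(c(1, .8, .8, 1), 2))   # dependence between systems
  u <- pnorm(z)                      # pseudo-observations from the copula
  x <- qbeta(u[, 1], 2, 8)           # identical margins for both systems
  y <- qbeta(u[, 2], 2, 8)
  t.test(x, y, paired = TRUE)$p.value < 0.05
})
mean(type1)                          # should be close to the nominal 5%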
2. Sneak Peek: Statistical Power and σ [Webber et al., 2008]
• Show empirical evidence of the problem of sequential testing
• Limited data
• Unknown truth (the true σ)
TAKE HOME
Today
• Part of Evaluation Research has data-related limitations
  – Lack of data, no knowledge of truth, no control
  – How valid are our results?
• We propose a methodology for stochastic simulation to eliminate these limitations
  – Flexible, realistic, highly customizable
  – Allows us to study new problems, directly
Tomorrow
• Even more flexibility
• Simulate new systems for given topics
• Add third factors
  – Fixed: already possible
  – Random: we'll see
• Simulate full runs (doc scores & relevance)
simIReff
• All results fully reproducible
• Developed a full R package for simulation
https://github.com/julian-urbano/simIReff
library(simIReff)  # load the simIReff package (see the repository above)

# Fit discrete margins to the P@20 scores in 'data' and select the best fits
effs <- effDiscFitAndSelect(data, support("p20"))
# Fit a copula model coupling the selected margins
cop <- effcopFit(data, effs)
# Simulate P@20 scores for 1,000 new random topics
y <- reffcop(1000, cop)
