A Simple and Effective Approach
to Score Standardisation
@tetsuyasakai
http://www.f.waseda.jp/tetsuya/sakai.html
September 15@ICTIR 2016 (Newark, DE, USA)
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Hard topics, easy topics
(Bar chart: scores of Systems 1–5 on Topics 1 and 2; one topic's mean is 0.12, the other's is 0.70.)
Low-variance topics, high-variance topics
(Bar chart: scores of Systems 1–5 on Topics 1 and 2; one topic's standard deviation is 0.08, the other's is 0.29.)
Score standardisation [Webber+08]
The standardised score of the i-th system on the j-th topic is obtained from the raw topics-by-systems matrix as
z_ij = (x_ij − m・j) / s・j ,
where the standardising factors m・j and s・j are the mean and standard deviation of the raw scores of the standardising systems on topic j.
Subtract the mean; divide by the standard deviation:
how good is system i compared to the “average” system, in standard deviation units?
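For concreteness, a minimal NumPy sketch of this per-topic standardisation (illustrative only, not the author's tool; the use of the population standard deviation is an assumption here):

```python
import numpy as np

def standardise(raw):
    """Per-topic z-scores for a topics-by-runs matrix `raw`
    (rows = topics, columns = runs/systems).
    Uses the population standard deviation; that choice is an assumption."""
    m = raw.mean(axis=1, keepdims=True)   # m_.j : per-topic mean
    s = raw.std(axis=1, keepdims=True)    # s_.j : per-topic standard deviation
    return (raw - m) / s

# Toy example: 2 topics x 5 systems
raw = np.array([[0.05, 0.10, 0.12, 0.15, 0.18],
                [0.55, 0.65, 0.70, 0.75, 0.85]])
z = standardise(raw)
print(z.mean(axis=1))  # ~0 for every topic
print(z.std(axis=1))   # 1 for every topic
```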
Now for every topic, mean = 0, variance = 1.
(The same bar chart after standardisation: scores of Systems 1–5 on Topics 1 and 2, now centred on 0.)
Comparisons across different topic sets and test collections are possible!
Standardised scores have the (−∞, ∞) range
and are not very convenient.
(Same standardised bar chart as before.)
Transform them back into the [0,1] range!
std-CDF: use the cumulative distribution function of
the standard normal distribution [Webber+08]
(TREC04 plot: std-CDF nDCG vs. raw nDCG. Each curve is a topic, with 110 runs represented as dots.)
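A minimal sketch of the std-CDF mapping, assuming SciPy's standard normal CDF Φ is applied to the per-topic z-scores (illustrative, not [Webber+08]'s code):

```python
import numpy as np
from scipy.stats import norm

def std_cdf(raw):
    """std-CDF [Webber+08]: per-topic z-scores mapped back into [0, 1]
    through the standard normal CDF. `raw` is a topics-by-runs matrix."""
    z = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)
    return norm.cdf(z)  # Phi(z_ij)
```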
std-CDF: emphasises moderately high and
moderately low performers – is this a good thing?
(Same TREC04 plot as before; annotations mark the moderately high and moderately low raw nDCG regions that std-CDF emphasises.)
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
std-AB: How about a simple linear
transformation?
(TREC04 plot comparing std-CDF nDCG, std-AB nDCG (A=0.10), and std-AB nDCG (A=0.15) against raw nDCG.)
std-AB with clipping, with the range [0,1]
std-AB score = A・z_ij + B, clipped to [0, 1]
(clipping only matters for EXTREMELY good/bad systems).
Let B = 0.5 (“average” system).
Let A = 0.15, so that at least about 89% of scores fall within [0.05, 0.95]
(Chebyshev’s inequality: at most 1/9 of the scores lie more than 3 standard deviations, i.e. 3A = 0.45, from B).
A linear transformation of this form is used in educational
research: A=100, B=500 for SAT and GRE [Lodico+10],
A=10, B=50 for Japanese hensachi “standard scores”.
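A corresponding sketch of std-AB (A and B as on the slide; the clipping step is the only part that is not linear):

```python
import numpy as np

def std_ab(raw, A=0.15, B=0.5):
    """std-AB: linear transform of per-topic z-scores, clipped to [0, 1].
    `raw` is a topics-by-runs matrix."""
    z = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)
    return np.clip(A * z + B, 0.0, 1.0)
```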
In practice, clipping does not happen often.
(Two TREC04 plots with Topic ID on the x-axis: per-topic raw nDCG and std-AB nDCG for all runs; the std-AB scores rarely hit 0 or 1, so clipping is rare.)
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Data for comparing raw vs. std-CDF vs. std-AB
Ranking runs by raw, std-CDF, and std-AB measures
For each test collection, the rankings of the standardising systems are statistically equivalent.
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Standardisation factors
(Diagram: for each topic j, the raw topics-by-systems matrix of the standardising systems yields the standardisation factors <m・j, s・j>, which turn the raw matrix into the standardised matrix.)
Can the factors handle new systems properly?
(Diagram: the factors <m・j, s・j> computed from the standardising systems are applied to the raw scores of new systems that lie outside the original matrix, producing their standardised scores.)
Can the new systems be evaluated fairly?
Leave one out (1)
M runs from teams T are evaluated using N topics. QR = {QR_j} is the original qrels;
QR'(t) = {QR'_j(t)} is the qrels with the unique contributions of Team t (which has L runs) removed.
(0) Leave out Team t: T'(t) = T – {t}.
(1) Compute the measure with QR: the N × M matrix R_QR,T.
(1') Compute the measure with QR'(t): the N × (M – L) matrix R_QR'(t),T'(t) for the remaining M – L runs,
and R_QR'(t),{t} for the L runs of Team t.
Runs from Team t have been removed from the pooled systems
– are these “new” runs evaluated fairly?
Compare the two run rankings before and after leave-one-out
by means of Kendall’s tau.
Zobel’s original method [Zobel98] removed one run at a time,
but removing the entire team is more realistic [Voorhees02].
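As a rough illustration (not the paper's scripts), the before/after ranking comparison can be computed with SciPy's Kendall's tau, assuming each ranking is represented by the runs' mean scores over the N topics:

```python
from scipy.stats import kendalltau

def ranking_agreement(mean_scores_before, mean_scores_after):
    """Kendall's tau between two run rankings, given per-run mean scores
    over the same set of runs (higher score = better rank)."""
    tau, _ = kendalltau(mean_scores_before, mean_scores_after)
    return tau
```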
Leave one out (2)
(Steps (0)–(1') as before.)
(2) From R_QR,T, compute the standardising factors {<m・j, s・j>}.
(2') From R_QR'(t),T'(t), compute the factors {<m'・j, s'・j>}
– the L runs from Team t are also removed from the standardising systems.
(3) Standardise R_QR,T into the N × M matrix S_QR,T.
(3') Standardise into the N × (M – L) matrix S_QR'(t),T'(t) and into S_QR'(t),{t}:
these L runs are standardised using standardisation factors based on the (M – L) runs.
Leave one out (3)
(Steps (0)–(3') as before.)
(4a) Apply std-CDF to obtain the N × M matrix W_QR,T.
(4'a) Apply std-CDF to obtain the N × M matrix W_QR'(t),T.
Runs from Team t have been removed from the pooled systems AND
from the standardising systems – are these “new” runs evaluated fairly?
Compare the two run rankings before and after leave-one-out by means of Kendall’s tau.
Leave one out (4)
(Steps (0)–(4'a) as before.)
(4b) Apply std-AB to obtain the N × M matrix P_QR,T.
(4'b) Apply std-AB to obtain the N × M matrix P_QR'(t),T.
Runs from Team t have been removed from the pooled systems AND
from the standardising systems – are these “new” runs evaluated fairly?
Compare the two run rankings before and after leave-one-out by means of Kendall’s tau.
Leave one out results
(Results omitted here: Kendall’s tau values with the margin of error for 95% CIs; similar results for TREC04 and TREC05 can be found in the paper.)
Runs outside the pooled and standardising systems can be evaluated fairly for both std-CDF and std-AB.
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Discriminative power
• Conduct a significance test for every system pair and plot the p-values.
• Discriminative measures = those with small p-values.
• [Sakai06SIGIR] used the bootstrap test for every system pair, but running
k pairwise tests independently means that the familywise error rate
can amount to 1 − (1 − α)^k [Carterette12, Ellis10]
(e.g. with α = 0.05 and k = 100 pairs, 1 − 0.95^100 ≈ 0.99).
• [Sakai12WWW] used the randomised Tukey HSD test
[Carterette12][Sakai14PROMISE] instead to ensure that the
familywise error rate is bounded above by α.
We also use randomised Tukey HSD (see the sketch below).
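A simplified sketch of the randomised Tukey HSD idea as described in [Carterette12] (not the discpower tool itself): within each topic, scores are shuffled among systems, and every observed pairwise difference is compared against the null distribution of the maximum difference.

```python
import numpy as np
from itertools import combinations

def randomised_tukey_hsd(scores, n_trials=10000, seed=0):
    """scores: topics-by-systems matrix. Returns a dict mapping each
    system pair (i, j) to a familywise-adjusted p-value."""
    rng = np.random.default_rng(seed)
    means = scores.mean(axis=0)
    n_topics, n_systems = scores.shape
    # Null distribution of the maximum pairwise difference in system means.
    max_diffs = np.empty(n_trials)
    for t in range(n_trials):
        shuffled = np.array([rng.permutation(row) for row in scores])
        m = shuffled.mean(axis=0)
        max_diffs[t] = m.max() - m.min()
    pvals = {}
    for i, j in combinations(range(n_systems), 2):
        observed = abs(means[i] - means[j])
        pvals[(i, j)] = float(np.mean(max_diffs >= observed))
    return pvals
```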
With nDCG, std-CDF is more discriminative than
raw and std-AB scores…
It obtains more statistically significant results,
probably because std-CDF emphasises
moderately high and moderately low scores.
But with nERR, std-CDF is not discriminative
Probably because nERR is
seldom moderately high/low.
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Swap test
• System X > Y with topic set A. Does X > Y also hold with topic set B?
• [Voorhees09] splits 100 topics in half to form A and B, each with 50.
• [Sakai06SIGIR] showed that bootstrap samples (sampling with
replacement) can directly handle the original topic set size.
(Run pairs are sorted into Bins 1–21 by their performance difference, and swap rates are examined per bin.)
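As a rough sketch only (the exact bootstrap design of [Sakai06SIGIR] is not reproduced here), swap rates can be estimated by drawing two bootstrap topic samples per trial and counting how often the sign of a run pair's mean difference disagrees between them, binned by the absolute difference:

```python
import numpy as np
from itertools import combinations

def swap_rates(scores, n_trials=1000, bin_width=0.05, seed=0):
    """scores: topics-by-systems matrix. Counts sign disagreements of
    per-pair mean differences between two bootstrap topic samples,
    binned by the absolute difference observed in the first sample.
    A simplified illustration, not [Sakai06SIGIR]'s exact procedure."""
    rng = np.random.default_rng(seed)
    n_topics, n_systems = scores.shape
    counts, swaps = {}, {}
    for _ in range(n_trials):
        a = scores[rng.integers(0, n_topics, n_topics)]  # bootstrap sample A
        b = scores[rng.integers(0, n_topics, n_topics)]  # bootstrap sample B
        mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
        for i, j in combinations(range(n_systems), 2):
            da, db = mean_a[i] - mean_a[j], mean_b[i] - mean_b[j]
            bin_id = int(abs(da) // bin_width)
            counts[bin_id] = counts.get(bin_id, 0) + 1
            swaps[bin_id] = swaps.get(bin_id, 0) + (da * db < 0)
    return {k: swaps[k] / counts[k] for k in sorted(counts)}
```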
With std-CDF, we get lots of swaps.
std-AB is much more consistent across topic sets.
What if we consider only run pairs that are statistically
significantly different according to randomised Tukey HSD?
Significantly different pairs (raw/std-CDF/std-AB):
                       nDCG             nERR
TREC03 (3,003 pairs)   810/844/812      378/357/386
TREC04 (5,995 pairs)   1434/1723/1534   223/220/250
TREC05 (2,701 pairs)   727/879/758      336/329/346
Run pairs are now sorted into Bins 1'–6'; each bin has a wider range,
as the number of observations is small.
After filtering pairs with randomised Tukey HSD,
swaps almost never occur for all three score types
(Swap-rate tables omitted for TREC03, TREC04 and TREC05, Bins 1'–3':
differences in [0, 0.10), [0.10, 0.20) and [0.20, 0.30).)
Previous work did not consider the familywise error rate problem
(it used pairwise tests many times).
Example for nERR: 378 significant pairs give 378,000 observations
(i.e. 1,000 trials per pair); 980 observations fall in Bin 1',
and only 1 of them (0.10%) is a swap.
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Topic set size design [Sakai16IRJ, Sakai16ICTIRtutorial]
To determine the topic set size n for a new test collection to be built,
Sakai’s Excel tool based on one-way ANOVA power analysis takes as input:
α: Type I error probability
β: Type II error probability (power = 1 – β)
M: number of systems to be compared
minD: minimum detectable range
= minimum difference between the best and worst systems for which you want
to guarantee 100(1 – β)% power
σ̂²: estimate of the within-system variance (typically obtained from a
pilot topic-by-run matrix).
Estimating the within-system variance for
each measure (to obtain future n)
For each pilot collection C (TREC03, TREC04, TREC05), take the topics-by-runs matrix and
compute the residual variance from one-way ANOVA,
V̂_C = Σ_i Σ_j (x_ij − x̄_i・)² / (m_C(n_C − 1)),
where x̄_i・ is the sample mean for system i, m_C is the number of runs and n_C the number of topics;
then pool the variances across C = TREC03, TREC04, TREC05.
Do this for the raw, std-CDF, and std-AB score matrices
to obtain the n’s.
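A minimal sketch of this variance estimation (standard one-way ANOVA with systems as groups; the degrees-of-freedom-weighted pooling across collections is an assumption about the exact pooling used):

```python
import numpy as np

def residual_variance(scores):
    """scores: topics-by-systems matrix (n topics x m systems).
    Residual variance from one-way ANOVA with systems as groups."""
    n, m = scores.shape
    system_means = scores.mean(axis=0)                  # x-bar_i. for each system
    ss_residual = ((scores - system_means) ** 2).sum()  # within-system sum of squares
    return ss_residual / (m * (n - 1))

def pooled_variance(matrices):
    """Degrees-of-freedom-weighted pooling over several pilot collections."""
    ss = sum(((s - s.mean(axis=0)) ** 2).sum() for s in matrices)
    df = sum(s.shape[1] * (s.shape[0] - 1) for s in matrices)
    return ss / df
```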
With std-AB, we get very small within-system
variances (1)
The initial estimate of n with the one-way ANOVA topic set size design
is given by a formula from [Nagata03] involving minD, the pooled within-system variance σ̂²,
and λ, the noncentrality parameter of a noncentral chi-square distribution, for (α, β) = (0.05, 0.20).
So n will be small if σ̂² is small.
With std-AB, σ̂² is indeed small because A is small (e.g. 0.15): ignoring clipping,
since each topic’s standardised scores have mean 0 and variance 1, it can be shown that
σ̂² ≤ A²・n/(n − 1), i.e. roughly A² (≈ 0.023 for A = 0.15).
With std-AB, we get very small within-system
variances (2)
(Table of within-system variance estimates omitted.)
std-AB gives us
more realistic topic
set sizes for small
minD values
• Does not mean that std-AB is
“better” than std-CDF and raw,
because a minD of (say) 0.02 in std-AB
nDCG is not equivalent to a minD of 0.02
in std-CDF or raw.
• Nevertheless, having realistic topic set
sizes for a variety of minD values is
probably a convenient feature.
If we had fewer teams, what would happen to the
standardisation factors? (1)
(Diagram: remove k teams from the raw topics-by-systems matrix of the standardising systems,
then recompute the standardisation factors <m・j, s・j>.)
If the standardisation factors are similar,
that implies that we don’t need many systems
to obtain reliable values.
If we had fewer teams, what would happen to the
standardisation factors? (2)
Starting with 16 teams,
k=0,…,14 teams were removed from
the matrices before obtaining
standardisation factors.
Each line represents m・j or s・j
for a topic (CIs omitted for brevity).
They are quite stable, even when
k=14 teams have been removed.
That is, only a few teams are needed
to obtain reliable values
of m・j and s・j .
If we had fewer teams, what would happen to within-system
variances for std-AB? (1)
(Diagram: remove k teams from the raw topics-by-systems matrix of the standardising systems,
then recompute the factors <m・j, s・j> and the std-AB within-system variance.)
If the variance estimates are similar,
that implies that we don’t need many systems
to obtain reliable values.
If we had fewer teams, what would happen to within-system
variances for std-AB? (2)
Each k had 10 trials so 95% CIs of
the variance estimates are shown.
The variance estimates are also
stable even if we remove a lot of
teams. That is, only a few teams are
needed to obtain reliable variance
estimates for topic set size design.
Using std-AB with topic set size
design also means that we can
handle unnormalised measures
without any problems [Sakai16AIRS].
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Conclusions
• Advantages of score standardisation:
- removes topic hardness, enables comparison across test collections
- normalisation becomes unnecessary
• Advantages of std-AB over std-CDF:
Low within-system variances and therefore
- Substantially lower swap rates (higher consistency across different data)
- Enables us to consider realistic topic set sizes in topic set size design
• By-product: Using randomised Tukey HSD (instead of repeated pairwise
tests) can ensure that swaps almost never occur.
Swap rates for std-CDF can be higher than
those for raw scores, probably due to its
nonlinear transformation
std-AB is a good alternative to std-CDF.
If you want a p-value for every system pair, this test is highly recommended.
Shared resources
• All of the topic-by-run matrices created in our experiments are
available at https://waseda.box.com/ICTIR2016PACK
• Computing AP, Q-measure, nDCG, nERR etc.:
http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html
• Discriminative power by randomised Tukey HSD:
http://research.nii.ac.jp/ntcir/tools/discpower-en.html
• Topic set size design Excel tools:
http://www.f.waseda.jp/tetsuya/tools.html
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
We Want Web@NTCIR-13 (1)
http://www.thuir.cn/ntcirwww/
NTCIR-13 (Dec 2017): a frozen topic set plus an NTCIR-13 fresh topic set.
New runs from the NTCIR-13 systems are pooled for the frozen + fresh topics.
We Want Web@NTCIR-13 (2)
http://www.thuir.cn/ntcirwww/
Official NTCIR-13 results are discussed with the fresh topics.
Frozen topics: qrels + std. factors based on the NTCIR-13 systems are NOT released.
NTCIR-13 fresh topics: qrels + std. factors based on the NTCIR-13 systems are released.
We Want Web@NTCIR-14 (1)
http://www.thuir.cn/ntcirwww/
NTCIR-13 (Dec 2017) and NTCIR-14 (Jun 2019) share the frozen topic set;
each round also has its own fresh topic set.
New runs from the NTCIR-14 systems are pooled for the frozen + fresh topics.
Revived runs are pooled for the fresh topics.
We Want Web@NTCIR-14 (2)
http://www.thuir.cn/ntcirwww/
Official NTCIR-14 results are discussed with the fresh topics.
Frozen topics: qrels + std. factors based on the NTCIR-13+14 systems are NOT released.
NTCIR-14 fresh topics: qrels + std. factors based on the NTCIR-(13+)14 systems are released.
Using the NTCIR-14 fresh topics, compare new NTCIR-14 runs with revived runs and quantify progress.
We Want Web@NTCIR-15 (1)
http://www.thuir.cn/ntcirwww/
NTCIR-13 (Dec 2017), NTCIR-14 (Jun 2019) and NTCIR-15 (Dec 2020) share the frozen topic set;
each round also has its own fresh topic set.
New runs from the NTCIR-15 systems are pooled for the frozen + fresh topics.
Revived runs are pooled for the fresh topics.
We Want Web@NTCIR-15 (2)
http://www.thuir.cn/ntcirwww/
Official NTCIR-15 results are discussed with the fresh topics.
NTCIR-15 fresh topics: qrels + std. factors based on the NTCIR-(13+14+)15 systems are released.
Using the NTCIR-15 fresh topics, compare new NTCIR-15 runs with revived runs and quantify progress.
We Want Web@NTCIR-15 (3)
http://www.thuir.cn/ntcirwww/
For the frozen topics, qrels + std. factors are released for each round:
based on the NTCIR-13 systems, on the NTCIR-13+14 systems, and on the NTCIR-13+14+15 systems
(the qrels + std. factors based on the NTCIR-(13+14+)15 systems for the fresh topics are also released,
and the official NTCIR-15 results are discussed with the fresh topics).
How do the standardisation factors for each frozen topic differ across the 3 rounds?
We Want Web@NTCIR-15 (4)
http://www.thuir.cn/ntcirwww/
Each of the released qrels + std. factor sets for the frozen topics
(based on the NTCIR-13, NTCIR-13+14, and NTCIR-13+14+15 systems)
yields its own NTCIR-15 system ranking.
How do the NTCIR-15 system rankings differ across the 3 rounds, with and w/o standardisation?
See you all in Tokyo
Selected references (1)
[Carterette12] Carterette: Multiple testing in statistical analysis of
systems-based information retrieval experiments, ACM TOIS 30(1),
2012.
[Ellis10] Ellis: The essential guide to effect sizes, Cambridge, 2010.
[Lodico+10] Lodico, Spaulding, Voegtle: Methods in educational
research, Jossey-Bass, 2010.
Selected references (2)
[Sakai06SIGIR] Sakai: Evaluating evaluation metrics based on the bootstrap, ACM
SIGIR 2006.
[Sakai12WWW] Sakai: Evaluation with Informational and Navigational Intents,
WWW 2012.
[Sakai14PROMISE] Sakai: Metrics, statistics, tests, PROMISE Winter School 2013
(LNCS 8173).
[Sakai16IRJ] Sakai: Topic set size design, Information Retrieval Journal 19(3), 2016.
http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf
[Sakai16ICTIRtutorial] Sakai: Topic set size design and power analysis in practice,
ICTIR 2016 Tutorial.
http://www.slideshare.net/TetsuyaSakai/ictir2016tutorial-65845256
[Sakai16AIRS] Sakai: The Effect of Score Standardisation on Topic Set Size Design,
AIRS 2016, to appear.
Selected references (3)
[Voorhees02] Voorhees: The philosophy of information retrieval
evaluation, CLEF 2001.
[Voorhees09] Voorhees: Topic set size redux, ACM SIGIR 2009.
[Webber+08] Webber, Moffat, Zobel: Score standardisation for inter-
collection comparison of retrieval systems, ACM SIGIR 2008.
[Zobel98] Zobel: How reliable are the results of large-scale information
retrieval experiments? ACM SIGIR 1998.
