A Simple and Effective Approach
to Score Standardisation
@tetsuyasakai
http://www.f.waseda.jp/tetsuya/sakai.html
September 15@ICTIR 2016 (Newark, DE, USA)
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Hard topics, easy topics
[Bar chart: scores of Systems 1–5 on two topics; Topic 1 mean = 0.12, Topic 2 mean = 0.70]
Low-variance topics, high-variance topics
[Bar chart: scores of Systems 1–5 on two topics; Topic 1 standard deviation = 0.08, Topic 2 standard deviation = 0.29]
Score standardisation [Webber+08]
Standardised score for the i-th system on the j-th topic: subtract the topic mean, then divide by the topic standard deviation,
z_ij = (x_ij − m・j) / s・j ,
where x_ij is the raw score and the standardising factors <m・j, s・j> are the per-topic mean and standard deviation over the standardising systems.
[Diagram: a Topics × Systems matrix of raw scores is transformed into a Topics × Systems matrix of standardised scores]
How good is system i compared to the “average” system, in standard deviation units?
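To make the mechanics concrete, here is a minimal sketch (not from the slides; the topic-by-run orientation, the use of the sample standard deviation, and the function name are my own choices) of computing the per-topic standardising factors and standardised scores:

```python
import numpy as np

def standardise(raw):
    """raw: N x M matrix of scores (N topics as rows, M standardising systems as columns).
    Returns the standardised matrix plus the per-topic factors (m_j, s_j)."""
    m = raw.mean(axis=1, keepdims=True)          # per-topic mean m_j
    s = raw.std(axis=1, ddof=1, keepdims=True)   # per-topic standard deviation s_j
    z = (raw - m) / s                            # subtract mean, divide by SD
    return z, m, s

# Toy example: 2 topics x 5 systems (cf. the "hard vs. easy topics" slide:
# Topic 1 mean = 0.12, Topic 2 mean = 0.70)
raw = np.array([[0.05, 0.10, 0.12, 0.15, 0.18],
                [0.60, 0.65, 0.70, 0.75, 0.80]])
z, m, s = standardise(raw)
print(z.mean(axis=1), z.std(axis=1, ddof=1))  # each topic: mean ~0, SD ~1
```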
Now, for every topic, mean = 0 and variance = 1.
[Bar chart: standardised scores of Systems 1–5 on Topics 1 and 2, on a −2 to 2 scale]
Comparisons across different topic sets and test collections are possible!
Standardised scores have the (−∞, ∞) range and are not very convenient.
[Bar chart: the same standardised scores as on the previous slide]
Transform them back into the [0,1] range!
std-CDF: use the cumulative distribution function of the standard normal distribution [Webber+08]
[Scatter plot, TREC04: raw nDCG (x-axis) vs. std-CDF nDCG (y-axis); each curve is a topic, with 110 runs represented as dots]
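A minimal sketch of the std-CDF transform (assuming SciPy; the standardisation step is as in the earlier sketch, and the function name is mine):

```python
from scipy.stats import norm

def std_cdf(z):
    """Map standardised scores back into [0,1] via the standard normal CDF [Webber+08]."""
    return norm.cdf(z)

# A standardised score of 0 (an "average" run) maps to 0.5;
# +1 SD maps to ~0.84 and -1 SD to ~0.16.
print(std_cdf(0.0), std_cdf(1.0), std_cdf(-1.0))
```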
std-CDF: emphasises moderately high and moderately low performers – is this a good thing?
[Scatter plot, TREC04: raw nDCG vs. std-CDF nDCG; the steep middle of the CDF stretches out the moderately high and moderately low scores]
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
std-AB: How about a simple linear transformation?
[Scatter plot, TREC04: raw nDCG (x-axis) vs. std-CDF nDCG, std-AB nDCG (A=0.10), and std-AB nDCG (A=0.15)]
std-AB with clipping, with the range [0,1]
Let B = 0.5 (the “average” system).
Let A = 0.15, so that at least 89% of scores fall within [0.05, 0.95] (Chebyshev’s inequality).
For EXTREMELY good/bad systems, the score is clipped to [0,1].
This (A, B) formula is used in educational research: A=100, B=500 for the SAT and GRE [Lodico+10]; A=10, B=50 for Japanese hensachi “standard scores”.
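The transform itself did not survive extraction as text; the following LaTeX is my reconstruction from the slide’s description (a linear map of the standardised score, clipped to [0,1], analogous to hensachi = 10z + 50), so the exact notation in the paper may differ:

```latex
% Reconstruction of the std-AB transform (z_{ij} is the standardised score):
p_{ij} = \min\!\bigl(1,\ \max\bigl(0,\ A\, z_{ij} + B\bigr)\bigr),
\qquad A = 0.15,\ B = 0.5 .

% Chebyshev's inequality: \Pr(|z_{ij}| \ge 3) \le 1/3^2, so at least
% 1 - 1/9 \approx 89\% of scores fall within B \pm 3A = [0.05,\ 0.95].
```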
In practice, clipping does not happen often.
[Per-topic score plots over Topic IDs 1–49: TREC04 raw nDCG (top) and TREC04 std-AB nDCG (bottom), both on the [0,1] scale]
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Data for comparing raw vs. std-CDF vs. std-AB
[Table: test collections used (TREC03, TREC04, TREC05) with their topic and run counts]
Ranking runs by raw, std-CDF, and std-AB measures
[Table of run rankings]
For each test collection, the rankings of the standardising systems are statistically equivalent.
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Standardisation factors
[Diagram: from the Topics × Standardising-systems matrix of raw scores, compute the per-topic factors <m・j, s・j> and obtain the matrix of standardised scores]
Can the factors handle new systems properly?
[Diagram: the factors <m・j, s・j> computed from the standardising systems are applied to new systems that lie outside the standardising set, mapping their raw scores to standardised scores]
Can the new systems be evaluated fairly?
Leave one out (1)
(0) Leave out Team t: T’(t) = T – {t}. QR = {QRj} is the original qrels (evaluating M runs using N topics); QR’(t) = {QR’j(t)} is the qrels with the unique contributions of Team t (which has L runs) removed.
(1) Compute the measure with QR: the N × M matrix R_(QR,T).
(1’) Compute the measure with QR’(t): the N × (M – L) matrix R_(QR’(t),T’(t)) for the remaining runs, plus R_(QR’(t),{t}) for Team t’s L runs.
Runs from Team t have been removed from the pooled systems – are these “new” runs evaluated fairly?
Compare the two run rankings before and after leave-one-out by means of Kendall’s tau.
Zobel’s original method [Zobel98] removed one run at a time, but removing the entire team is more realistic [Voorhees02].
Leave one out (2)
Steps (0), (1), (1’) are as in Leave one out (1).
(2) Compute the factors { <m・j, s・j> } from R_(QR,T); (2’) compute the factors { <m’・j, s’・j> } from R_(QR’(t),T’(t)).
(3) Standardise to obtain the N × M matrix S_(QR,T); (3’) standardise to obtain the N × (M – L) matrix S_(QR’(t),T’(t)) and S_(QR’(t),{t}).
The L runs from Team t are also removed from the standardising systems; these L runs are standardised using standardisation factors based on the remaining (M – L) runs.
Leave one out (3)
Steps (0)–(3’) are as in Leave one out (2).
(4a) Apply std-CDF to obtain the N × M matrix W_(QR,T); (4’a) apply std-CDF to obtain the N × M matrix W_(QR’(t),T).
Runs from Team t have been removed from the pooled systems AND from the standardising systems – are these “new” runs evaluated fairly?
Compare the two run rankings before and after leave-one-out by means of Kendall’s tau.
Leave one out (4)
Steps (0)–(3’) are as before.
(4a) std-CDF: N × M matrix W_(QR,T); (4’a) std-CDF: N × M matrix W_(QR’(t),T).
(4b) std-AB: N × M matrix P_(QR,T); (4’b) std-AB: N × M matrix P_(QR’(t),T).
Runs from Team t have been removed from the pooled systems AND from the standardising systems – are these “new” runs evaluated fairly?
Compare the two run rankings before and after leave-one-out by means of Kendall’s tau.
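For the std-AB branch of this procedure, a minimal Python sketch of the leave-one-team-out comparison (my own simplification: the recomputation of qrels without Team t’s pooled documents is omitted, and all variable and function names are mine):

```python
import numpy as np
from scipy.stats import kendalltau

A, B = 0.15, 0.5

def std_ab(z):
    """std-AB: linear transform of standardised scores, clipped to [0,1]."""
    return np.clip(A * z + B, 0.0, 1.0)

def leave_one_team_out(raw, team_cols):
    """raw: N x M topic-by-run matrix; team_cols: column indices of Team t's L runs.
    Standardises all runs twice: with factors from all M runs ("full"), and with
    factors computed WITHOUT Team t's runs ("loo"). Returns Kendall's tau between
    the two mean std-AB rankings of the M runs."""
    rest = np.delete(raw, team_cols, axis=1)                 # (M - L) standardising runs
    m_loo = rest.mean(axis=1, keepdims=True)
    s_loo = rest.std(axis=1, ddof=1, keepdims=True)
    m_all = raw.mean(axis=1, keepdims=True)
    s_all = raw.std(axis=1, ddof=1, keepdims=True)
    full = std_ab((raw - m_all) / s_all)
    loo = std_ab((raw - m_loo) / s_loo)
    tau, _ = kendalltau(full.mean(axis=0), loo.mean(axis=0))
    return tau
```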
Leave one out results
[Table: Kendall’s tau between the rankings before and after leave-one-out, with the margin of error for the 95% CI]
Similar results for TREC04 and TREC05 can be found in the paper.
Runs outside the pooled and standardising systems can be evaluated fairly for both std-CDF and std-AB.
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Discriminative power
• Conduct a significance test for every system pair and plot the p-values.
• Discriminative measures = those with small p-values.
• [Sakai06SIGIR] used the bootstrap test for every system pair, but running k pairwise tests independently means that the familywise error rate can amount to 1-(1-α)^k [Carterette12, Ellis10].
• [Sakai12WWW] used the randomised Tukey HSD test [Carterette12][Sakai14PROMISE] instead, to ensure that the familywise error rate is bounded above by α.
We also use randomised Tukey HSD.
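As a worked illustration of why this matters (the numbers below are my own, not from the slides):

```latex
% Familywise error rate under k independent pairwise tests at level \alpha = 0.05:
1 - (1-\alpha)^{k} \approx
\begin{cases}
0.40, & k = 10,\\
0.90, & k = 45 \quad (\text{e.g. 10 systems, all pairs}),\\
1,    & k = 3003 \quad (\text{all pairs of the TREC03 runs}).
\end{cases}
```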
With nDCG, std-CDF is more discriminative than raw and std-AB scores…
It gets more statistically significant results, probably because std-CDF emphasises moderately high and moderately low scores.
But with nERR, std-CDF is not discriminative, probably because nERR is seldom moderately high/low.
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Swap test
• System X > Y with topic set A. Does X > Y also hold with topic set B?
• [Voorhees09] splits 100 topics in half to form A and B, each with 50.
• [Sakai06SIGIR] showed that bootstrap samples (sampling with replacement) can directly handle the original topic set size.
[Diagram: run pairs are sorted into Bins 1–21 by performance difference, and swap rates are computed per bin]
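A minimal sketch of the bootstrap swap check for a single run pair (my own simplification of the method; binning across many pairs is omitted, and the trial count is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def swap_rate(scores_x, scores_y, trials=1000):
    """scores_x, scores_y: per-topic scores of two runs (length N).
    Draws pairs of bootstrap topic sets (A, B) of the original size and counts
    how often the sign of the mean difference between X and Y flips across A and B."""
    diff = np.asarray(scores_x) - np.asarray(scores_y)
    n = len(diff)
    swaps = 0
    for _ in range(trials):
        a = rng.integers(0, n, n)   # bootstrap topic set A (sampling with replacement)
        b = rng.integers(0, n, n)   # bootstrap topic set B
        if diff[a].mean() * diff[b].mean() < 0:
            swaps += 1
    return swaps / trials
```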
With std-CDF, we get lots of swaps.
std-AB is much more consistent across topic sets.
What if we consider only run pairs that are statistically significantly different according to randomised Tukey HSD?

Significantly different pairs (raw/std-CDF/std-AB):
                        nDCG              nERR
TREC03 (3,003 pairs)    810/844/812       378/357/386
TREC04 (5,995 pairs)    1434/1723/1534    223/220/250
TREC05 (2,701 pairs)    727/879/758       336/329/346

[Diagram: the surviving pairs are sorted into Bins 1’–6’; each bin now has a wider range as the number of observations is small]
After filtering pairs with randomised Tukey HSD, swaps almost never occur for all three score types.
[Tables for TREC03, TREC04, TREC05: swap counts in Bins 1’–3’, i.e. difference ranges [0, 0.10), [0.10, 0.20), [0.20, 0.30)]
Previous work did not consider the familywise error rate problem (it used pairwise tests many times).
Example (TREC03, nERR): #significant pairs: 378; #observations: 378,000; #observations in Bin 1’: 980; #swaps in Bin 1’: 1 (0.10%).
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Topic set size design [Sakai16IRJ,Sakai16ICTIRtutorial]
To determine the topic set size n for a new test collection to be built, Sakai’s Excel tool based on one-way ANOVA power analysis takes as input:
α: Type I error probability
β: Type II error probability (power = 1 – β)
M: number of systems to be compared
minD: minimum detectable range = the minimum difference between the best and worst systems for which you want to guarantee a power of 1 – β
σ̂²: estimate of the within-system variance (typically obtained from a pilot topic-by-run matrix)
Estimating the within-system variance for each measure (to obtain future n)
[Diagram: for each topic-by-run matrix (C = TREC03, TREC04, TREC05), compute the residual variance from one-way ANOVA (deviations of each score from the sample mean for system i), then pool the variances across the collections]
Do this for the raw, std-CDF, and std-AB score matrices to obtain the n’s.
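A sketch of the residual-variance computation under my reading of this slide (one-way ANOVA with systems as the factor; the function names and the degrees-of-freedom weighted pooling rule are my own choices):

```python
import numpy as np

def residual_variance(scores):
    """scores: N x M topic-by-run matrix (N topics, M systems).
    One-way ANOVA residual variance with systems as the factor: the sum of squared
    deviations from each system's sample mean, divided by the error df M*(N-1)."""
    n, m = scores.shape
    system_means = scores.mean(axis=0, keepdims=True)   # sample mean for system i
    ss_error = ((scores - system_means) ** 2).sum()
    return ss_error / (m * (n - 1))

def pooled_variance(matrices):
    """Pool residual variances from several pilot collections, weighting by error df."""
    num = sum(residual_variance(x) * x.shape[1] * (x.shape[0] - 1) for x in matrices)
    den = sum(x.shape[1] * (x.shape[0] - 1) for x in matrices)
    return num / den
```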
With std-AB, we get very small within-system variances (1)
The initial estimate of n with the one-way ANOVA topic set size design is given by [Nagata03]:
[Formula not preserved in this text version: the initial estimate of n in terms of λ, minD, and σ̂², where λ is the noncentrality parameter of a noncentral chi-square distribution; a numerical value of λ for (α, β) = (0.05, 0.20) is given on the original slide]
So n will be small if σ̂² is small.
With std-AB, σ̂² is indeed small because A is small (e.g. 0.15); a relation involving A, shown on the original slide, makes this precise.
With std-AB, we get very small within-system variances (2)
[Table: within-system variance estimates for raw, std-CDF, and std-AB scores]
std-AB gives us more realistic topic set sizes for small minD values.
• This does not mean that std-AB is “better” than std-CDF and raw, because a minD of (say) 0.02 in std-AB nDCG is not equivalent to a minD of 0.02 in std-CDF or raw.
• Nevertheless, having realistic topic set sizes for a variety of minD values is probably a convenient feature.
If we had fewer teams, what would happen to the standardisation factors? (1)
[Diagram: compute the factors <m・j, s・j> from the full Topics × Standardising-systems matrix, then again after removing k teams]
If the standardisation factors are similar, that implies that we don’t need many systems to obtain reliable values.
If we had fewer teams, what would happen to the standardisation factors? (2)
Starting with 16 teams, k = 0,…,14 teams were removed from the matrices before obtaining the standardisation factors.
[Plot: each line represents m・j or s・j for a topic as a function of k (CIs omitted for brevity)]
They are quite stable, even when k = 14 teams have been removed. That is, only a few teams are needed to obtain reliable values of m・j and s・j.
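A minimal sketch of this robustness check (my own framing; the team-to-column mapping and the number of trials are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

def factors_after_removal(raw, team_of_col, k, trials=10):
    """raw: N x M topic-by-run matrix; team_of_col[i] = team label of column i.
    Removes k randomly chosen teams and recomputes the per-topic factors (m_j, s_j);
    the caller compares these against the factors from the full matrix."""
    teams = np.unique(team_of_col)
    results = []
    for _ in range(trials):
        keep_teams = rng.choice(teams, size=len(teams) - k, replace=False)
        cols = [c for c, t in enumerate(team_of_col) if t in keep_teams]
        sub = raw[:, cols]
        results.append((sub.mean(axis=1), sub.std(axis=1, ddof=1)))
    return results
```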
If we had fewer teams, what would happen to within-system variances for std-AB? (1)
[Diagram: compute the std-AB within-system variance estimate from the full matrix, then again after removing k teams]
If the variance estimates are similar, that implies that we don’t need many systems to obtain reliable values.
If we had fewer teams, what would happen to within-system variances for std-AB? (2)
[Plot: variance estimates as a function of k; each k had 10 trials, so 95% CIs of the variance estimates are shown]
The variance estimates are also stable even if we remove a lot of teams. That is, only a few teams are needed to obtain reliable variance estimates for topic set size design.
Using std-AB with topic set size design also means that we can handle unnormalised measures without any problems [Sakai16AIRS].
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
Conclusions
• Advantages of score standardisation:
- removes topic hardness, enables comparison across test collections
- normalisation becomes unnecessary
• Advantages of std-AB over std-CDF: low within-system variances, and therefore
- substantially lower swap rates (higher consistency across different data); swap rates for std-CDF can be higher than those for raw scores, probably due to its nonlinear transformation
- enables us to consider realistic topic set sizes in topic set size design
std-AB is a good alternative to std-CDF.
• By-product: using randomised Tukey HSD (instead of repeated pairwise tests) can ensure that swaps almost never occur. If you want a p-value for every system pair, this test is highly recommended.
Shared resources
• All of the topic-by-run matrices created in our experiments are
available at https://waseda.box.com/ICTIR2016PACK
• Computing AP, Q-measure, nDCG, nERR etc.:
http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html
• Discriminative power by randomised Tukey HSD:
http://research.nii.ac.jp/ntcir/tools/discpower-en.html
• Topic set size design Excel tools:
http://www.f.waseda.jp/tetsuya/tools.html
TALK OUTLINE
1. Score standardisation and std-CDF
2. Proposed method: std-AB
3. Data and measures
4. Handling new systems: Leave one out
5. Discriminative power
6. Swap rates
7. Topic set size design
8. Conclusions
9. Future work
We Want Web@NTCIR-13 (1)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-13 (Dec 2017): the NTCIR-13 systems submit new runs, which are pooled for the frozen + fresh topics (the frozen topic set plus the NTCIR-13 fresh topic set)]
We Want Web@NTCIR-13 (2)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-13 (Dec 2017): official NTCIR-13 results are discussed with the fresh topics. Qrels + std. factors based on the NTCIR-13 systems: NOT released for the frozen topic set, released for the NTCIR-13 fresh topic set]
We Want Web@NTCIR-14 (1)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-13 (Dec 2017) and NTCIR-14 (Jun 2019): each round uses the frozen topic set plus its own fresh topic set. The NTCIR-14 systems submit new runs, pooled for the frozen + fresh topics; revived runs are pooled for the fresh topics]
We Want Web@NTCIR-14 (2)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-14 (Jun 2019): official NTCIR-14 results are discussed with the fresh topics. Qrels + std. factors based on the NTCIR-13+14 systems: NOT released for the frozen topic set; qrels + std. factors based on the NTCIR-(13+)14 systems: released for the NTCIR-14 fresh topic set]
Using the NTCIR-14 fresh topics, compare new NTCIR-14 runs with revived runs and quantify progress.
We Want Web@NTCIR-15 (1)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-13 (Dec 2017), NTCIR-14 (Jun 2019), NTCIR-15 (Dec 2020): each round uses the frozen topic set plus its own fresh topic set. The NTCIR-15 systems submit new runs, pooled for the frozen + fresh topics; revived runs are pooled for the fresh topics]
We Want Web@NTCIR-15 (2)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-15 (Dec 2020): official NTCIR-15 results are discussed with the fresh topics. Qrels + std. factors based on the NTCIR-(13+14+)15 systems: released]
Using the NTCIR-15 fresh topics, compare new NTCIR-15 runs with revived runs and quantify progress.
We Want Web@NTCIR-15 (3)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-15 (Dec 2020): for the frozen topic set, qrels + std. factors based on the NTCIR-13 systems, on the NTCIR-13+14 systems, and on the NTCIR-13+14+15 systems are all released; for the fresh topics, qrels + std. factors based on the NTCIR-(13+14+)15 systems are released. Official NTCIR-15 results are discussed with the fresh topics]
How do the standardisation factors for each frozen topic differ across the 3 rounds?
We Want Web@NTCIR-15 (4)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-15 (Dec 2020): the NTCIR-15 systems are ranked three times, using the qrels + std. factors based on the NTCIR-13 systems, on the NTCIR-13+14 systems, and on the NTCIR-13+14+15 systems, alongside the official results based on the NTCIR-(13+14+)15 systems and the fresh topics]
How do the NTCIR-15 system rankings differ across the 3 rounds, with and w/o standardisation?
See you all in Tokyo
Selected references (1)
[Carterette12] Carterette: Multiple testing in statistical analysis of
systems-based information retrieval experiments, ACM TOIS 30(1),
2012.
[Ellis10] Ellis: The essential guide to effect sizes, Cambridge, 2010.
[Lodico+10] Lodico, Spaulding, Voegtle: Methods in educational
research, Jossey-Bass, 2010.
Selected references (2)
[Sakai06SIGIR] Sakai: Evaluating evaluation metrics based on the bootstrap, ACM
SIGIR 2006.
[Sakai12WWW] Sakai: Evaluation with Informational and Navigational Intents,
WWW 2012.
[Sakai14PROMISE] Sakai: Metrics, statistics, tests, PROMISE Winter School 2013
(LNCS 8173).
[Sakai16IRJ] Sakai: Topic set size design, Information Retrieval Journal 19(3), 2016.
http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf
[Sakai16ICTIRtutorial] Sakai: Topic set size design and power analysis in practice,
ICTIR 2016 Tutorial.
http://www.slideshare.net/TetsuyaSakai/ictir2016tutorial-65845256
[Sakai16AIRS] Sakai: The Effect of Score Standardisation on Topic Set Size Design,
AIRS 2016, to appear.
Selected references (3)
[Voorhees02] Voorhees: The philosophy of information retrieval
evaluation, CLEF 2001.
[Voorhees09] Voorhees: Topic set size redux, ACM SIGIR 2009.
[Webber+08] Webber, Moffat, Zobel: Score standardisation for inter-
collection comparison of retrieval systems, ACM SIGIR 2008.
[Zobel98] Zobel: How reliable are the results of large-scale information
retrieval experiments? ACM SIGIR 1998.
