The Effect of
Score Standardisation on
Topic Set Size Design
@tetsuyasakai
Waseda University, Japan
http://www.f.waseda.jp/tetsuya/sakai.html
November 30, 2016 @AIRS 2016, Beijing.
TALK OUTLINE
1. Score standardisation
2. Topic set size design
3. NTCIR-12 tasks
4. Results
5. Conclusions
6. Future work: NTCIR WWW
Hard topics, easy topics
[Bar chart: raw scores of Systems 1-5 on two topics; one topic's mean = 0.12 (hard), the other's mean = 0.70 (easy); y-axis 0 to 1.]
Low-variance topics, high-variance topics
[Bar chart: raw scores of Systems 1-5 on two topics; one topic's standard deviation = 0.08 (low variance), the other's standard deviation = 0.29 (high variance); y-axis 0 to 1.]
Score standardisation [Webber+08]
standardised score for the i-th system on the j-th topic:
z_ij = (x_ij - μ_j) / σ_j
where x_ij is the raw score in the topic-by-run matrix, and the standardising factors μ_j and σ_j are the mean and standard deviation of the raw scores for topic j across systems.
Subtract the mean; divide by the standard deviation: how good is system i compared to the "average" system, in standard deviation units?
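(A minimal sketch of this transformation in Python, assuming the topic-by-run matrix is a NumPy array with one row per topic; the names are illustrative, not the author's code:)

    import numpy as np

    def standardise(raw):
        # raw: topic-by-run matrix (rows = topics j, columns = systems i).
        mu = raw.mean(axis=1, keepdims=True)  # per-topic mean over systems
        sd = raw.std(axis=1, keepdims=True)   # per-topic standard deviation
        # (population SD here; using the sample SD, ddof=1, is a convention choice)
        return (raw - mu) / sd                # z_ij: each topic row now has mean 0, variance 1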
Now for every topic, mean = 0, variance = 1.
[Bar chart: standardised scores of Systems 1-5 on Topics 1 and 2, on a y-axis from -2 to 2.]
Comparisons across different topic sets and test collections are possible!
Standardised scores have the (-∞, ∞) range
and are not very convenient.
[Bar chart: standardised scores of Systems 1-5 on Topics 1 and 2, on a y-axis from -2 to 2.]
Transform them back into the [0,1] range!
std-CDF: use the cumulative distribution function of
the standard normal distribution [Webber+08]
[Plot, TREC04: raw nDCG (x-axis) vs. std-CDF nDCG (y-axis). Each curve is a topic, with 110 runs represented as dots.]
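(As a sketch, the std-CDF mapping is just the standard normal CDF Φ applied to each standardised score; here via Python's math.erf, illustrative code rather than the authors' implementation:)

    import math

    def std_cdf(z):
        # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) maps a standardised score onto (0, 1).
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))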
std-CDF: emphasises moderately high and
moderately low performers – is this a good thing?
[Same TREC04 plot: the std-CDF curves are steepest for moderately high and moderately low raw nDCG scores.]
std-AB: How about a simple linear
transformation? [Sakai16ICTIR]
[Plot, TREC04: raw nDCG (x-axis) vs. std-CDF nDCG, std-AB nDCG (A=0.10), and std-AB nDCG (A=0.15) (y-axis).]
std-AB with clipping: map each standardised score to A·z_ij + B, clipped to the range [0,1].
Let B=0.5 (the "average" system maps to 0.5).
Let A=0.15, so that 89% of scores fall within [0.05, 0.95] = B ± 3A
(Chebyshev's inequality).
Clipping handles EXTREMELY good/bad systems.
Linear transforms with such (A, B) are used in educational
research: A=100, B=500 for the SAT and GRE [Lodico+10];
A=10, B=50 for Japanese hensachi "standard scores".
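(A sketch of std-AB; the clipping form is reconstructed from the [0,1] range stated above, with the slide's (A, B) as defaults:)

    def std_ab(z, A=0.15, B=0.5):
        # Linear transform of a standardised score, clipped to [0, 1].
        # Chebyshev: P(|z| >= 3) <= 1/9, so with A=0.15 at least ~89% of
        # scores fall within B +/- 3A = [0.05, 0.95]; clipping is rare.
        return min(1.0, max(0.0, A * z + B))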
In practice, clipping does not happen often.
[Per-topic score plots over Topic ID, TREC04: raw nDCG (left) vs. std-AB nDCG (right), both on a [0,1] y-axis.]
[Sakai16ICTIR] bottom line
• Advantages of score standardisation:
- removes topic hardness, enables comparison across test collections
- normalisation becomes unnecessary
• Advantages of std-AB over std-CDF:
Low within-system variances and therefore
- Substantially lower swap rates (higher consistency across different
data)
- Enables us to consider realistic topic set sizes in topic set design
Swap rates for std-CDF can be higher than
those for raw scores, probably due to its
nonlinear transformation
std-AB is a good alternative to std-CDF.
TALK OUTLINE
1. Score standardisation
2. Topic set size design
3. NTCIR-12 tasks
4. Results
5. Conclusions
6. Future work: NTCIR WWW
Topic set size design (1) [Sakai16IRJ]
• Provides answers to the following question:
“I’m building a new test collection. How many topics should I create?”
• A prerequisite: a small topic-by-run score matrix based on pilot data,
for estimating within-system variances.
• Three approaches (with easy-to-use Excel tools), based on
[Nagata03]:
(1) paired t-test power
(2) one-way ANOVA power
(3) confidence interval width upperbound.
Topic set size design (2) [Sakai16IRJ]
Method / Input required
- Paired t-test: α (Type I error probability), β (Type II error probability), minDt (minimum detectable difference: whenever the diff between two systems is this much or larger, we want to guarantee 100(1-β)% power), and σ̂_d²: a variance estimate for the score delta. (A code sketch for this row follows below.)
- One-way ANOVA: α (Type I error probability), β (Type II error probability), m (number of systems), minD (minimum detectable range: whenever the diff between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power), and σ̂²: an estimate of the within-system variance under the homoscedasticity assumption.
- Confidence intervals: α (Type I error probability), δ (CI width upperbound: you want the CI for the diff between any system pair to be this much or smaller), and σ̂_d²: a variance estimate for the score delta.
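(For readers without Excel: a minimal Python sketch of the paired t-test row above, i.e. the smallest n giving the target power for a two-sided paired t-test, computed via the noncentral t distribution. This is standard power analysis, not the author's tool:)

    from scipy.stats import t as t_dist, nct

    def topics_needed_ttest(var_d, minDt, alpha=0.05, beta=0.20, n_max=10000):
        # var_d: estimated variance of the per-topic score deltas.
        for n in range(2, n_max):
            df = n - 1
            nc = minDt * n ** 0.5 / var_d ** 0.5       # noncentrality parameter
            crit = t_dist.ppf(1 - alpha / 2, df)
            power = (1 - nct.cdf(crit, df, nc)) + nct.cdf(-crit, df, nc)
            if power >= 1 - beta:
                return n
        return None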
Topic set size design (3) [Sakai16IRJ]
Test collection designs should evolve based on past data.
[Flow diagram: a pilot topic-by-run score matrix (n0 topics, m runs; about 25 topics with runs from a few teams is probably sufficient [Sakai+16EVIA]) gives a within-system variance estimate, from which n1 is estimated for TREC 201X. The resulting TREC 201X matrix (n1 topics, m runs) gives a more accurate variance estimate, from which n2 is estimated for TREC 201(X+1).]
Topic set size design (4) [Sakai16IRJ]
In practice, you can deduce t-test-based and CI-based results from ANOVA-based results:
- ANOVA-based results for m=2 can be used instead of t-test-based results.
- ANOVA-based results for m=10 can be used instead of CI-based results.
Caveat: the ANOVA-based tool can only
handle (α, β)=(0.05, 0.20), (0.01, 0.20),
(0.05, 0.10), (0.01, 0.10).
Topic set size design with one-way ANOVA (1)
Method: one-way ANOVA. Input required: α (Type I error probability), β (Type II error probability), m (number of systems), minD (minimum detectable range: whenever the diff between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power), and σ̂²: an estimate of the within-system variance under the homoscedasticity assumption.
Example situation: you plan to compare m systems with one-way ANOVA at α=5%. You plan to use nDCG as a primary evaluation measure, and want to guarantee 80% power whenever the diff between the best and the worst systems is >= minD. You know from pilot data that the within-system variance for nDCG is around σ̂². What is the required number of topics n?
[Diagram: m systems ordered from best to worst; power is guaranteed whenever the best-worst difference D satisfies minD <= D.]
Topic set size design with one-way ANOVA (2)
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx
will do this for you! Use the appropriate sheet for a given (α, β) and fill
out the orange cells. For the example inputs shown, n=20 is what you want! (A code sketch of the same computation follows below.)
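(A minimal Python sketch of the same computation, not the author's Excel tool. It searches for the smallest n whose one-way ANOVA power reaches the target, under the worst-case mean configuration of [Nagata03] where only the best and worst systems differ, by minD, giving noncentrality λ = n·minD²/(2σ̂²):)

    from scipy.stats import f, ncf

    def topics_needed(m, var, minD, alpha=0.05, beta=0.20, n_max=10000):
        # var: within-system variance estimate; m: number of systems.
        for n in range(2, n_max):
            df1, df2 = m - 1, m * (n - 1)
            lam = n * minD ** 2 / (2.0 * var)   # noncentrality parameter
            crit = f.ppf(1 - alpha, df1, df2)   # critical value under H0
            if 1 - ncf.cdf(crit, df1, df2, lam) >= 1 - beta:
                return n
        return None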
Estimating the variance (1)
We need σ̂² for topic set size design based on one-way ANOVA,
and σ̂_d² for that based on the paired t-test or CI.
From a pilot topic-by-run score matrix, obtain σ̂² as a by-product of one-way ANOVA: the residual (within-system) mean square (use two-way ANOVA without replication for tighter estimates).
Then, if possible, pool multiple estimates into a pooled estimate to enhance accuracy (multiple data were not available in this study). See the sketch below.
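(A sketch of the one-way-ANOVA by-product and of pooling, assuming pilot matrices with rows = topics and columns = runs; illustrative code, not the author's tool:)

    import numpy as np

    def within_system_variance(X):
        # Residual (within-group) mean square of one-way ANOVA with runs as
        # groups: the pooled variance of each run's scores about its own mean.
        n, m = X.shape
        resid = X - X.mean(axis=0, keepdims=True)
        return float((resid ** 2).sum() / (m * (n - 1)))

    def pooled_variance(matrices):
        # Pool several estimates, weighting each by its residual degrees of freedom.
        dof = [X.shape[1] * (X.shape[0] - 1) for X in matrices]
        est = [within_system_variance(X) for X in matrices]
        return sum(d * e for d, e in zip(dof, est)) / sum(dof)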
TALK OUTLINE
1. Score standardisation
2. Topic set size design
3. NTCIR-12 tasks
4. Results
5. Conclusions
6. Future work: NTCIR WWW
Variances obtained from NTCIR-12 tasks
[Table: for each NTCIR-12 task and measure, m_C (runs), n_C (topics), and within-system variance estimates for raw vs. std-AB scores.]
Variances are substantially smaller after applying std-AB.
Unnormalised measures can be handled without any problems.
Why the variances are smaller after applying std-AB
The initial estimate of n with the one-way ANOVA topic set size design
is given by [Nagata03] in terms of λ, the required noncentrality parameter
of a noncentral chi-square distribution, and grows in proportion to σ̂²/minD²
(for (α, β)=(0.05, 0.20), λ is roughly a constant determined by m).
So n will be small if σ̂² is small.
With std-AB, σ̂² is indeed small because A is small (e.g. 0.15): std-AB scores
are the linear transform A·z_ij + B, so their within-system variance scales with A².
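(Numerically, with the topics_needed sketch above and purely illustrative variances, not figures from the paper, shrinking σ̂² shrinks the required n roughly in proportion:)

    print(topics_needed(m=10, var=0.10, minD=0.20))  # raw-score-like variance
    print(topics_needed(m=10, var=0.02, minD=0.20))  # std-AB-like variance (scales with A^2)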
System rankings before and after applying std-AB
[Table: for each NTCIR-12 task (m_C runs, n_C topics), system rankings by raw scores vs. by std-AB scores.]
System rankings
before and after
applying std-AB
are statistically
equivalent.
std-AB enables
cross-collection
comparisons
without affecting
within-collection
comparisons!
MedNLPDoc (1) [Aramaki+16]
https://sites.google.com/site/mednlpdoc/
• INPUT: a medical record
• OUTPUT: ICD (international classification of diseases) codes of
possible disease names
• MEASURES: precision and recall of ICD codes
[Two topic-by-run score matrices: precision (78 topics × 14 runs) and recall (76 topics × 14 runs).]
MedNLPDoc (2) [Aramaki+16]
https://sites.google.com/site/mednlpdoc/
76 topics.
Raw recall: lots of 0's, some 1's.
std-AB recall: no 0's, fewer 1's.
[Histograms of per-topic recall scores: raw (left) vs. std-AB (right); x-axis: score range, y-axis: frequency.]
MobileClick-2 iUnit ranking (1) [Kato+16]
http://mobileclick.org/
• INPUT: iUnits (relevant nuggets for a mobile search summary)
• OUTPUT: iUnits ranked by relevance
• MEASURES:
nDCG [Jarvelin+02]
= ( Σ_{r=1..l} g(r)/log(r+1) ) / ( Σ_{r=1..l} g*(r)/log(r+1) ),
where g*(r) is the gain at r in an ideal list.
Q-measure [Sakai05AIRS04]
= (1/R) Σ_{r=1..l} I(r) BR(r), where BR(r) = ( Σ_{k=1..r} I(k) + β Σ_{k=1..r} g(k) ) / ( r + β Σ_{k=1..r} g*(k) )
and I(r) = 1 if the item at rank r is relevant, 0 otherwise. (A code sketch follows below.)
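(A sketch of both measures under these definitions, assuming the run and ideal gain lists are aligned to the same cutoff l, that I(r)=1 iff g(r)>0, and that R and β are supplied by the caller:)

    import math

    def ndcg(gains, ideal_gains):
        # nDCG [Jarvelin+02]: discounted cumulative gain of the run,
        # normalised by that of an ideal list (discount: log2(r+1)).
        dcg = sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))
        idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal_gains, start=1))
        return dcg / idcg

    def q_measure(gains, ideal_gains, R, beta=1.0):
        # Q-measure [Sakai05AIRS04]: mean over relevant ranks of the
        # blended ratio BR(r) = (C(r) + beta*cg(r)) / (r + beta*cg*(r)).
        C = cg = cg_star = 0.0
        total = 0.0
        for r, (g, g_star) in enumerate(zip(gains, ideal_gains), start=1):
            C += 1.0 if g > 0 else 0.0   # C(r): relevant items up to rank r
            cg += g                      # cumulative gain of the run
            cg_star += g_star            # cumulative gain of the ideal list
            if g > 0:                    # I(r) = 1
                total += (C + beta * cg) / (r + beta * cg_star)
        return total / R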
MobileClick-2 iUnit ranking (2) [Kato+16]
http://mobileclick.org/
Raw nDCG: hard topics, easy topics.
std-AB nDCG: topics look more comparable to one another.
[Histograms of per-topic nDCG scores: raw (left) vs. std-AB (right); y-axis: frequency.]
MobileClick-2 iUnit summarisation (1) [Kato+16]
http://mobileclick.org/
• INPUT: iUnits (relevant nuggets for a mobile search summary)
• OUTPUT: two-layered textual
summary
• MEASURES:
M-measure, a variant of the
intent-aware U-measure
[Sakai+13SIGIR]
M-measure is an unnormalised
measure: it does not have the [0,1] range.
(Intent-aware measures are difficult to normalise.)
[Kato+16]
MobileClick-2 iUnit summarisation (2) [Kato+16]
http://mobileclick.org/
Raw M-measure:
- unnormalised, unbounded, extremely large variances
- topics definitely not comparable (note the different scale of the y axis); clearly violates i.i.d.
std-AB M-measure:
- no problem!
[Histograms of M-measure scores: raw (bins such as 40-45) vs. std-AB (bins such as 0.9-1.0); y-axis: frequency.]
STC (short text conversation) (1) [Shang+16]
http://ntcir12.noahlab.com.hk/stc.htm
• INPUT: a Weibo post (Chinese tweet)
• OUTPUT: a ranked list of Weibo posts from a repository that serve as valid
responses to the input
• MEASURES:
nG@1 (normalised gain at 1, a.k.a. "nDCG@1")
nERR@10 [Chapelle+11] (a code sketch follows after this list)
P+ [Sakai06AIRS], a variant of Q-measure
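(A sketch of nERR@10 under the usual exponential gain mapping from Chapelle et al.'s ERR; whether STC used exactly this mapping is an assumption here:)

    def err(grades, g_max, cutoff=10):
        # Cascade model: the user stops at rank r with probability
        # R(g) = (2^g - 1) / 2^g_max, contributing 1/r to the expectation.
        p_continue, total = 1.0, 0.0
        for r, g in enumerate(grades[:cutoff], start=1):
            stop = (2 ** g - 1) / 2 ** g_max
            total += p_continue * stop / r
            p_continue *= 1.0 - stop
        return total

    def nerr(grades, ideal_grades, g_max, cutoff=10):
        # nERR@k: ERR of the run normalised by that of an ideal list.
        return err(grades, g_max, cutoff) / err(ideal_grades, g_max, cutoff)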
STC (short text conversation) (2) [Shang+16]
http://ntcir12.noahlab.com.hk/stc.htm
Raw P+:
- lots of 1's and 0's
- gap in the [0.625, 1] range (see previous slide)
std-AB P+:
- looks like a continuous measure!
- fewer 1's, no 0's
[Per-topic P+ scores for topics 1-100: raw (left) vs. std-AB (right), on a [0,1] y-axis, with score histograms.]
STC (short text conversation) (3) [Shang+16]
http://ntcir12.noahlab.com.hk/stc.htm
Raw nG@1: 0 or 1/3 or 1!
std-AB nG@1:
- looks like a continuous measure!
- fewer 1's, no 0's
[Histograms of nG@1 scores: raw (left) vs. std-AB (right); y-axis: frequency.]
QALab-2 (1) [Shibuki+16]
http://research.nii.ac.jp/qalab/
• INPUT: a multiple-choice Japanese National Center Test (university
entrance exam) question on world history
• OUTPUT: the choice deemed correct by the system
• MEASURES:
Boolean: 1 (correct) or 0 (incorrect)
QALab-2 (2) [Shibuki+16]
http://research.nii.ac.jp/qalab/
36 topics.
Raw Boolean: 0 or 1!
std-AB Boolean: two distinct ranges of values,
[0.2999, 0.4460] and [0.6091, 0.9047].
The normality assumption is still clearly violated: our topic set size design
results should be interpreted as those for normally-distributed measures
that happen to have variances similar to raw/std-AB Boolean.
(QALab-2 organisers sorted the topics by #systems_correctly_answered
before providing the matrices to the present author.)
[Histograms of Boolean scores: raw (left) vs. std-AB (right); y-axis: frequency.]
TALK OUTLINE
1. Score standardisation
2. Topic set size design
3. NTCIR-12 tasks
4. Results
5. Conclusions
6. Future work: NTCIR WWW
A few recommendations for MedNLPDoc (1)
With raw recall:
create 100 topics to guarantee 80% power for
- minD=0.10 for m=2 systems
- minD=0.20 for m=50 systems
MedNLPDoc had
76-78 topics at NTCIR-12.
A few recommendations for MedNLPDoc (2)
With std-AB recall:
create 80 topics to guarantee 80% power for
- minD=0.05 for m=2 systems
- minD=0.10 for m=50 systems
MedNLPDoc had
76-78 topics at NTCIR-12.
Topic set size choices look much
more practical when std-AB is
used (due to low variance)
A few recommendations for MobileClick-2 (1)
MobileClick-2 had 100 topics at NTCIR-12.
Topic set size needs to be set by considering both
subtasks, but raw M-measure cannot be handled
due to extremely large variance. If we only
consider iUnit ranking raw nDCG@3:
create 90 topics to guarantee 80% power for
- minD=0.10 for m=10 English systems
- minD=0.10 for m=2 Japanese systems
A few recommendations for MobileClick-2 (2)
MobileClick-2 had 100 topics at NTCIR-12.
With std-AB nDCG@3 and std-AB M-measure:
create 100 topics to guarantee 80% power for
- minD=0.10 for m=20 English and m=30 Japanese
iUnit ranking systems
- minD=0.05 for m=10 English and m=10 Japanese
iUnit summarisation systems
A few recommendations for STC (1)
With (a normally distributed measure whose variance is similar to that of) raw nG@1:
create 120 topics to guarantee 80% power for
- minD=0.20 for m=20 systems
STC had
100 topics at NTCIR-12.
A few recommendations for STC (2)
STC had
100 topics at NTCIR-12.
With std-AB nG@1:
create 100 topics to guarantee 80% power for
- minD=0.10 for m=30 systems
Topic set size choices look much
more practical when std-AB is
used (due to low variance)
A few recommendations for QALab-2 (1)
QALab-2 had
36-41 topics at NTCIR-12:
not sufficient from the
viewpoint of power
With (a normally distributed measure whose variance is similar to that of) raw Boolean:
create 90 topics to guarantee 80% power for
- minD=0.20 for m=2 systems
A few recommendations for QALab-2 (2)
QALab-2 had
36-41 topics at NTCIR-12.
With (a normally distributed measure whose variance is similar to that of) std-AB Boolean:
create 40 topics to guarantee 80% power for
- minD=0.10 for m=2 systems
- minD=0.20 for m=50 systems
Topic set size choices look much
more practical when std-AB is
used (due to low variance)
TALK OUTLINE
1. Score standardisation
2. Topic set size design
3. NTCIR-12 tasks
4. Results
5. Conclusions
6. Future work: NTCIR WWW
Conclusions
• std-AB suppresses score variances and thereby enables test collection
builders to consider realistic choices of topic set sizes.
• Topic set size design with std-AB can handle even unnormalised measures
such as M-measure (U-measure, TBG, alpha-nDCG, ERR-IA, etc.).
• Even discrete measures such as nG@1 (0 or 1/3 or 1) look more
continuous after applying std-AB, which makes the topic set size
design results (based on normality and i.i.d. assumptions) perhaps a
little more believable.
• Test collection designs should evolve based on experiences (i.e.
variances pooled from past data).
TALK OUTLINE
1. Score standardisation
2. Topic set size design
3. NTCIR-12 tasks
4. Results
5. Conclusions
6. Future work: NTCIR WWW
How long will the standardisation factors for
each topic remain valid?
standardised score for the i-th system on the j-th topic:
z_ij = (x_ij - μ_j) / σ_j
Subtract the mean; divide by the standard deviation: how good is system i compared to the "average" system, in standard deviation units? The standardising factors μ_j and σ_j come from the current topic-by-run matrix.
These systems will
eventually
become outdated,
right?
We Want Web@NTCIR-13 (1)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-13 (Dec 2017): the NTCIR-13 systems produce new runs, pooled for the frozen topic set + the NTCIR-13 fresh topic set.]
We Want Web@NTCIR-13 (2)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-13 (Dec 2017): official NTCIR-13 results are discussed with the fresh topics. Qrels + std. factors based on the NTCIR-13 systems: NOT released for the frozen topic set; released for the fresh topic set.]
We Want Web@NTCIR-14 (1)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-13 (Dec 2017) → NTCIR-14 (Jun 2019): the NTCIR-14 systems produce new runs, pooled for the frozen topic set + the NTCIR-14 fresh topic set; revived NTCIR-13 runs are pooled for the fresh topics.]
We Want Web@NTCIR-14 (2)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-14 (Jun 2019): official NTCIR-14 results are discussed with the fresh topics. Qrels + std. factors based on the NTCIR-13+14 systems: NOT released for the frozen topic set; qrels + std. factors based on the NTCIR-(13+)14 systems: released for the fresh topic set. Using the NTCIR-14 fresh topics, compare new NTCIR-14 runs with revived runs and quantify progress.]
We Want Web@NTCIR-15 (1)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-13 (Dec 2017) → NTCIR-14 (Jun 2019) → NTCIR-15 (Dec 2020): each round reuses the frozen topic set and adds its own fresh topic set; the NTCIR-15 systems produce new runs, pooled for the frozen + fresh topics, and revived runs are pooled for the fresh topics.]
We Want Web@NTCIR-15 (2)
http://www.thuir.cn/ntcirwww/
[Diagram, NTCIR-15 (Dec 2020): official NTCIR-15 results are discussed with the fresh topics; qrels + std. factors based on the NTCIR-(13+14+)15 systems are released. Using the NTCIR-15 fresh topics, compare new NTCIR-15 runs with revived runs and quantify progress.]
We Want Web@NTCIR-15 (3)
http://www.thuir.cn/ntcirwww/
[Diagram: with qrels + std. factors based on the NTCIR-13, NTCIR-13+14, and NTCIR-13+14+15 systems all released by the end of NTCIR-15, how do the standardisation factors for each frozen topic differ across the 3 rounds?]
We Want Web@NTCIR-15 (4)
http://www.thuir.cn/ntcirwww/
[Diagram: rank the NTCIR-15 systems on the frozen topics three times, using qrels + std. factors based on the NTCIR-13, NTCIR-13+14, and NTCIR-13+14+15 systems. How do the NTCIR-15 system rankings differ across the 3 rounds, with and w/o standardisation?]
See you all in Tokyo, in August/December 2017!
Selected references (1)
[Aramaki+16] Aramaki et al.: Overview of the NTCIR-12 MedNLPDoc task, NTCIR-12
Proceedings, 2016.
[Carterette+08] Carterette et al.: Evaluation over Thousands of Queries, SIGIR 2008.
[Chapelle+11] Chapelle et al.: Intent-based Diversification of Web Search Results: Metrics
and Algorithms, Information Retrieval 14(6), 2011.
[Jarvelin+02] Järvelin and Kekäläinen: Cumulated Gain-based Evaluation of IR Techniques,
ACM TOIS 20(4), 2002.
[Gilbert+79] Gilbert and Sparck Jones: Statistical Bases of Relevance Assessment for the
`IDEAL' Information Retrieval Test Collection, Computer Laboratory, University of
Cambridge, 1979.
[Kato+16] Kato et al.: Overview of the NTCIR-12 MobileClick task, NTCIR-12 Proceedings,
2016.
[Nagata03] Nagata: How to Design the Sample Size (in Japanese), Asakura Shoten, 2003.
Selected references (2)
[Sakai05AIRS04] Sakai: Ranking the NTCIR Systems based on Multigrade Relevance, AIRS
2004 (LNCS 3411), 2005.
[Sakai06AIRS] Sakai: Bootstrap-based Comparisons of IR Metrics for Finding One Relevant
Document, AIRS 2006 (LNCS 4182).
[Sakai+13SIGIR] Sakai and Dou: Summaries, Ranked Retrieval and Sessions: A Unified
Framework for Information Access Evaluation, SIGIR 2013.
[Sakai16ICTIR] Sakai: A Simple and Effective Approach to Score Standardisation, ICTIR 2016.
[Sakai16ICTIRtutorial] Sakai: Topic set size design and power analysis in practice (tutorial),
ICTIR 2016.
[Sakai16IRJ] Sakai: Topic set size design, Information Retrieval, 19(3), 2016. OPEN ACCESS:
http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf
[Sakai+16EVIA] Sakai and Shang: On Estimating Variances for Topic Set Size Design, EVIA
2016. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings12/pdf/evia/02-
EVIA2016-SakaiT.pdf
Selected references (3)
[Shang+16] Shang et al.: Overview of the NTCIR-12 short text conversation task, NTCIR-12
Proceedings, 2016.
[Shibuki+16] Shibuki et al.: Overview of the NTCIR-12 QA Lab-2 task, NTCIR-12 Proceedings,
2016.
[SparckJones+75] Sparck Jones and Van Rijsbergen: Report on the Need for and Provision
on an `Ideal’ Information Retrieval Test Collection, Computer Laboratory, University of
Cambridge, 1975.
[Voorhees+05] Voorhees and Harman: TREC: Experiment and Evaluation in Information
Retrieval, The MIT Press, 2005.
[Voorhees09] Voorhees: Topic Set Size Redux, SIGIR 2009.
[Webber+08SIGIR] Webber, Moffat, Zobel: Score standardisation for inter-collection
comparison of retrieval systems, SIGIR 2008.
[Webber+08CIKM] Webber, Moffat, Zobel: Statistical power in retrieval experimentation,
CIKM 2008.
