Overview of the
NTCIR-14 CENTRE Task
Tetsuya Sakai Nicola Ferro Ian Soboroff
Zhaohao Zeng Peng Xiao Maria Maistro
Waseda University, Japan
University of Padua, Italy
NIST, USA
http://www.centre-eval.org/
Twitter: @_centre_
June 11 and 13, 2019 @ NTCIR-14, Tokyo.
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
Motivation
• Researcher A publishes a paper; says
Algo X > Algo Y on test data D.
⇒ Researcher B tries Algo X and Y on D but finds
Algo X < Algo Y. A replicability problem.
• Researcher A publishes a paper; says
Algo X > Algo Y on test data D.
⇒ Researcher B tries Algo X and Y on D’ but finds
Algo X < Algo Y. A reproducibility problem.
Can the IR community do better?
Research questions
• Can results from CLEF, NTCIR, and TREC be
replicated/reproduced easily? If not, what are the
difficulties? What needs to be improved?
• Can results from (say) TREC be reproduced with
(say) NTCIR data? Why or why not?
• [Participants may have their own research
questions]
CENTRE = CLEF/NTCIR/TREC
(replicability and) reproducibility
[Diagram] Each venue contributes data, code, and wisdom to replicability and reproducibility experiments:
• CLEF: Conferences and Labs of the Evaluation Forum (Europe and more; annual)
• NTCIR: NII Testbeds and Community for Information access Research (Asia and more; sesquiannual)
• TREC: Text REtrieval Conference (US and more; annual)
Who should participate in CENTRE?
• IR evaluation nerds
• Students willing to replicate/reproduce SOTA
methods in IR, so that they will eventually be able
to propose their own methods and demonstrate
their superiority over the SOTA
• Those who want to be involved in more than one
evaluation community across the globe
• Those who want to use data from more than one
evaluation conference
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
CENTRE@NTCIR-14: How it works (1)
[Diagram] Target run pairs (A: advanced run, B: baseline run) are drawn from past CLEFs, past TRECs, and past NTCIRs (selected by the organisers), or from the IR literature (selected by each team).
CENTRE@NTCIR-14: How it works (2)
[Diagram] T1: replicability subtask (same NTCIR data). The target run pair (A: advanced, B: baseline) from a past NTCIR, selected by the organisers, is re-implemented by each team to produce the replicated run pair T1{A,B}.
Evaluation measures:
• RMSE of the topicwise deltas
• Pearson correlation of the topicwise deltas
• Effect Ratio (ER)
CENTRE@NTCIR-14: How it works (2)
[Diagram] T2: reproducibility subtask. Target run pairs from past CLEFs and past TRECs (selected by the organisers), or from the IR literature (selected by each team), are reproduced on NTCIR data as the run pairs T2CLEF-{A,B}, T2TREC-{A,B}, and T2OPEN-{A,B}. Past CLEF runs were not considered at CENTRE@NTCIR-14.
Evaluation measure: Effect Ratio (ER)
T2OPEN runs evaluated only with effectiveness measures
Target runs for CENTRE@NTCIR-14
• We focus on ad hoc monolingual web search.
• For T1 (replicability): A pair of runs from the NTCIR-13 We Want Web
task
- Overview paper:
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings13/pdf/ntcir/01-NTCIR13-OV-WWW-
LuoC.pdf
• For T2 (reproducibility): A pair of runs from the TREC 2013 web track
- Overview paper:
https://trec.nist.gov/pubs/trec22/papers/WEB.OVERVIEW.pdf
• All of the above target runs use clueweb12 as the search corpus
http://www.lemurproject.org/clueweb12.php/
• Both the target NTCIR and TREC runs are implemented using Indri
https://www.lemurproject.org/indri/
Target NTCIR runs and data
for T1 (replicability)
• Test collection:
NTCIR-13 WWW English (clueweb12-B13, 100 topics)
• Target NTCIR run pairs:
From the NTCIR-13 WWW RMIT paper [Gallagher+17]
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings13/pdf/ntcir/02-NTCIR13-WWW-GallagherL.pdf
A: RMIT-E-NU-Own-1 (sequential dependency model: SDM)
B: RMIT-E-NU-Own-3 (full dependency model: FDM)
• SDM and FDM are from [Metzler+Croft SIGIR05]
https://doi.org/10.1145/1076034.1076115
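For illustration, an SDM query in the Indri query language combines unigram, ordered-window (#1), and unordered-window (#uw8) evidence. The placeholder terms q1 q2 q3 and the 0.85/0.10/0.05 weights below are the commonly used defaults from the Metzler and Croft paper, not necessarily the exact settings of the RMIT runs; FDM is analogous but builds the window components from all term pairs rather than only adjacent ones.

```
#weight(
  0.85 #combine( q1 q2 q3 )
  0.10 #combine( #1(q1 q2) #1(q2 q3) )
  0.05 #combine( #uw8(q1 q2) #uw8(q2 q3) )
)
```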
Target TREC runs and data
for T2 (reproducibility)
• Test collection:
TREC 2013 Web Track (clueweb12 category A, 50 topics)
• Target TREC run pairs:
From the TREC 2013 Web Track U Delaware paper [Yang+13] :
https://trec.nist.gov/pubs/trec22/papers/udel_fang-web.pdf
A: UDInfolabWEB2 (selects semantically related terms using
Web-based working sets)
B: UDInfolabWEB1 (selects semantically related terms using
collection-based working sets)
• The working set construction method is from
[Fang+SIGIR06]
https://doi.org/10.1145/1148170.1148193
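The A-run and B-run differ only in where the working set comes from (the Web vs. the target collection). As a very rough, hypothetical sketch of the general expansion idea only (helper names are invented here; this does not reproduce the actual semantic term matching framework of [Fang+SIGIR06] or the UDel implementation):

```python
from collections import Counter

def expand_query(query_terms, working_set_docs, k=10):
    """Add the k candidate terms that co-occur most often with the
    query terms inside the working set (each doc is a token list)."""
    counts = Counter()
    for doc in working_set_docs:
        if any(t in doc for t in query_terms):
            counts.update(w for w in doc if w not in query_terms)
    return list(query_terms) + [w for w, _ in counts.most_common(k)]

# A-run: working_set_docs drawn from a Web source;
# B-run: working_set_docs drawn from the target collection itself.
```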
How participants were expected to
submit T1 (replicability) runs
0. Obtain clueweb12-B13 (or the full data)
http://www.lemurproject.org/clueweb12.php/
and Indri https://www.lemurproject.org/indri/
1. Register for the CENTRE task and obtain the NTCIR-13 WWW
topics and qrels from the CENTRE organisers
2. Read the NTCIR-13 WWW RMIT paper [Gallagher+17]
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings13/pdf/ntcir/02-NTCIR13-WWW-GallagherL.pdf
and try to replicate the A-run and the B-run on the NTCIR-13
WWW-1 test collection
A: RMIT-E-NU-Own-1
B: RMIT-E-NU-Own-3
3. Submit the replicated runs by the deadline
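As a concrete but purely illustrative starting point, a replication attempt with Indri might run the replicated queries through IndriRunQuery with a parameter file along the following lines; the index path, query text, and parameter values are assumptions, not the RMIT settings:

```
<parameters>
  <index>/path/to/clueweb12-B13-index</index>
  <count>1000</count>
  <trecFormat>true</trecFormat>
  <runID>CENTRE1-MYTEAM-T1-A</runID>
  <query>
    <number>0001</number>
    <text>#weight( 0.85 #combine( q1 q2 q3 ) ... )</text>
  </query>
</parameters>
```

Invoked as, e.g., IndriRunQuery params.xml > CENTRE1-MYTEAM-T1-A.txt.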
How participants were expected to
submit T2TREC (reproducibility) runs
0. Obtain clueweb12-B13 (or the full data)
http://www.lemurproject.org/clueweb12.php/
and (optionally) Indri https://www.lemurproject.org/indri/
1. Register for the CENTRE task and obtain the NTCIR-13 WWW topics and
qrels from the CENTRE organisers
2. Read the TREC 2013 Web Track U Delaware paper [Yang+13] :
https://trec.nist.gov/pubs/trec22/papers/udel_fang-web.pdf
and try to reproduce the A-run and the B-run on the NTCIR-13 WWW-1
test collection
A: UDInfolabWEB2
B: UDInfolabWEB1
3. Submit the reproduced runs by the deadline
How participants were expected to
submit T2OPEN (reproducibility) runs
0. Obtain clueweb12-B13 (or the full data)
http://www.lemurproject.org/clueweb12.php/
and (optionally) Indri https://www.lemurproject.org/indri/
1. Register for the CENTRE task and obtain the NTCIR-13 WWW topics and
qrels from the CENTRE organisers
2. Pick any pair of runs A (advanced) and B (baseline) from the IR literature,
and try to reproduce the A-run and the B-run on the NTCIR-13 WWW-1
test collection
3. Submit the reproduced runs by the deadline
T2OPEN not discussed further in this overview.
For more, see:
Andrew Yates: MPII at the NTCIR-14 CENTRE Task
Submission instructions
• Each team can submit only one run pair per subtask
(T1, T2TREC, T2OPEN). So at most 6 runs in total.
• Run file format: same as WWW-1; see
http://www.thuir.cn/ntcirwww/ (Run Submissions
format). In short, it’s a TREC run file plus a SYSDESC
line.
• Run file names:
CENTRE1-<teamname>-<subtaskname>-[A,B], where A = Advanced and B = Baseline
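For illustration only, a submitted run file would look roughly like the following; the document IDs, scores, and the wording of the SYSDESC line are made up, and the exact SYSDESC syntax is the one specified in the WWW-1 run submission instructions:

```
<SYSDESC>Replication of RMIT SDM with Indri, default parameters</SYSDESC>
0001 Q0 clueweb12-0000wb-05-12114 1 -5.3112 CENTRE1-MYTEAM-T1-A
0001 Q0 clueweb12-0108wb-10-09123 2 -5.4027 CENTRE1-MYTEAM-T1-A
...
```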
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
Only one team participated…
Additional relevance assessments
• NTCIR-13 WWW-1 English qrels expanded based on
the new CENTRE runs (i.e. 6 runs from MPII).
• Depth-30 pools were created, following WWW-1.
• Previously unjudged topic-doc pairs (2,617) were judged.
[Diagram]
NTCIR-13 WWW E (original qrels): 100 topics; 22,912 topic-doc pairs; based on 13 runs from three teams.
CENTRE@NTCIR-14 (augmented qrels): 100 topics; 22,912 + 2,617 = 25,529 topic-doc pairs.
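A minimal sketch of the depth-30 pooling step described above, assuming each run has already been parsed into a per-topic ranked list (names are illustrative):

```python
def depth_k_pool(runs, k=30):
    """runs: list of {topic_id: [doc_id, ...]} dicts, one per run.
    Returns the set of (topic_id, doc_id) pairs in the depth-k pool."""
    pool = set()
    for run in runs:
        for topic_id, ranked_docs in run.items():
            for doc_id in ranked_docs[:k]:
                pool.add((topic_id, doc_id))
    return pool

# Only pairs that are not already in the WWW-1 qrels go out for judging:
# new_pairs = depth_k_pool(centre_runs) - already_judged_pairs
```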
Inter-assessor agreement
(two assessors per topic)
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
T1: examining the topicwise deltas
For a given collection and evaluation measure M, let the original topicwise improvement on topic j be ΔM_j = M_j(A) − M_j(B), and the replicated topicwise improvement be ΔM'_j = M_j(T1-A) − M_j(T1-B).
• Pearson correlation of the topicwise deltas:
  r = covariance({ΔM_j}, {ΔM'_j}) / ( standard deviation of {ΔM_j} × standard deviation of {ΔM'_j} )
• RMSE (Root Mean Squared Error):
  RMSE = sqrt( (1/n) Σ_j (ΔM'_j − ΔM_j)² )
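A minimal sketch of how these two statistics could be computed from per-topic scores (the per-topic values of a measure such as nERR are assumed to be available already; names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

def topicwise_deltas(scores_A, scores_B):
    """Per-topic improvement Delta_j = M_j(A) - M_j(B)."""
    return np.asarray(scores_A) - np.asarray(scores_B)

def rmse_and_correlation(orig_A, orig_B, repl_A, repl_B):
    d_orig = topicwise_deltas(orig_A, orig_B)   # original deltas
    d_repl = topicwise_deltas(repl_A, repl_B)   # replicated deltas
    rmse = np.sqrt(np.mean((d_repl - d_orig) ** 2))
    r, p_value = pearsonr(d_orig, d_repl)
    return rmse, r, p_value
```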
T1 and T2: Effect Ratio (1)
• T1 (replicability):
  ER = replicated mean improvement / original mean improvement
     = ( (1/n) Σ_j ΔM'_j ) / ( (1/n) Σ_j ΔM_j )
• T2TREC (reproducibility):
  ER = reproduced mean improvement / original mean improvement,
  where the reproduced mean improvement is computed over the NTCIR WWW-1 topics and the original mean improvement over the TREC 2013 Web Track topics.
T1 and T2: Effect Ratio (2)
• ER compares the sample (unstandardised) effect
size in the “rep” experiment against that in the
original experiment. “Can we still observe the same
effect?”
• ER>1 ⇒ effect larger in “rep” experiment
• ER=1 ⇒ effect identical to original experiment
• 0 < ER < 1 ⇒ effect smaller in “rep” experiment
• ER ≤ 0 ⇒ FAILURE: A-run does not outperform B-run in the “rep” experiment
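Following the definitions above, a small sketch of computing the Effect Ratio from per-topic scores (for T2 the original and “rep” deltas come from different topic sets, so only the means are compared; names are illustrative):

```python
import numpy as np

def effect_ratio(rep_A, rep_B, orig_A, orig_B):
    """ER = mean improvement in the 'rep' experiment / original mean improvement.
    For T1 both pairs are scored on the same NTCIR topics; for T2TREC the
    'rep' pair is scored on NTCIR topics and the original pair on TREC
    topics, so the two arrays may have different lengths."""
    mean_rep = np.mean(np.asarray(rep_A) - np.asarray(rep_B))
    mean_orig = np.mean(np.asarray(orig_A) - np.asarray(orig_B))
    return mean_rep / mean_orig   # ER <= 0: the A-over-B effect was not observed
```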
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
Replicability: w/o additional assessments (1)
• RMIT said “SDM (A) outperforms FDM (B)”
• MPII confirms this!
• A statistically significantly outperforms B even with nERR!
• Effect sizes for nERR (Orig vs Repli) quite similar.
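nERR itself is not defined on these slides; as a reference point, one common formulation (ERR of Chapelle et al., normalised by the ERR of an ideal ranking) is sketched below. The official WWW-1 evaluation used the NTCIREVAL toolkit, whose exact definition and cutoff may differ in detail.

```python
def err(gains, max_grade):
    """Expected Reciprocal Rank for one ranked list of relevance grades."""
    not_stopped, score = 1.0, 0.0
    for rank, g in enumerate(gains, start=1):
        r = (2 ** g - 1) / (2 ** max_grade)   # prob. the user is satisfied here
        score += not_stopped * r / rank
        not_stopped *= (1.0 - r)
    return score

def nerr(run_gains, all_rel_gains, max_grade, cutoff=10):
    ideal = sorted(all_rel_gains, reverse=True)[:cutoff]
    return err(run_gains[:cutoff], max_grade) / err(ideal, max_grade)
```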
Replicability: w/o additional assessments (2)
• ER for nERR close to 1
• Correlation (Orig Δ vs Repli Δ) for nERR statistically significant
Replicability: WITH additional assessments
• ER for nERR close to 1
• Correlation (Orig Δ vs Repli Δ) for nERR statistically significant: nERR, r = 0.2610, 95%CI [0.0680, 0.4351]
• Example topics from the scatter plot: Orig Δ ≈ 0.4 vs Repli Δ ≈ 0.8 (A>B); Orig Δ ≈ −0.3 vs Repli Δ ≈ −0.6 (A<B)
• But replicability not so successful at the topic level
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
Reproducibility: w/o additional assessments (1)
• UDel said “web-based working sets (A) outperform
collection-based working sets (B)” for TREC
• MPII confirms this for NTCIR
• A statistically significantly outperforms B even with Q!
• Effect sizes for Q (Orig vs Repli) quite similar.
Reproducibility: w/o additional assessments (2)
• ER for Q close to 1
Reproducibility: WITH additional assessments
• Larger ER values!
• Smaller p-values!
• New relevant docs were found by the CENTRE runs; the additional assessments were worthwhile!
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
CENTRE@NTCIR-14 conclusions
• We focussed on: “Can an effect (the gain of A over B)
be replicated/reproduced?”
• Replication of the RMIT effect on WWW-1 data:
successful at the ER level; not so successful at the
topic level.
• Reproduction of the TREC UDel effect on WWW-1
data: successful at the ER level; additional
relevance assessments further improved the ERs.
• Only one participating team…this needs to change.
SPECIAL THANKS
• To Andrew Yates (Max Planck Institute for
Informatics, Germany)
• MPII at the NTCIR-14 CENTRE Task
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
WWW and CENTRE join forces
• WWW-3 task will have three subtasks:
(1) Adhoc Chinese
(2) Adhoc English (WWW-2 + new WWW-3 topics)
(3) CENTRE (WWW-2 + new WWW-3 topics)
• CENTRE will accept two run types:
(a) Revived runs from WWW-2 (Tsinghua and RUC)
(b) REP (REPlicated and REProduced) runs: Tsinghua
and RUC algorithms re-implemented by other
teams
Measuring progress: WWW-2 top run → WWW-3 top run
[Diagram] The WWW-2 target runs (target run 1 from Team A, target run 2 from Team B), which processed the WWW-2 topics, are revived by the same teams as WWW-3 CENTRE revived runs 1 and 2; the revived runs and the WWW-3 Adhoc runs (e.g. from Team D) all process the new WWW-3 topics.
COMPARE the WWW-2 top run with the WWW-3 top run using the WWW-3 topics (since WWW-3 systems may be tuned with WWW-2 topics).
Measuring replicability on WWW-2 topics
[Diagram] As above, the WWW-2 target runs (Teams A and B) are revived by the same teams; in addition, they are replicated/reproduced by a different team (Team C) as WWW-3 CENTRE REP runs 1-1 and 1-2.
COMPARE the original/revived runs with the REP runs using the WWW-2 topics.
Measuring reproducibility: WWW-2 topics → WWW-3 topics
[Diagram] The same REP runs (Team C) also process the new WWW-3 topics.
COMPARE the original/revived runs on the WWW-2 topics with the REP runs on the WWW-3 topics.
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
CENTRE@CLEF 2019
CLEF NTCIR TREC
REproducibility
Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya
Sakai, and Ian Soboroff
April 15, 2019, Cologne
http://www.centre-eval.org/clef2019/
CENTRE@CLEF 2019
Tasks
The CENTRE@CLEF 2019 lab will offer three tasks:
• Task 1 - Replicability: the task will focus on the
replicability of selected methods on the same
experimental collections;
• Task 2 - Reproducibility: the task will focus on the
reproducibility of selected methods on different
experimental collections;
• Task 3 - Generalizability: the task will focus on collection
performance prediction; the goal is to rank (sub-)collections
on the basis of the expected performance over them.
Task 1 & 2: Selected
Papers
For Task 1 - Replicability and Task 2 - Reproducibility we
selected the following papers from TREC Common
Core Track 2017 and 2018:
• [Grossman et al, 2017] Grossman, M. R., and
Cormack, G. V. (2017). MRG_UWaterloo and
Waterloo Cormack Participation in the TREC 2017
Common Core Track. In TREC 2017.
• [Benham et al, 2018] Benham, R., Gallagher, L.,
Mackenzie, J., Liu, B., Lu, X., Scholer, F., Moffat, A.,
and Culpepper, J. S. (2018). RMIT at the 2018 TREC
CORE Track. In TREC 2018.
Task 3 - Generalizability
• Training: participants need to run plain BM25 and
they need to identify features of the corpus and
topics that allow them to predict the system score
with respect to Average Precision (AP).
• Validation: participants can validate their method
and determine which set of features represents the
best choice for predicting the AP score for each system.
• Test (submission): participants submit a run for
each system (BM25 and their own system) and an
additional file (one for each system) including the
AP score predicted for each topic.
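A minimal sketch of the kind of pipeline Task 3 asks for; the feature names and the choice of regression model are illustrative assumptions, not part of the task definition:

```python
import numpy as np
from sklearn.linear_model import Ridge

# X_train: one feature vector per (sub-)collection/topic pair, e.g. corpus size,
# average document length, query length, IDF statistics of the query terms.
# y_train: the AP obtained by plain BM25 on that (sub-)collection/topic.
def fit_ap_predictor(X_train, y_train):
    model = Ridge(alpha=1.0)
    model.fit(X_train, y_train)
    return model

def predict_ap(model, X_test):
    return np.clip(model.predict(X_test), 0.0, 1.0)   # AP lies in [0, 1]

# Submission: the BM25 run (and the team's own run) plus one file per system
# containing the predicted AP score for each topic.
```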
SIGIR 2019 Open-Source IR
Replicability Challenge (OSIRRC 2019)
• Repeatability, replicability, and reproducibility:
• Good science!
• Helps sustain cumulative empirical progress in the field
• OSIRRC is a workshop at SIGIR with three goals:
• Develop a common Docker interface to capture ad hoc
retrieval experiments
• Build a library of Docker images that implements the
interface
• Explore this approach to additional tasks and evaluation
methodologies
OSIRRC 2019: Current Status
• We’ve converged on a Docker specification for ad
hoc retrieval experiments – called “the jig”
• All materials at https://github.com/osirrc
• Initial library of 12 images from 9 different groups
under development
• It’s not too late if you still want to participate!
• Wrapping an existing retrieval system in “the jig” isn’t
too difficult a task…
• See you at SIGIR 2019!
Nicola Ferro
@frrncl
University of Padua, Italy
The 3rd Strategic Workshop on Information Retrieval in Lorne (SWIRL III)
14 February 2018, Lorne, Australia
ACM Artifact Badging - SIGIR
Survey
ACM Badging: Artifacts Evaluated
This badge is applied to papers whose associated artifacts have
successfully completed an independent audit. Artifacts need not
be made publicly available to be considered for this badge.
However, they do need to be made available to reviewers.
Artifacts Evaluated – Functional
The artifacts associated with the research are found to be documented,
consistent, complete, exercisable, and include appropriate evidence of
verification and validation.
Artifacts Evaluated – Reusable
The artifacts associated with the paper are of a quality that significantly
exceeds minimal functionality. That is, they have all the qualities of the
Artifacts Evaluated – Functional level, but, in addition, they are very
carefully documented and well-structured to the extent that reuse and
repurposing is facilitated. In particular, norms and standards of the
research community for artifacts of this type are strictly adhered to.
ACM Badging: Artifacts Available
This badge is applied to papers in which
associated artifacts have been made
permanently available for retrieval.
Author-created artifacts relevant to this paper have been placed on a publicly accessible archival repository. A DOI or link to this repository along with a unique identifier for the object is provided.
Repositories used to archive data should have a declared plan to enable permanent accessibility. Personal web pages are not acceptable for this purpose.
Artifacts do not need to have been formally
evaluated in order for an article to receive this badge.
ACM Badging: Results Validated
This badge is applied to papers in which the main results of the
paper have been successfully obtained by a person or team
other than the author.
Results Replicated
The main results of the paper have been obtained in a subsequent
study by a person or team other than the authors, using, in part,
artifacts provided by the author.
Results Reproduced
The main results of the paper have been independently obtained in a
subsequent study by a person or team other than the authors,
without the use of author-supplied artifacts.
Exact replication or reproduction of results is not required,
or even expected. In particular, differences in the results
should not change the main claims made in the paper.
SIGIR Badging Task Force
Nicola Ferro (chair)
University of Padua, Italy
Diane Kelly (ex-officio)
The University of Tennessee,
Knoxville, USA
Maarten de Rijke (ex-officio)
University of Amsterdam, The
Netherlands
Leif Azzopardi
University of Strathclyde, UK
Peter Bailey
Microsoft, USA
Hannah Bast
University of Freiburg, Germany
Rob Capra
University of North Carolina at
Chapel Hill, USA
Norbert Fuhr
University of Duisburg-Essen,
Germany
Yiqun Liu
Tsinghua University, China
Martin Potthast
Leipzig University, Germany
Filip Radlinski
Google London, UK
Tetsuya Sakai
Waseda University, Japan
Ian Soboroff
National Institute of Standards and
Technology, USA
Arjen de Vries
Radboud University, The
Netherlands
More Related Content

What's hot

Getting the Most out of Transition-based Dependency Parsing
Getting the Most out of Transition-based Dependency ParsingGetting the Most out of Transition-based Dependency Parsing
Getting the Most out of Transition-based Dependency Parsing
Jinho Choi
 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database Virtualization
Boris Glavic
 
Ipaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanIpaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, Ian
Boris Glavic
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Paolo Missier
 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrier
Crai Macdonald
 
Introduction to R for Data Science :: Session 6 [Linear Regression in R]
Introduction to R for Data Science :: Session 6 [Linear Regression in R] Introduction to R for Data Science :: Session 6 [Linear Regression in R]
Introduction to R for Data Science :: Session 6 [Linear Regression in R]
Goran S. Milovanovic
 
Detecting common scientific workflow fragments using templates and execution ...
Detecting common scientific workflow fragments using templates and execution ...Detecting common scientific workflow fragments using templates and execution ...
Detecting common scientific workflow fragments using templates and execution ...dgarijo
 
Introduction to R for Data Science :: Session 2
Introduction to R for Data Science :: Session 2Introduction to R for Data Science :: Session 2
Introduction to R for Data Science :: Session 2
Goran S. Milovanovic
 
A Survey of Entity Ranking over RDF Graphs
A Survey of Entity Ranking over RDF GraphsA Survey of Entity Ranking over RDF Graphs
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
Adnan Akhter
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data
Surya Saha
 
Webinar: What's New in Pipeline Pilot 8.5 Collection Update 1?
Webinar: What's New in Pipeline Pilot 8.5 Collection Update 1?Webinar: What's New in Pipeline Pilot 8.5 Collection Update 1?
Webinar: What's New in Pipeline Pilot 8.5 Collection Update 1?
BIOVIA
 
Heaven: Supporting Systematic Comparative Research of RDF Stream Processing E...
Heaven: Supporting Systematic Comparative Research of RDF Stream Processing E...Heaven: Supporting Systematic Comparative Research of RDF Stream Processing E...
Heaven: Supporting Systematic Comparative Research of RDF Stream Processing E...
Riccardo Tommasini
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
Roi Blanco
 
Deep API Learning (FSE 2016)
Deep API Learning (FSE 2016)Deep API Learning (FSE 2016)
Deep API Learning (FSE 2016)
Sung Kim
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
Samuel Bosch
 
Rob Davidson: Using Galaxy for Metabolomics
Rob Davidson: Using Galaxy for MetabolomicsRob Davidson: Using Galaxy for Metabolomics
Rob Davidson: Using Galaxy for Metabolomics
GigaScience, BGI Hong Kong
 
Reverted Indexing for Expansion and Feedback
Reverted Indexing for Expansion and FeedbackReverted Indexing for Expansion and Feedback
Reverted Indexing for Expansion and Feedback
Gene Golovchinsky
 
Icml2018 naver review
Icml2018 naver reviewIcml2018 naver review
Icml2018 naver review
NAVER Engineering
 
search engine
search enginesearch engine
search engine
Musaib Khan
 

What's hot (20)

Getting the Most out of Transition-based Dependency Parsing
Getting the Most out of Transition-based Dependency ParsingGetting the Most out of Transition-based Dependency Parsing
Getting the Most out of Transition-based Dependency Parsing
 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database Virtualization
 
Ipaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanIpaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, Ian
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrier
 
Introduction to R for Data Science :: Session 6 [Linear Regression in R]
Introduction to R for Data Science :: Session 6 [Linear Regression in R] Introduction to R for Data Science :: Session 6 [Linear Regression in R]
Introduction to R for Data Science :: Session 6 [Linear Regression in R]
 
Detecting common scientific workflow fragments using templates and execution ...
Detecting common scientific workflow fragments using templates and execution ...Detecting common scientific workflow fragments using templates and execution ...
Detecting common scientific workflow fragments using templates and execution ...
 
Introduction to R for Data Science :: Session 2
Introduction to R for Data Science :: Session 2Introduction to R for Data Science :: Session 2
Introduction to R for Data Science :: Session 2
 
A Survey of Entity Ranking over RDF Graphs
A Survey of Entity Ranking over RDF GraphsA Survey of Entity Ranking over RDF Graphs
A Survey of Entity Ranking over RDF Graphs
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data
 
Webinar: What's New in Pipeline Pilot 8.5 Collection Update 1?
Webinar: What's New in Pipeline Pilot 8.5 Collection Update 1?Webinar: What's New in Pipeline Pilot 8.5 Collection Update 1?
Webinar: What's New in Pipeline Pilot 8.5 Collection Update 1?
 
Heaven: Supporting Systematic Comparative Research of RDF Stream Processing E...
Heaven: Supporting Systematic Comparative Research of RDF Stream Processing E...Heaven: Supporting Systematic Comparative Research of RDF Stream Processing E...
Heaven: Supporting Systematic Comparative Research of RDF Stream Processing E...
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
 
Deep API Learning (FSE 2016)
Deep API Learning (FSE 2016)Deep API Learning (FSE 2016)
Deep API Learning (FSE 2016)
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Rob Davidson: Using Galaxy for Metabolomics
Rob Davidson: Using Galaxy for MetabolomicsRob Davidson: Using Galaxy for Metabolomics
Rob Davidson: Using Galaxy for Metabolomics
 
Reverted Indexing for Expansion and Feedback
Reverted Indexing for Expansion and FeedbackReverted Indexing for Expansion and Feedback
Reverted Indexing for Expansion and Feedback
 
Icml2018 naver review
Icml2018 naver reviewIcml2018 naver review
Icml2018 naver review
 
search engine
search enginesearch engine
search engine
 

Similar to ntcir14centre-overview

Efficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesEfficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesRui Vieira
 
Apsec 2014 Presentation
Apsec 2014 PresentationApsec 2014 Presentation
Apsec 2014 Presentation
Ahrim Han, Ph.D.
 
Uk Research Infrastructure Workshop E-infrastructure Juan Bicarregui
Uk Research Infrastructure Workshop E-infrastructure Juan BicarreguiUk Research Infrastructure Workshop E-infrastructure Juan Bicarregui
Uk Research Infrastructure Workshop E-infrastructure Juan Bicarregui
Innovate UK
 
Analysis of Educational Robotics activities using a machine learning approach
Analysis of Educational Robotics activities using a machine learning approachAnalysis of Educational Robotics activities using a machine learning approach
Analysis of Educational Robotics activities using a machine learning approach
Lorenzo Cesaretti
 
VL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
VL/HCC 2014 - A Longitudinal Study of Programmers' BacktrackingVL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
VL/HCC 2014 - A Longitudinal Study of Programmers' BacktrackingYoungSeok Yoon
 
Hydraulics Team Full-Technical Lab Report
Hydraulics Team Full-Technical Lab ReportHydraulics Team Full-Technical Lab Report
Hydraulics Team Full-Technical Lab Report
Alfonso Figueroa
 
List from 1 – 10 1. Chanel Black Dress 2. Women’s suit 3. .docx
List from 1 – 10 1. Chanel Black Dress 2. Women’s suit 3. .docxList from 1 – 10 1. Chanel Black Dress 2. Women’s suit 3. .docx
List from 1 – 10 1. Chanel Black Dress 2. Women’s suit 3. .docx
croysierkathey
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
Paolo Missier
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
ISSEL
 
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltc
NAVER Engineering
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Paolo Missier
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...
Rakebul Hasan
 
8th sem (1)
8th sem (1)8th sem (1)
8th sem (1)
IdiotJackveer
 
Multi-method Evaluation in Scientific Paper Recommender Systems
Multi-method Evaluation in Scientific Paper Recommender SystemsMulti-method Evaluation in Scientific Paper Recommender Systems
Multi-method Evaluation in Scientific Paper Recommender Systems
Aravind Sesagiri Raamkumar
 
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
krisztianbalog
 
Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram
Praveen Penumathsa
 
Learning by example: training users through high-quality query suggestions
Learning by example: training users through high-quality query suggestionsLearning by example: training users through high-quality query suggestions
Learning by example: training users through high-quality query suggestions
Claudia Hauff
 
Dolap13 v9 7.docx
Dolap13 v9 7.docxDolap13 v9 7.docx
Dolap13 v9 7.docx
Hassan Nazih
 
Object Oriented Programming Lab Manual
Object Oriented Programming Lab Manual Object Oriented Programming Lab Manual
Object Oriented Programming Lab Manual
Abdul Hannan
 
Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt
Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.pptProto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt
Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt
AnirbanBhar3
 

Similar to ntcir14centre-overview (20)

Efficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesEfficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databases
 
Apsec 2014 Presentation
Apsec 2014 PresentationApsec 2014 Presentation
Apsec 2014 Presentation
 
Uk Research Infrastructure Workshop E-infrastructure Juan Bicarregui
Uk Research Infrastructure Workshop E-infrastructure Juan BicarreguiUk Research Infrastructure Workshop E-infrastructure Juan Bicarregui
Uk Research Infrastructure Workshop E-infrastructure Juan Bicarregui
 
Analysis of Educational Robotics activities using a machine learning approach
Analysis of Educational Robotics activities using a machine learning approachAnalysis of Educational Robotics activities using a machine learning approach
Analysis of Educational Robotics activities using a machine learning approach
 
VL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
VL/HCC 2014 - A Longitudinal Study of Programmers' BacktrackingVL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
VL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
 
Hydraulics Team Full-Technical Lab Report
Hydraulics Team Full-Technical Lab ReportHydraulics Team Full-Technical Lab Report
Hydraulics Team Full-Technical Lab Report
 
List from 1 – 10 1. Chanel Black Dress 2. Women’s suit 3. .docx
List from 1 – 10 1. Chanel Black Dress 2. Women’s suit 3. .docxList from 1 – 10 1. Chanel Black Dress 2. Women’s suit 3. .docx
List from 1 – 10 1. Chanel Black Dress 2. Women’s suit 3. .docx
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
 
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltc
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...
 
8th sem (1)
8th sem (1)8th sem (1)
8th sem (1)
 
Multi-method Evaluation in Scientific Paper Recommender Systems
Multi-method Evaluation in Scientific Paper Recommender SystemsMulti-method Evaluation in Scientific Paper Recommender Systems
Multi-method Evaluation in Scientific Paper Recommender Systems
 
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
 
Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram
 
Learning by example: training users through high-quality query suggestions
Learning by example: training users through high-quality query suggestionsLearning by example: training users through high-quality query suggestions
Learning by example: training users through high-quality query suggestions
 
Dolap13 v9 7.docx
Dolap13 v9 7.docxDolap13 v9 7.docx
Dolap13 v9 7.docx
 
Object Oriented Programming Lab Manual
Object Oriented Programming Lab Manual Object Oriented Programming Lab Manual
Object Oriented Programming Lab Manual
 
Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt
Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.pptProto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt
Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt
 

More from Tetsuya Sakai

sigir2020
sigir2020sigir2020
sigir2020
Tetsuya Sakai
 
ipsjifat201909
ipsjifat201909ipsjifat201909
ipsjifat201909
Tetsuya Sakai
 
sigir2019
sigir2019sigir2019
sigir2019
Tetsuya Sakai
 
assia2019
assia2019assia2019
assia2019
Tetsuya Sakai
 
evia2019
evia2019evia2019
evia2019
Tetsuya Sakai
 
ecir2019tutorial-finalised
ecir2019tutorial-finalisedecir2019tutorial-finalised
ecir2019tutorial-finalised
Tetsuya Sakai
 
ecir2019tutorial
ecir2019tutorialecir2019tutorial
ecir2019tutorial
Tetsuya Sakai
 
WSDM2019tutorial
WSDM2019tutorialWSDM2019tutorial
WSDM2019tutorial
Tetsuya Sakai
 
sigir2018tutorial
sigir2018tutorialsigir2018tutorial
sigir2018tutorial
Tetsuya Sakai
 
Evia2017unanimity
Evia2017unanimityEvia2017unanimity
Evia2017unanimity
Tetsuya Sakai
 
Evia2017assessors
Evia2017assessorsEvia2017assessors
Evia2017assessors
Tetsuya Sakai
 
Evia2017dialogues
Evia2017dialoguesEvia2017dialogues
Evia2017dialogues
Tetsuya Sakai
 
Evia2017wcw
Evia2017wcwEvia2017wcw
Evia2017wcw
Tetsuya Sakai
 
sigir2017bayesian
sigir2017bayesiansigir2017bayesian
sigir2017bayesian
Tetsuya Sakai
 
NL20161222invited
NL20161222invitedNL20161222invited
NL20161222invited
Tetsuya Sakai
 
AIRS2016
AIRS2016AIRS2016
AIRS2016
Tetsuya Sakai
 
Nl201609
Nl201609Nl201609
Nl201609
Tetsuya Sakai
 
ictir2016
ictir2016ictir2016
ictir2016
Tetsuya Sakai
 
ICTIR2016tutorial
ICTIR2016tutorialICTIR2016tutorial
ICTIR2016tutorial
Tetsuya Sakai
 
SIGIR2016
SIGIR2016SIGIR2016
SIGIR2016
Tetsuya Sakai
 

More from Tetsuya Sakai (20)

sigir2020
sigir2020sigir2020
sigir2020
 
ipsjifat201909
ipsjifat201909ipsjifat201909
ipsjifat201909
 
sigir2019
sigir2019sigir2019
sigir2019
 
assia2019
assia2019assia2019
assia2019
 
evia2019
evia2019evia2019
evia2019
 
ecir2019tutorial-finalised
ecir2019tutorial-finalisedecir2019tutorial-finalised
ecir2019tutorial-finalised
 
ecir2019tutorial
ecir2019tutorialecir2019tutorial
ecir2019tutorial
 
WSDM2019tutorial
WSDM2019tutorialWSDM2019tutorial
WSDM2019tutorial
 
sigir2018tutorial
sigir2018tutorialsigir2018tutorial
sigir2018tutorial
 
Evia2017unanimity
Evia2017unanimityEvia2017unanimity
Evia2017unanimity
 
Evia2017assessors
Evia2017assessorsEvia2017assessors
Evia2017assessors
 
Evia2017dialogues
Evia2017dialoguesEvia2017dialogues
Evia2017dialogues
 
Evia2017wcw
Evia2017wcwEvia2017wcw
Evia2017wcw
 
sigir2017bayesian
sigir2017bayesiansigir2017bayesian
sigir2017bayesian
 
NL20161222invited
NL20161222invitedNL20161222invited
NL20161222invited
 
AIRS2016
AIRS2016AIRS2016
AIRS2016
 
Nl201609
Nl201609Nl201609
Nl201609
 
ictir2016
ictir2016ictir2016
ictir2016
 
ICTIR2016tutorial
ICTIR2016tutorialICTIR2016tutorial
ICTIR2016tutorial
 
SIGIR2016
SIGIR2016SIGIR2016
SIGIR2016
 

Recently uploaded

zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 

Recently uploaded (20)

zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 

ntcir14centre-overview

  • 1. Overview of the NTCIR-14 CENTRE Task Tetsuya Sakai Nicola Ferro Ian Soboroff Zhaohao Zeng Peng Xiao Maria Maistro Waseda University, Japan University of Padua, Italy NIST, USA http://www.centre-eval.org/ Twitter: @_centre_ June 11 and 13, 2019@NTCIR-14, Tokyo.
  • 2. TALK OUTLINE 1. CENTRE@CLEF, NTCIR, TREC: motivations 2. CENTRE@NTCIR-14 task specifications 3. Additional relevance assessments 4. Evaluation measures 5. Replicability results 6. Reproducibility results 7. CENTRE@NTCIR-14 conclusions 8. Plans for NTCIR-15 9. Related efforts
  • 3. Motivation • Researcher A publishes a paper; says Algo X > Algo Y on test data D. ⇒ Researcher B tries Algo X and Y on D but finds Algo X < Algo Y. A replicability problem. • Researcher A publishes a paper; says Algo X > Algo Y on test data D. ⇒ Researcher B tries Algo X and Y on D’ but finds Algo X < Algo Y. A reproducibility problem. Can the IR community do better?
  • 4. Research questions • Can results from CLEF, NTCIR, and TREC be replicated/reproduced easily? If not, what are the difficulties? What needs to be improved? • Can results from (say) TREC be reproduced with (say) NTCIR data? Why or why not? • [Participants may have their own research questions]
  • 5. CENTRE = CLEF/NTCIR/TREC (replicability and) reproducibility CLEF Conferences and Labs of the Evaluation Forum NTCIR NII Testbeds and Community for Information access Research TREC Text REtrieval Conference Data, code, wisdom Replicability and Reproducibility experiments Data, code, wisdom Replicability and Reproducibility experiments Data, code, wisdom Replicability and Reproducibility experiments Europe and more; Annual US and more; Annual Asia and more; Sesquinnual
  • 6. Who should participate in CENTRE? • IR evaluation nerds • Students willing to replicate/reproduce SOTA methods in IR, so that they will eventually be able to propose their own methods and demonstrate their superiority over the SOTA • Those who want to be involved in more than one evaluation communities across the globe • Those who want to use data from more than one evaluation conferences
  • 7. TALK OUTLINE 1. CENTRE@CLEF, NTCIR, TREC: motivations 2. CENTRE@NTCIR-14 task specifications 3. Additional relevance assessments 4. Evaluation measures 5. Replicability results 6. Reproducibility results 7. CENTRE@NTCIR-14 conclusions 8. Plans for NTCIR-15 9. Related efforts
  • 8. CENTRE@NTCIR-14: How it works (1) Past CLEFs Past TRECs Past NTCIRs A, B: target run pairs A: advanced B: baseline A, B: target run pairs A, B: target run pairs IR literature A, B: target run pairs Selected by organisers Selected by each team
  • 9. CENTRE@NTCIR-14: How it works (2) Past CLEFs Past TRECs Past NTCIRs A, B: target run pairs T1{A,B}: replicated run pairs T1: replicability subtask (same NTCIR data) A: advanced B: baseline A, B: target run pairs A, B: target run pairs IR literature A, B: target run pairs Selected by organisers Selected by each team Evaluation measures: • RSME of the topicwise deltas • Pearson correlation of the topicwise deltas • Effect Ratio (ER)
  • 10. CENTRE@NTCIR-14: How it works (2) Past CLEFs Past TRECs Past NTCIRs A, B: target run pairs T1{A,B}: replicated run pairs T1: replicability subtask (same NTCIR data) A: advanced B: baseline A, B: target run pairs A, B: target run pairs T2: reproducibility subtask IR literature T2CLEF-{A,B} Run pairs reproduced on NTCIR data T2TREC-{A,B} T2OPEN-{A,B} A, B: target run pairs Selected by organisers Selected by each team Evaluation measure: Effect Ratio (ER) T2OPEN runs evaluated only with effectiveness measures Past CLEF runs not considered at CENTRE@NTCIR-14
  • 11. Target runs for CENTRE@NTCIR-14 • We focus on ad hoc monolingual web search. • For T1 (replicability): A pair of runs from the NTCIR-13 We Want Web task - Overview paper: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings13/pdf/ntcir/01-NTCIR13-OV-WWW- LuoC.pdf • For T2 (reproducibility): A pair of runs from the TREC 2013 web track - Overview paper: https://trec.nist.gov/pubs/trec22/papers/WEB.OVERVIEW.pdf • All of the above target runs use clueweb12 as the search corpus http://www.lemurproject.org/clueweb12.php/ • Both the target NTCIR and TREC runs are implemented using Indri https://www.lemurproject.org/indri/
  • 12. Target NTCIR runs and data for T1 (replicability) • Test collection: NTCIR-13 WWW English (clueweb12-B13, 100 topics) • Target NTCIR run pairs: From the NTCIR-13 WWW RMIT paper [Gallagher+17] http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings13/pdf/ntcir/02-NTCIR13-WWW-GallagherL.pdf A: RMIT-E-NU-Own-1 (sequential dependency model: SDM) B: RMIT-E-NU-Own-3 (full dependency model: FDM) • SDM and FDM are from [Metzler+Croft SIGIR05] https://doi.org/10.1145/1076034.1076115
  • 13. Target TREC runs and data for T2 (reproduciblity) • Test collection: TREC 2013 Web Track (clueweb12 category A, 50 topics) • Target TREC run pairs: From the TREC 2013 Web Track U Delaware paper [Yang+13] : https://trec.nist.gov/pubs/trec22/papers/udel_fang-web.pdf A: UDInfolabWEB2 (selects semantically related terms using Web-based working sets) B: UDInfolabWEB1 (selects semantically related terms using collection-based working sets) • The working set construction method is from [Fang+SIGIR06] https://doi.org/10.1145/1148170.1148193
  • 14. How participants were expected to submit T1 (replicability) runs 0. Obtain clueweb12-B13 (or the full data) http://www.lemurproject.org/clueweb12.php/ and Indri https://www.lemurproject.org/indri/ 1. Register to the CENTRE task and obtain the NTCIR-13 WWW topics and qrels from the CENTRE organisers 2. Read the NTCIR-13 WWW RMIT paper [Gallagher+17] http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings13/pdf/ntcir/02-NTCIR13-WWW-GallagherL.pdf and try to replicate the A-run and the B-run on the NTCIR-13 WWW-1 test collection A: RMIT-E-NU-Own-1 B: RMIT-E-NU-Own-3 3. Submit the replicated runs by the deadline
• 15. How participants were expected to submit T2TREC (reproducibility) runs 0. Obtain clueweb12-B13 (or the full data) http://www.lemurproject.org/clueweb12.php/ and (optionally) Indri https://www.lemurproject.org/indri/ 1. Register for the CENTRE task and obtain the NTCIR-13 WWW topics and qrels from the CENTRE organisers 2. Read the TREC 2013 Web Track U Delaware paper [Yang+13]: https://trec.nist.gov/pubs/trec22/papers/udel_fang-web.pdf and try to reproduce the A-run and the B-run on the NTCIR-13 WWW-1 test collection A: UDInfolabWEB2 B: UDInfolabWEB1 3. Submit the reproduced runs by the deadline
• 16. How participants were expected to submit T2OPEN (reproducibility) runs 0. Obtain clueweb12-B13 (or the full data) http://www.lemurproject.org/clueweb12.php/ and (optionally) Indri https://www.lemurproject.org/indri/ 1. Register for the CENTRE task and obtain the NTCIR-13 WWW topics and qrels from the CENTRE organisers 2. Pick any pair of runs A (advanced) and B (baseline) from the IR literature, and try to reproduce the A-run and the B-run on the NTCIR-13 WWW-1 test collection 3. Submit the reproduced runs by the deadline T2OPEN is not discussed further in this overview. For more, see: Andrew Yates: MPII at the NTCIR-14 CENTRE Task
• 17. Submission instructions • Each team can submit only one run pair per subtask (T1, T2TREC, T2OPEN). So at most 6 runs in total. • Run file format: same as WWW-1; see http://www.thuir.cn/ntcirwww/ (Run Submissions format). In short, it's a TREC run file plus a SYSDESC line. • Run file names: CENTRE1-<teamname>-<subtaskname>-[A,B] (A: advanced, B: baseline)
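Purely as an illustration of the above (the WWW-1 page linked here is authoritative; the exact SYSDESC syntax, the topic and document IDs, and the team name "MYTEAM" are assumptions), a T1 A-run file might look roughly like this:

    <SYSDESC>Replication of RMIT-E-NU-Own-1 (Indri SDM)</SYSDESC>
    0001 Q0 clueweb12-0000tw-05-12114 1 27.73 CENTRE1-MYTEAM-T1-A
    0001 Q0 clueweb12-0106wb-18-19516 2 25.31 CENTRE1-MYTEAM-T1-A
    0002 Q0 clueweb12-0003wb-22-00101 1 31.02 CENTRE1-MYTEAM-T1-A
    ...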
  • 18. TALK OUTLINE 1. CENTRE@CLEF, NTCIR, TREC: motivations 2. CENTRE@NTCIR-14 task specifications 3. Additional relevance assessments 4. Evaluation measures 5. Replicability results 6. Reproducibility results 7. CENTRE@NTCIR-14 conclusions 8. Plans for NTCIR-15 9. Related efforts
  • 19. Only one team participated…
• 20. Additional relevance assessments • The NTCIR-13 WWW-1 English qrels were expanded based on the new CENTRE runs (i.e. the 6 runs from MPII). • Depth-30 pools were created, following WWW-1. • 2,617 previously unjudged topic-doc pairs were assessed. • Original NTCIR-13 WWW E qrels: 100 topics, 22,912 topic-doc pairs, based on 13 runs from three teams. Augmented CENTRE@NTCIR-14 qrels: 100 topics, 22,912 + 2,617 = 25,529 topic-doc pairs.
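For concreteness, depth-k pool augmentation of this kind can be computed from the submitted run files roughly as follows. This is a minimal sketch assuming TREC-style run and qrels layouts; the file names, the qrels column order, and any tie-breaking details are assumptions, not the organisers' actual tooling.

    from collections import defaultdict

    DEPTH = 30  # depth-30 pooling, following WWW-1

    def read_run(path):
        """Return topic_id -> list of doc_ids in rank order (TREC run format)."""
        run = defaultdict(list)
        with open(path) as f:
            for line in f:
                if line.startswith("<"):   # skip the SYSDESC header line
                    continue
                topic, _q0, doc, rank, _score, _tag = line.split()[:6]
                run[topic].append((int(rank), doc))
        return {t: [d for _, d in sorted(ranked)] for t, ranked in run.items()}

    def read_judged(qrels_path):
        """Return the set of (topic_id, doc_id) pairs already judged."""
        judged = set()
        with open(qrels_path) as f:
            for line in f:
                topic, _iteration, doc = line.split()[:3]
                judged.add((topic, doc))
        return judged

    run_files = ["T1-A.run", "T1-B.run", "T2TREC-A.run",
                 "T2TREC-B.run", "T2OPEN-A.run", "T2OPEN-B.run"]  # hypothetical names
    judged = read_judged("www1-original.qrels")                    # hypothetical name

    new_pairs = set()
    for run in (read_run(p) for p in run_files):
        for topic, docs in run.items():
            for doc in docs[:DEPTH]:               # top 30 documents per topic per run
                if (topic, doc) not in judged:
                    new_pairs.add((topic, doc))    # previously unjudged -> send to assessors
    print(len(new_pairs), "new topic-doc pairs to judge")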
  • 22. TALK OUTLINE 1. CENTRE@CLEF, NTCIR, TREC: motivations 2. CENTRE@NTCIR-14 task specifications 3. Additional relevance assessments 4. Evaluation measures 5. Replicability results 6. Reproducibility results 7. CENTRE@NTCIR-14 conclusions 8. Plans for NTCIR-15 9. Related efforts
• 23. T1: examining the topicwise deltas • For each topic, the original topicwise improvement (A over B) is compared with the replicated topicwise improvement. • Per collection and measure, the comparison uses the RMSE (Root Mean Squared Error) of the differences between the two sets of deltas, and their Pearson correlation (covariance divided by the product of the standard deviations).
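In symbols (a sketch consistent with the slide's labels; the CENTRE overview paper gives the exact definitions), write $M_j(\cdot)$ for the score of a run on topic $j$ under the chosen measure, and let the original and replicated topicwise improvements be

\[ \Delta_j = M_j(A) - M_j(B), \qquad \Delta'_j = M_j(A') - M_j(B'), \]

where $A'$ and $B'$ are the replicated runs. Over the $n$ topics of the collection,

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{j=1}^{n} \bigl(\Delta'_j - \Delta_j\bigr)^2 }, \qquad r = \frac{\operatorname{cov}(\Delta, \Delta')}{\sigma_{\Delta}\,\sigma_{\Delta'}}. \]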
• 24. T1 and T2: Effect Ratio (1) • T1 (replicability): ER = replicated mean improvement / original mean improvement • T2TREC (reproducibility): ER = reproduced mean improvement / original mean improvement
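Using the same per-topic deltas as above, the Effect Ratio divides the mean improvement observed in the "rep" experiment by the mean improvement reported originally (again a sketch; see the overview paper for the exact definition):

\[ \mathrm{ER} = \frac{\frac{1}{n'}\sum_{j=1}^{n'} \Delta'_j}{\frac{1}{n}\sum_{j=1}^{n} \Delta_j}, \]

where for T1 (replicability) the numerator and denominator are computed over the same $n' = n$ NTCIR topics, while for T2TREC (reproducibility) the numerator is computed over the $n'$ NTCIR WWW-1 topics and the denominator over the $n$ original TREC 2013 Web Track topics.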
• 25. T1 and T2: Effect Ratio (2) • ER compares the sample (unstandardised) effect size in the “rep” experiment against that in the original experiment. “Can we still observe the same effect?” • ER>1 ⇒ effect larger in “rep” experiment • ER=1 ⇒ effect identical to original experiment • 0 < ER < 1 ⇒ effect smaller in “rep” experiment • ER ≤ 0 ⇒ FAILURE: A-run does not outperform B-run in the “rep” experiment
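A minimal Python sketch of the three quantities, given the per-topic deltas as plain lists (in practice these would be read from the per-topic output of an evaluation tool such as NTCIREVAL; the example numbers below are made up):

    import math

    def rmse(orig, repl):
        """Root Mean Squared Error between original and replicated per-topic deltas."""
        return math.sqrt(sum((r - o) ** 2 for o, r in zip(orig, repl)) / len(orig))

    def pearson(xs, ys):
        """Pearson correlation between the two sets of per-topic deltas."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
        sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
        return cov / (sx * sy)

    def effect_ratio(orig, rep):
        """Mean 'rep' improvement over mean original improvement; topic sets may differ (T2)."""
        return (sum(rep) / len(rep)) / (sum(orig) / len(orig))

    orig_deltas = [0.10, 0.05, -0.02]   # made-up per-topic A-minus-B deltas (original)
    rep_deltas  = [0.08, 0.07, -0.01]   # made-up per-topic deltas ("rep" experiment)
    print(rmse(orig_deltas, rep_deltas), pearson(orig_deltas, rep_deltas),
          effect_ratio(orig_deltas, rep_deltas))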
  • 26. TALK OUTLINE 1. CENTRE@CLEF, NTCIR, TREC: motivations 2. CENTRE@NTCIR-14 task specifications 3. Additional relevance assessments 4. Evaluation measures 5. Replicability results 6. Reproducibility results 7. CENTRE@NTCIR-14 conclusions 8. Plans for NTCIR-15 9. Related efforts
  • 27. Replicability: w/o additional assessments (1) • RMIT said “SDM (A) outperforms FDM (B)” • MPII confirms this! • A statistically significantly outperforms B even with nERR! • Effect sizes for nERR (Orig vs Repli) quite similar.
• 28. Replicability: w/o additional assessments (2) • ER for nERR close to 1 • Correlation (Orig Δ vs. Repli Δ) for nERR statistically significant
• 29. Replicability: WITH additional assessments • ER for nERR close to 1 • Correlation (Orig Δ vs. Repli Δ) for nERR statistically significant
• 30. nERR: r = 0.2610, 95% CI [0.0680, 0.4351]. [Scatter plot of per-topic Orig Δ vs. Repli Δ; e.g. one topic with Orig Δ ≒ 0.4 but Repli Δ ≒ 0.8 (A>B), another with Orig Δ ≒ -0.3 but Repli Δ ≒ -0.6 (A<B).] But replicability not so successful at the topic level.
  • 31. TALK OUTLINE 1. CENTRE@CLEF, NTCIR, TREC: motivations 2. CENTRE@NTCIR-14 task specifications 3. Additional relevance assessments 4. Evaluation measures 5. Replicability results 6. Reproducibility results 7. CENTRE@NTCIR-14 conclusions 8. Plans for NTCIR-15 9. Related efforts
  • 32. Reproducibility: w/o additional assessments (1) • UDel said “web-based working sets (A) outperform collection-based working sets (B)” for TREC • MPII confirms this for NTCIR • A statistically significantly outperforms B even with Q! • Effect sizes for Q (Orig vs Repli) quite similar.
  • 33. Reproducibility: w/o additional assessments (2) ER for Q close to 1
• 34. Reproducibility: WITH additional assessments • Larger ER values! • Smaller p-values! • New relevant docs were found by the CENTRE runs; addition worthwhile!
  • 35. TALK OUTLINE 1. CENTRE@CLEF, NTCIR, TREC: motivations 2. CENTRE@NTCIR-14 task specifications 3. Additional relevance assessments 4. Evaluation measures 5. Replicability results 6. Reproducibility results 7. CENTRE@NTCIR-14 conclusions 8. Plans for NTCIR-15 9. Related efforts
• 36. CENTRE@NTCIR-14 conclusions • We focussed on: “Can an effect (gain of A over B) be replicated/reproduced?” • Replication of the RMIT effect on WWW-1 data: successful at the ER level; not so successful at the topic level. • Reproduction of the TREC UDel effect on WWW-1 data: successful at the ER level; additional relevance assessments further improved the ERs. • Only one participating team… this needs to change.
  • 37. SPECIAL THANKS • To Andrew Yates (Max Planck Institute for Informatics, Germany) • MPII at the NTCIR-14 CENTRE Task
  • 38. TALK OUTLINE 1. CENTRE@CLEF, NTCIR, TREC: motivations 2. CENTRE@NTCIR-14 task specifications 3. Additional relevance assessments 4. Evaluation measures 5. Replicability results 6. Reproducibility results 7. CENTRE@NTCIR-14 conclusions 8. Plans for NTCIR-15 9. Related efforts
• 39. WWW and CENTRE join forces • WWW-3 task will have three subtasks: (1) Adhoc Chinese (2) Adhoc English (WWW-2 + new WWW-3 topics) (3) CENTRE (WWW-2 + new WWW-3 topics) • CENTRE will accept two run types: (a) Revived runs from WWW-2 (Tsinghua and RUC) (b) REP (REPlicated and REProduced) runs: Tsinghua and RUC algorithms re-implemented by other teams
• 40. Measuring progress: WWW-2 top run → WWW-3 top run. [Diagram: WWW-2 target run 1 (Team A) and WWW-2 target run 2 (Team B), processed on the WWW-2 topics, are revived by the same teams as WWW-3 CENTRE revived run 1 and revived run 2, processed on the new WWW-3 topics, and compared with a WWW-3 Adhoc run (Team D).] COMPARE using the WWW-3 topics (since WWW-3 systems may be tuned with WWW-2 topics).
• 41. Measuring replicability on WWW-2 topics. [Diagram: WWW-2 target run 1 (Team A) and target run 2 (Team B) are replicated/reproduced by a different team as WWW-3 CENTRE REP run 1-1 and REP run 1-2 (Team C).] COMPARE using the WWW-2 topics.
• 42. Measuring reproducibility: WWW-2 topics → WWW-3 topics. [Diagram: WWW-3 CENTRE REP run 1-1 and REP run 1-2 (Team C), replicated/reproduced from the WWW-2 target runs by a different team, are processed on both topic sets.] COMPARE using the WWW-2 topics and the WWW-3 topics.
  • 43. TALK OUTLINE 1. CENTRE@CLEF, NTCIR, TREC: motivations 2. CENTRE@NTCIR-14 task specifications 3. Additional relevance assessments 4. Evaluation measures 5. Replicability results 6. Reproducibility results 7. CENTRE@NTCIR-14 conclusions 8. Plans for NTCIR-15 9. Related efforts
  • 44. CENTRE@CLEF 2019 CLEF NTCIR TREC REproducibility Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, and Ian Soboroff April 15, 2019, Cologne http://www.centre-eval.org/clef2019/
• 45. CENTRE@CLEF 2019 Tasks The CENTRE@CLEF 2019 lab will offer three tasks: • Task 1 - Replicability: the task will focus on the replicability of selected methods on the same experimental collections; • Task 2 - Reproducibility: the task will focus on the reproducibility of selected methods on different experimental collections; • Task 3 - Generalizability: the task will focus on collection performance prediction, and the goal is to rank (sub-)collections on the basis of the expected performance over them.
  • 46. Task 1 & 2: Selected Papers For Task 1 - Replicability and Task 2 - Reproducibility we selected the following papers from TREC Common Core Track 2017 and 2018: • [Grossman et al, 2017] Grossman, M. R., and Cormack, G. V. (2017). MRG_UWaterloo and Waterloo Cormack Participation in the TREC 2017 Common Core Track. In TREC 2017. • [Benham et al, 2018] Benham, R., Gallagher, L., Mackenzie, J., Liu, B., Lu, X., Scholer, F., Moffat, A., and Culpepper, J. S. (2018). RMIT at the 2018 TREC CORE Track. In TREC 2018.
• 47. Task 3 - Generalizability • Training: participants run plain BM25 and identify features of the corpus and topics that allow them to predict the system score in terms of Average Precision (AP). • Validation: participants can validate their method and determine which set of features is the best choice for predicting the AP score of each system. • Test (submission): participants submit a run for each system (BM25 and their own system) and an additional file (one per system) containing the AP score predicted for each topic.
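To make the phases concrete, here is a minimal sketch of the prediction step only, assuming the corpus/topic features and the per-topic AP scores of BM25 on the training (sub-)collections have already been computed; the feature values, the ridge-regression model, and the array shapes are illustrative assumptions, not part of the task definition.

    import numpy as np
    from sklearn.linear_model import Ridge

    # Training data: one row per topic of the training (sub-)collections.
    # Columns: corpus/topic features (e.g. query length, mean query-term IDF, corpus size).
    X_train = np.array([[3, 7.2, 1.0e6],
                        [5, 5.8, 1.0e6],
                        [2, 9.1, 1.0e6]], dtype=float)
    y_train = np.array([0.31, 0.12, 0.45])   # per-topic AP of plain BM25 (made-up values)

    model = Ridge(alpha=1.0).fit(X_train, y_train)

    # Test phase: predict the AP score for each topic of the unseen (sub-)collection,
    # then write the predictions to the additional per-system submission file.
    X_test = np.array([[4, 6.5, 2.0e6]], dtype=float)
    print(model.predict(X_test))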
  • 48. SIGIR 2019 Open-Source IR Replicability Challenge (OSIRRC 2019) • Repeatability, replicability, and reproducibility: • Good science! • Helps sustain cumulative empirical progress in the field • OSIRRC is a workshop at SIGIR with three goals: • Develop a common Docker interface to capture ad hoc retrieval experiments • Build a library of Docker images that implements the interface • Explore this approach to additional tasks and evaluation methodologies
  • 49. OSIRRC 2019: Current Status • We’ve converged on a Docker specification for ad hoc retrieval experiments – called “the jig” • All materials at https://github.com/osirrc • Initial library of 12 images from 9 different groups under development • It’s not too late if you still want to participate! • Wrapping an existing retrieval system in “the jig” isn’t too difficult a task… • See you at SIGIR 2019!
• 50. ACM Artifact Badging - SIGIR Survey. Nicola Ferro (@frrncl), University of Padua, Italy. The 3rd Strategic Workshop on Information Retrieval in Lorne (SWIRL III), 14 February 2018, Lorne, Australia.
• 51. ACM Badging: Artifacts Evaluated • This badge is applied to papers whose associated artifacts have successfully completed an independent audit. Artifacts need not be made publicly available to be considered for this badge. However, they do need to be made available to reviewers. • Artifacts Evaluated – Functional: The artifacts associated with the research are found to be documented, consistent, complete, exercisable, and include appropriate evidence of verification and validation. • Artifacts Evaluated – Reusable: The artifacts associated with the paper are of a quality that significantly exceeds minimal functionality. That is, they have all the qualities of the Artifacts Evaluated – Functional level, but, in addition, they are very carefully documented and well-structured to the extent that reuse and repurposing is facilitated. In particular, norms and standards of the research community for artifacts of this type are strictly adhered to.
• 52. ACM Badging: Artifacts Available • This badge is applied to papers in which associated artifacts have been made permanently available for retrieval. • Author-created artifacts relevant to this paper have been placed on a publicly accessible archival repository. A DOI or link to this repository along with a unique identifier for the object is provided. • Repositories used to archive data should have a declared plan to enable permanent accessibility. Personal web pages are not acceptable for this purpose. • Artifacts do not need to have been formally evaluated in order for an article to receive this badge.
• 53. ACM Badging: Results Validated • This badge is applied to papers in which the main results of the paper have been successfully obtained by a person or team other than the author. • Results Replicated: The main results of the paper have been obtained in a subsequent study by a person or team other than the authors, using, in part, artifacts provided by the author. • Results Reproduced: The main results of the paper have been independently obtained in a subsequent study by a person or team other than the authors, without the use of author-supplied artifacts. • Exact replication or reproduction of results is not required, or even expected. In particular, differences in the results should not change the main claims made in the paper.
• 54. SIGIR Badging Task Force Nicola Ferro (chair) University of Padua, Italy Diane Kelly (ex-officio) The University of Tennessee, Knoxville, USA Maarten de Rijke (ex-officio) University of Amsterdam, The Netherlands Leif Azzopardi University of Strathclyde, UK Peter Bailey Microsoft, USA Hannah Bast University of Freiburg, Germany Rob Capra University of North Carolina at Chapel Hill, USA Norbert Fuhr University of Duisburg-Essen, Germany Yiqun Liu Tsinghua University, China Martin Potthast Leipzig University, Germany Filip Radlinski Google London, UK Tetsuya Sakai Waseda University, Japan Ian Soboroff National Institute of Standards and Technology, USA Arjen de Vries Radboud University, The Netherlands