Overview of the
NTCIR-14 CENTRE Task
Tetsuya Sakai Nicola Ferro Ian Soboroff
Zhaohao Zeng Peng Xiao Maria Maistro
Waseda University, Japan
University of Padua, Italy
NIST, USA
http://www.centre-eval.org/
Twitter: @_centre_
June 11 and 13, 2019 @ NTCIR-14, Tokyo.
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
Motivation
• Researcher A publishes a paper; says
Algo X > Algo Y on test data D.
⇒ Researcher B tries Algo X and Y on D but finds
Algo X < Algo Y. A replicability problem.
• Researcher A publishes a paper; says
Algo X > Algo Y on test data D.
⇒ Researcher B tries Algo X and Y on D’ but finds
Algo X < Algo Y. A reproducibility problem.
Can the IR community do better?
Research questions
• Can results from CLEF, NTCIR, and TREC be
replicated/reproduced easily? If not, what are the
difficulties? What needs to be improved?
• Can results from (say) TREC be reproduced with
(say) NTCIR data? Why or why not?
• [Participants may have their own research
questions]
CENTRE = CLEF/NTCIR/TREC
(replicability and) reproducibility
• CLEF: Conferences and Labs of the Evaluation Forum (Europe and more; annual)
• NTCIR: NII Testbeds and Community for Information access Research (Asia and more; sesquiennial)
• TREC: Text REtrieval Conference (US and more; annual)
Each forum provides data, code, and wisdom for the
replicability and reproducibility experiments.
Who should participate in CENTRE?
• IR evaluation nerds
• Students willing to replicate/reproduce SOTA
methods in IR, so that they will eventually be able
to propose their own methods and demonstrate
their superiority over the SOTA
• Those who want to be involved in more than one
evaluation community across the globe
• Those who want to use data from more than one
evaluation conference
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
CENTRE@NTCIR-14: How it works (1)
[Diagram] Target run pairs A (advanced) and B (baseline) come from
past CLEFs, past TRECs, and past NTCIRs (selected by the organisers),
or from the IR literature (selected by each team).
CENTRE@NTCIR-14: How it works (2)
[Diagram] T1: replicability subtask (same NTCIR data).
Participants replicate the target NTCIR run pair A (advanced) and B (baseline),
producing the replicated run pair T1{A,B}. The other target run pairs
(past CLEFs, past TRECs, IR literature) are as in the previous figure.
Evaluation measures:
• RMSE of the topicwise deltas
• Pearson correlation of the topicwise deltas
• Effect Ratio (ER)
CENTRE@NTCIR-14: How it works (2)
[Diagram] T2: reproducibility subtask.
Target run pairs A (advanced) and B (baseline) from past CLEFs and past TRECs
(selected by the organisers) and from the IR literature (selected by each team)
are reproduced on NTCIR data, yielding the run pairs
T2CLEF-{A,B}, T2TREC-{A,B}, and T2OPEN-{A,B}.
The T1 replicability subtask (same NTCIR data, replicated run pair T1{A,B})
is as in the previous figure.
Evaluation measure: Effect Ratio (ER)
T2OPEN runs evaluated only with effectiveness measures
Past CLEF runs not considered at CENTRE@NTCIR-14
Target runs for CENTRE@NTCIR-14
• We focus on ad hoc monolingual web search.
• For T1 (replicability): A pair of runs from the NTCIR-13 We Want Web
task
- Overview paper:
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings13/pdf/ntcir/01-NTCIR13-OV-WWW-
LuoC.pdf
• For T2 (reproducibility): A pair of runs from the TREC 2013 Web Track
- Overview paper:
https://trec.nist.gov/pubs/trec22/papers/WEB.OVERVIEW.pdf
• All of the above target runs use clueweb12 as the search corpus
http://www.lemurproject.org/clueweb12.php/
• Both the target NTCIR and TREC runs are implemented using Indri
https://www.lemurproject.org/indri/
Target NTCIR runs and data
for T1 (replicability)
• Test collection:
NTCIR-13 WWW English (clueweb12-B13, 100 topics)
• Target NTCIR run pairs:
From the NTCIR-13 WWW RMIT paper [Gallagher+17]
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings13/pdf/ntcir/02-NTCIR13-WWW-GallagherL.pdf
A: RMIT-E-NU-Own-1 (sequential dependence model: SDM)
B: RMIT-E-NU-Own-3 (full dependence model: FDM)
• SDM and FDM are from [Metzler+Croft SIGIR05]
https://doi.org/10.1145/1076034.1076115
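As an illustration (not taken from the RMIT paper), here is a minimal sketch of how an SDM-style query can be expressed in the Indri query language; the weights 0.85/0.10/0.05 are commonly cited defaults and are an assumption here, and the exact configuration of RMIT-E-NU-Own-1 is described in [Gallagher+17].

```python
# Minimal sketch (not the exact RMIT configuration): build an Indri
# query-language string for the sequential dependence model (SDM).
# The weights 0.85/0.10/0.05 are an assumption, not the RMIT settings.

def sdm_query(terms, w_t=0.85, w_o=0.10, w_u=0.05):
    """Build an Indri #weight query implementing SDM for the given terms."""
    unigrams = " ".join(terms)
    bigrams = [f"{a} {b}" for a, b in zip(terms, terms[1:])]
    if not bigrams:                                          # single-term query
        return f"#combine({unigrams})"
    ordered = " ".join(f"#1({bg})" for bg in bigrams)        # exact ordered pairs
    unordered = " ".join(f"#uw8({bg})" for bg in bigrams)    # unordered window of 8
    return (f"#weight( {w_t} #combine({unigrams}) "
            f"{w_o} #combine({ordered}) "
            f"{w_u} #combine({unordered}) )")

print(sdm_query(["information", "retrieval", "evaluation"]))
```

FDM differs in that it models dependencies among all query terms rather than only adjacent ones.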
Target TREC runs and data
for T2 (reproducibility)
• Test collection:
TREC 2013 Web Track (clueweb12 category A, 50 topics)
• Target TREC run pairs:
From the TREC 2013 Web Track U Delaware paper [Yang+13] :
https://trec.nist.gov/pubs/trec22/papers/udel_fang-web.pdf
A: UDInfolabWEB2 (selects semantically related terms using
Web-based working sets)
B: UDInfolabWEB1 (selects semantically related terms using
collection-based working sets)
• The working set construction method is from
[Fang+SIGIR06]
https://doi.org/10.1145/1148170.1148193
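As a purely generic illustration (not the exact [Fang+SIGIR06] term-weighting formula), the sketch below shows the one moving part that distinguishes the two runs: where the working set of documents comes from. The helper name and the frequency-based scoring are hypothetical.

```python
# Generic sketch of working-set-based query expansion (hypothetical scoring,
# not the [Fang+SIGIR06] formula). The A-run and B-run differ only in how the
# working set is obtained: from a web search (A) or from the collection (B).
from collections import Counter

def select_expansion_terms(query_terms, working_set, k=10):
    """Pick k terms that are frequent in the working set but absent from the query."""
    query = {t.lower() for t in query_terms}
    counts = Counter()
    for doc_text in working_set:              # working_set: list of document strings
        for term in doc_text.lower().split():
            if term not in query:
                counts[term] += 1
    return [term for term, _ in counts.most_common(k)]

# Usage idea: working_set for the A-run = top documents from a web search engine;
#             working_set for the B-run = top documents retrieved from the collection.
```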
How participants were expected to
submit T1 (replicability) runs
0. Obtain clueweb12-B13 (or the full data)
http://www.lemurproject.org/clueweb12.php/
and Indri https://www.lemurproject.org/indri/
1. Register for the CENTRE task and obtain the NTCIR-13 WWW
topics and qrels from the CENTRE organisers
2. Read the NTCIR-13 WWW RMIT paper [Gallagher+17]
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings13/pdf/ntcir/02-NTCIR13-WWW-GallagherL.pdf
and try to replicate the A-run and the B-run on the NTCIR-13
WWW-1 test collection
A: RMIT-E-NU-Own-1
B: RMIT-E-NU-Own-3
3. Submit the replicated runs by the deadline
How participants were expected to
submit T2TREC (reproducibility) runs
0. Obtain clueweb12-B13 (or the full data)
http://www.lemurproject.org/clueweb12.php/
and (optionally) Indri https://www.lemurproject.org/indri/
1. Register for the CENTRE task and obtain the NTCIR-13 WWW topics and
qrels from the CENTRE organisers
2. Read the TREC 2013 Web Track U Delaware paper [Yang+13] :
https://trec.nist.gov/pubs/trec22/papers/udel_fang-web.pdf
and try to reproduce the A-run and the B-run on the NTCIR-13 WWW-1
test collection
A: UDInfolabWEB2
B: UDInfolabWEB1
3. Submit the reproduced runs by the deadline
How participants were expected to
submit T2OPEN (reproducibility) runs
0. Obtain clueweb12-B13 (or the full data)
http://www.lemurproject.org/clueweb12.php/
and (optionally) Indri https://www.lemurproject.org/indri/
1. Register for the CENTRE task and obtain the NTCIR-13 WWW topics and
qrels from the CENTRE organisers
2. Pick any pair of runs A (advanced) and B (baseline) from the IR literature,
and try to reproduce the A-run and the B-run on the NTCIR-13 WWW-1
test collection
3. Submit the reproduced runs by the deadline
T2OPEN not discussed further in this overview.
For more, see:
Andrew Yates: MPII at the NTCIR-14 CENTRE Task
Submission instructions
• Each team can submit only one run pair per subtask
(T1, T2TREC, T2OPEN). So at most 6 runs in total.
• Run file format: same as WWW-1; see
http://www.thuir.cn/ntcirwww/ (Run Submissions
format). In short, it is a TREC run file plus a SYSDESC
line (see the sketch after this slide).
• Run file names:
CENTRE1-<teamname>-<subtaskname>-[A,B]
where A = Advanced and B = Baseline.
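For illustration, a minimal sketch of writing a run file in this format, assuming the SYSDESC header takes the form <SYSDESC>...</SYSDESC> followed by standard TREC result lines; the WWW-1 page above is the authoritative specification.

```python
# Minimal sketch (check http://www.thuir.cn/ntcirwww/ for the authoritative
# format): a SYSDESC header line followed by standard TREC run lines
# "topic_id Q0 doc_id rank score run_name".

def write_run(path, run_name, description, results):
    """results: dict mapping topic_id -> list of (doc_id, score), best first."""
    with open(path, "w") as f:
        f.write(f"<SYSDESC>{description}</SYSDESC>\n")   # assumed header syntax
        for topic_id, ranked in results.items():
            for rank, (doc_id, score) in enumerate(ranked, start=1):
                f.write(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_name}\n")

# Hypothetical usage ("MYTEAM" and the document ID are placeholders):
write_run("CENTRE1-MYTEAM-T1-A", "CENTRE1-MYTEAM-T1-A",
          "Replication of RMIT-E-NU-Own-1 (SDM) with Indri",
          {"0001": [("clueweb12-0000tw-05-12114", 12.3456)]})
```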
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
Only one team participated…
Additional relevance assessments
• NTCIR-13 WWW-1 English qrels expanded based on
the new CENTRE runs (i.e. 6 runs from MPII).
• Depth-30 pools were created, following WWW-1.
• 2,617 previously unseen topic-doc pairs were judged
(see the pooling sketch below).
NTCIR-13 WWW E (original qrels): 100 topics, based on 13 runs from three teams;
22,912 topic-doc pairs.
CENTRE@NTCIR-14 (augmented qrels): 100 topics;
22,912 + 2,617 = 25,529 topic-doc pairs.
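For illustration, a minimal sketch of the depth-30 pooling step described above, assuming the run files are already parsed into per-topic ranked lists; the actual WWW-1 pooling tools may differ in detail.

```python
# Minimal sketch of depth-k pooling: union the top-30 documents of every new
# CENTRE run per topic, then keep only topic-doc pairs not already judged.

def new_pool(runs, judged, depth=30):
    """runs: list of dicts topic_id -> ranked doc_id list (best first);
       judged: set of (topic_id, doc_id) pairs already in the original qrels."""
    pool = set()
    for run in runs:
        for topic_id, ranked in run.items():
            for doc_id in ranked[:depth]:
                pool.add((topic_id, doc_id))
    return pool - judged   # only previously unseen pairs go out for judging
```

At CENTRE@NTCIR-14 this step yielded the 2,617 previously unseen pairs from the 6 MPII runs.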
Inter-assessor agreement
(two assessors per topic)
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
T1: examining the topicwise deltas
For each topic j, let Δj = Mj(A) − Mj(B) be the original topicwise improvement and
Δ'j = Mj(A') − Mj(B') the replicated topicwise improvement, where M is the
evaluation measure computed on the NTCIR collection. Two statistics are computed
over the n topics:
• RMSE (Root Mean Squared Error) of the topicwise deltas:
  RMSE = sqrt( (1/n) Σj (Δ'j − Δj)² )
• Pearson correlation of the topicwise deltas, i.e. the covariance of Δ and Δ'
  divided by the product of their standard deviations.
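For illustration, a minimal sketch of these two statistics over per-topic score arrays, assuming NumPy and SciPy (this is not the official evaluation script).

```python
# Minimal sketch: RMSE and Pearson correlation of the topicwise deltas.
import numpy as np
from scipy.stats import pearsonr

def delta_agreement(orig_A, orig_B, rep_A, rep_B):
    """Each argument: per-topic scores (same topic order across all four runs)."""
    delta_orig = np.asarray(orig_A) - np.asarray(orig_B)   # original deltas
    delta_rep = np.asarray(rep_A) - np.asarray(rep_B)      # replicated deltas
    rmse = np.sqrt(np.mean((delta_rep - delta_orig) ** 2))
    r, p_value = pearsonr(delta_orig, delta_rep)
    return rmse, r, p_value
```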
T1 and T2: Effect Ratio (1)
• T1 (replicability): ER = (replicated mean improvement) / (original mean improvement),
  i.e. the mean of Δ'j divided by the mean of Δj over the same NTCIR topics.
• T2TREC (reproducibility): ER = (reproduced mean improvement on the NTCIR topics) /
  (original mean improvement on the TREC topics).
T1 and T2: Effect Ratio (2)
• ER compares the sample (unstandardised) effect
size in the “rep” experiment against that in the
original experiment. “Can we still observe the same
effect?”
• ER > 1 ⇒ effect larger in the “rep” experiment
• ER = 1 ⇒ effect identical to the original experiment
• 0 < ER < 1 ⇒ effect smaller in the “rep” experiment
• ER ≤ 0 ⇒ FAILURE: the A-run does not outperform the B-run in the “rep” experiment
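For illustration, a minimal sketch of the Effect Ratio over per-topic score arrays; for T2 the original and the “rep” experiments use different topic sets, so only the two means are compared (this is not the official evaluation script).

```python
# Minimal sketch: Effect Ratio = mean "rep" improvement / mean original improvement.
import numpy as np

def effect_ratio(orig_A, orig_B, rep_A, rep_B):
    """orig_*: per-topic scores on the original collection;
       rep_*:  per-topic scores in the replicated/reproduced experiment
               (the two topic sets may differ, so only the means are compared)."""
    mean_orig = np.mean(np.asarray(orig_A) - np.asarray(orig_B))
    mean_rep = np.mean(np.asarray(rep_A) - np.asarray(rep_B))
    return mean_rep / mean_orig   # ER <= 0: the effect was not observed
```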
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
Replicability: w/o additional assessments (1)
• RMIT said “SDM (A) outperforms FDM (B)”
• MPII confirms this!
• A statistically significantly outperforms B even with nERR!
• Effect sizes for nERR (Orig vs Repli) quite similar.
Replicability: w/o additional assessments (2)
• ER for nERR close to 1
• Correlation (OrigΔ vs RepliΔ) for nERR statistically significant
Replicability: WITH additional assessments
• ER for nERR close to 1
• Correlation (OrigΔ vs RepliΔ) for nERR statistically significant
[Scatter plot of the topicwise deltas for nERR: OrigΔ vs RepliΔ;
r = 0.2610, 95% CI [0.0680, 0.4351].
Example topics: OrigΔ ≒ 0.4 but RepliΔ ≒ 0.8 (A>B);
OrigΔ ≒ −0.3 but RepliΔ ≒ −0.6 (A<B).]
But replicability not so successful at the topic level
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
Reproducibility: w/o additional assessments (1)
• UDel said “web-based working sets (A) outperform
collection-based working sets (B)” for TREC
• MPII confirms this for NTCIR
• A statistically significantly outperforms B even with Q!
• Effect sizes for Q (Orig vs Repli) quite similar.
Reproducibility: w/o additional assessments (2)
• ER for Q close to 1
Reproducibility: WITH additional assessments
• Larger ER values!
• Smaller p-values!
• New relevant docs were found by the CENTRE runs; the additional assessments were worthwhile!
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
CENTRE@NTCIR-14 conclusions
• We focussed on: “Can an effect (the gain of A over B)
be replicated/reproduced?”
• Replication of the RMIT effect on WWW-1 data:
successful at the ER level; not so successful at the
topic level.
• Reproduction of the TREC UDel effect on WWW-1
data: successful at the ER level; additional
relevance assessments further improved the ERs.
• Only one participating team…this needs to change.
SPECIAL THANKS
• To Andrew Yates (Max Planck Institute for
Informatics, Germany)
• MPII at the NTCIR-14 CENTRE Task
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
WWW and CENTRE join forces
• WWW-3 task will have three subtasks:
(1) Adhoc Chinese
(2) Adhoc English (WWW-2 + new WWW-3 topics)
(3) CENTRE (WWW-2 + new WWW-3 topics)
• CENTRE will accept two run types:
(a) Revived runs from WWW-2 (Tsinghua and RUC)
(b) REP (REPlicated and REProduced) runs: Tsinghua
and RUC algorithms re-implemented by other
teams
[Diagram] WWW-2 target run 1 (Team A) and WWW-2 target run 2 (Team B) are revived
by the same teams as WWW-3 CENTRE revived runs 1 and 2. The WWW-2 topics are
processed by the WWW-2 runs; the new WWW-3 topics are processed by the WWW-3 runs,
including the WWW-3 Adhoc runs (Team D).
Measuring progress: WWW-2 top run → WWW-3 top run
COMPARE them on the WWW-3 topics (since WWW-3 systems may be tuned with the WWW-2 topics).
[Diagram, extending the previous figure] The WWW-2 target algorithms are also
re-implemented by a different team: Team C contributes WWW-3 CENTRE REP run 1-1
and REP run 1-2 (replicated/reproduced runs).
Measuring replicability on WWW-2 topics
COMPARE the WWW-2 target/revived runs with the REP runs, using the WWW-2 topics.
[Same diagram as the previous slide]
Measuring reproducibility: WWW-2 topics → WWW-3 topics
COMPARE, using the WWW-2 topics for the original runs and the WWW-3 topics for the REP runs.
TALK OUTLINE
1. CENTRE@CLEF, NTCIR, TREC: motivations
2. CENTRE@NTCIR-14 task specifications
3. Additional relevance assessments
4. Evaluation measures
5. Replicability results
6. Reproducibility results
7. CENTRE@NTCIR-14 conclusions
8. Plans for NTCIR-15
9. Related efforts
CENTRE@CLEF 2019
CLEF NTCIR TREC
REproducibility
Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya
Sakai, and Ian Soboroff
April 15, 2019, Cologne
http://www.centre-eval.org/clef2019/
CENTRE@CLEF 2019
Tasks
The CENTRE@CLEF 2019 lab will offer three tasks:
• Task 1 - Replicability: the task will focus on the
replicability of selected methods on the same
experimental collections;
• Task 2 - Reproducibility: the task will focus on the
reproducibility of selected methods on different
experimental collections;
• Task 3 - Generalizability: the task will focus on collection
performance prediction and the goal is to rank (sub-
)collections on the basis of the expected performance
over them.
Task 1 & 2: Selected
Papers
For Task 1 - Replicability and Task 2 - Reproducibility we
selected the following papers from TREC Common
Core Track 2017 and 2018:
• [Grossman et al, 2017] Grossman, M. R., and
Cormack, G. V. (2017). MRG_UWaterloo and
Waterloo Cormack Participation in the TREC 2017
Common Core Track. In TREC 2017.
• [Benham et al, 2018] Benham, R., Gallagher, L.,
Mackenzie, J., Liu, B., Lu, X., Scholer, F., Moffat, A.,
and Culpepper, J. S. (2018). RMIT at the 2018 TREC
CORE Track. In TREC 2018.
Task 3 - Generalizability
• Training: participants need to run plain BM25 and
they need to identify features of the corpus and
topics that allow them to predict the system score
with respect to Average Precision (AP).
• Validation: participants can validate their method
and determine which set of features represents the
best choice for predicting the AP score for each system.
• Test (submission): participants submit a run for
each system (BM25 and their own system) and an
additional file (one for each system) including the
AP score predicted for each topic.
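For illustration, a minimal sketch of the prediction step, assuming per-topic feature vectors and AP labels are available; the use of a random forest regressor (scikit-learn) is an assumption, since the task does not prescribe a model.

```python
# Minimal sketch: predict per-topic AP from corpus/topic features with a
# generic regressor (the model choice is an assumption, not part of the task).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_ap_predictor(train_features, train_ap):
    """train_features: (n_topics, n_features) array; train_ap: per-topic AP values."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(np.asarray(train_features), np.asarray(train_ap))
    return model

def predict_ap(model, test_features):
    """Return one predicted AP value per test topic."""
    return model.predict(np.asarray(test_features))
```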
SIGIR 2019 Open-Source IR
Replicability Challenge (OSIRRC 2019)
• Repeatability, replicability, and reproducibility:
• Good science!
• Helps sustain cumulative empirical progress in the field
• OSIRRC is a workshop at SIGIR with three goals:
• Develop a common Docker interface to capture ad hoc
retrieval experiments
• Build a library of Docker images that implements the
interface
• Explore this approach to additional tasks and evaluation
methodologies
OSIRRC 2019: Current Status
• We’ve converged on a Docker specification for ad
hoc retrieval experiments – called “the jig”
• All materials at https://github.com/osirrc
• Initial library of 12 images from 9 different groups
under development
• It’s not too late if you still want to participate!
• Wrapping an existing retrieval system in “the jig” isn’t
too difficult a task…
• See you at SIGIR 2019!
Nicola Ferro
@frrncl
University of Padua, Italy
The 3rd Strategic Workshop on Information Retrieval in Lorne (SWIRL III)
14 February 2018, Lorne, Australia
ACM Artifact Badging - SIGIR
Survey
ACM Badging: Artifacts Evaluated
This badge is applied to papers whose associated artifacts have
successfully completed an independent audit. Artifacts need not
be made publicly available to be considered for this badge.
However, they do need to be made available to reviewers.
Artifacts Evaluated – Functional
The artifacts associated with the research are found to be documented,
consistent, complete, exercisable, and include appropriate evidence of
verification and validation.
Artifacts Evaluated – Reusable
The artifacts associated with the paper are of a quality that significantly
exceeds minimal functionality. That is, they have all the qualities of the
Artifacts Evaluated – Functional level, but, in addition, they are very
carefully documented and well-structured to the extent that reuse and
repurposing is facilitated. In particular, norms and standards of the
research community for artifacts of this type are strictly adhered to.
ACM Badging: Artifacts Available
This badge is applied to papers in which
associated artifacts have been made
permanently available for retrieval.
Author-created artifacts relevant to this paper
have been placed on a publicly accessible
archival repository. A DOI or link to this
repository along with a unique identifier for
the object is provided.
Repositories used to archive data should have a
declared plan to enable permanent accessibility.
Personal web pages are not acceptable for this
purpose.
Artifacts do not need to have been formally
evaluated in order for an article to receive this badge.
ACM Badging: Results Validated
This badge is applied to papers in which the main results of the
paper have been successfully obtained by a person or team
other than the author.
Results Replicated
The main results of the paper have been obtained in a subsequent
study by a person or team other than the authors, using, in part,
artifacts provided by the author.
Results Reproduced
The main results of the paper have been independently obtained in a
subsequent study by a person or team other than the authors,
without the use of author supplied artifacts.
Exact replication or reproduction of results is not required,
or even expected. In particular, differences in the results
should not change the main claims made in the paper.
SIGIR Badging Task Force
Nicola Ferro (chair)
University of Padua, Italy
Diane Kelly (ex-officio)
The University of Tennessee,
Knoxville, USA
Maarten de Rijke (ex-officio)
University of Amsterdam, The
Netherlands
Leif Azzopardi
University of Strathclyde, UK
Peter Bailey
Microsoft, USA
Hannah Bast
University of Freiburg, Germany
Rob Capra
University of North Carolina at
Chapel Hill, USA
Norbert Fuhr
University of Duisburg-Essen,
Germany
Yiqun Liu
Tsinghua University, China
Martin Potthast
Leipzig University, Germany
Filip Radlinski
Google London, UK
Tetsuya Sakai
Waseda University, Japan
Ian Soboroff
National Institute of Standards and
Technology, USA
Arjen de Vries
Radboud University, The
Netherlands
