Overview of the NTCIR-15
We Want Web with CENTRE (WWW-3) Task
December 9, 2020 @ NTCIR-15 (virtual conference)
Web Search is not a solved problem!
• Are we making progress?
(Example: does deep
learning-based reranking
really outperform a
properly-tuned BM25 for
any query?)
• Can we
replicate/reproduce the
findings? (same method,
same/different data,
different research groups)
TALK OUTLINE
• Chinese subtask
• English subtask
• CENTRE
• Summary
• NTCIR-16 WWW-4
Chinese subtask definition
• Input: 80 WWW-2 topics and 80 new WWW-3
topics (participants had access to the original qrels
for WWW-2)
• Output: TREC-style run file
• Target corpus: SogouT-16
• All runs were pooled and relevance assessments
were conducted for 80 WWW-3 new topics
• Runs are also scored based on the 80 WWW-3
topics
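As a reminder of what a TREC-style run file looks like, the sketch below writes and parses the conventional six-field lines (topic ID, the literal Q0, document ID, rank, score, run tag). The topic ID, document IDs, and run tag are made-up placeholders, and any WWW-3-specific conventions (such as header lines or maximum list length) are not covered here.

```python
from typing import Dict, List, Tuple


def format_run(results: Dict[str, List[Tuple[str, float]]], run_tag: str) -> List[str]:
    """Render per-topic (doc_id, score) lists as TREC-style run lines."""
    lines = []
    for topic_id in sorted(results):
        # Rank documents by decreasing score; ranks start at 1.
        ranked = sorted(results[topic_id], key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines


def parse_run(lines: List[str]) -> Dict[str, List[str]]:
    """Parse run lines back into topic_id -> document IDs ordered by rank."""
    by_topic: Dict[str, List[Tuple[int, str]]] = {}
    for line in lines:
        topic_id, _q0, doc_id, rank, _score, _tag = line.split()
        by_topic.setdefault(topic_id, []).append((int(rank), doc_id))
    return {t: [doc for _, doc in sorted(pairs)] for t, pairs in by_topic.items()}


# Hypothetical topic and document IDs, for illustration only.
example = {"0101": [("doc-b", 1.5), ("doc-a", 2.25)]}
run_lines = format_run(example, run_tag="EXAMPLE-RUN")
```

Keeping the writer and the parser side by side makes it easy to check that a submitted file round-trips before it is pooled.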
Topics
• The 80 queries were sampled from one day of Sogou’s
query logs in August 2018; they comprise 54
torso queries, 13 tail queries, and 13 hot queries.
Runs and qrels
• 11 runs from 3 teams (including the organisers’
baseline) were submitted and pooled
Official results (nDCG and Q)
Official results (nERR and iRBU)
Randomised Tukey HSD test results (nDCG and Q)
[Figure: OUTPERFORMS relations between runs]
Randomised Tukey HSD test results (nERR and iRBU)
[Figure: OUTPERFORMS relations between runs]
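For readers who have not met the randomised Tukey HSD test, here is a minimal sketch of the usual procedure: shuffle each topic's scores across runs, record the largest gap between permuted run means, and count how often that gap reaches the observed gap for each run pair. The run names, the synthetic scores, and the number of trials are illustrative assumptions, not the settings used for the official results.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple


def randomised_tukey_hsd(scores: Dict[str, List[float]], trials: int = 1000,
                         seed: int = 0) -> Dict[Tuple[str, str], float]:
    """Return a p-value per run pair from a topic-by-run score matrix.

    `scores` maps run name -> per-topic scores (same topic order in every run).
    """
    rng = random.Random(seed)
    runs = sorted(scores)
    n_topics = len(scores[runs[0]])
    observed = {r: sum(scores[r]) / n_topics for r in runs}
    pairs = list(combinations(runs, 2))
    exceed = {pair: 0 for pair in pairs}

    for _ in range(trials):
        # Under the null hypothesis the run labels are exchangeable within
        # each topic, so shuffle every topic's row of scores independently.
        sums = [0.0] * len(runs)
        for t in range(n_topics):
            row = [scores[r][t] for r in runs]
            rng.shuffle(row)
            for i, value in enumerate(row):
                sums[i] += value
        means = [s / n_topics for s in sums]
        # Largest difference between any two permuted run means.
        max_diff = max(means) - min(means)
        for a, b in pairs:
            if max_diff >= abs(observed[a] - observed[b]):
                exceed[(a, b)] += 1
    return {pair: count / trials for pair, count in exceed.items()}


# Synthetic per-topic scores for three hypothetical runs over 20 topics.
data_rng = random.Random(1)
base = [data_rng.uniform(0.2, 0.4) for _ in range(20)]
toy = {
    "run-strong": [b + 0.4 for b in base],
    "run-weak": list(base),
    "run-weak-copy": list(base),
}
p_values = randomised_tukey_hsd(toy, trials=500)
```

Because every pair is compared against the same maximum difference, the familywise error rate is controlled across all run pairs at once.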
English subtask definition
• Input: 80 WWW-2 topics and 80 new WWW-3
topics (participants had access to the original qrels
for WWW-2)
• Output: TREC-style run file
• Target corpus: ClueWeb12-B13
• All runs were pooled and relevance assessments
were conducted for all 160 topics
• Runs are scored based on the 80 WWW-3 topics
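The pooling step mentioned above can be sketched as depth-k pooling: for each topic, take the union of the top-k documents from every submitted run and send that set to the assessors. The pool depth and the toy runs below are assumptions for illustration; the slides do not state the depth actually used.

```python
from typing import Dict, List, Set

Run = Dict[str, List[str]]  # topic_id -> document IDs in rank order


def build_pool(runs: List[Run], depth: int) -> Dict[str, Set[str]]:
    """Union the top-`depth` documents of every run, per topic."""
    pool: Dict[str, Set[str]] = {}
    for run in runs:
        for topic_id, ranked_docs in run.items():
            pool.setdefault(topic_id, set()).update(ranked_docs[:depth])
    return pool


# Two hypothetical runs over one hypothetical topic.
run_a = {"0001": ["d1", "d2", "d3"]}
run_b = {"0001": ["d2", "d4", "d1"]}
pool = build_pool([run_a, run_b], depth=2)
```

Documents outside the pool stay unjudged, which is why the number and diversity of pooled runs matters for the reusability of the qrels.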
The original plan with a REV run
(a revived system from NTCIR-14)
• Replicability: compare a repli run with a REV run on
the WWW-2 topics
• Reproducibility: compare a repro run effectiveness
on the WWW-3 topics with a REV run effectiveness
on the WWW-2 topics
• Progress: compare new runs and a REV run (SOTA
from NTCIR-14) on the WWW-3 topics
But unfortunately, we could not obtain a reliable
REV run that represents the SOTA from NTCIR-14
on the NTCIR-15 WWW-3 topics.
Runs and qrels
• 37 runs from 9 teams (including the organisers’
baseline) were submitted and pooled
Official top 10 runs (nDCG and Q)
Official top 10 runs (nERR and iRBU)
Randomised Tukey HSD test results (nDCG and Q) – top runs only
[Figure: OUTPERFORMS relations between runs]
Randomised Tukey HSD test results (nERR and iRBU) – top runs only
[Figure: OUTPERFORMS relations between runs]
Replicability and Reproducibility
Terminology
“An experimental result is not fully established unless
it can be independently reproduced.”
OLD ACM Terminology (Version 1.0):
• Replicability: Different team, same experimental
setup
• Reproducibility: Different team, different
experimental setup
With the new ACM terminology (Version 1.1)
replicability and reproducibility are swapped!
Version 1.0: https://www.acm.org/publications/policies/artifact-review-badging
Version 1.1: https://www.acm.org/publications/policies/artifact-review-and-badging-current
Replicability Measures
Ranking: Kendall’s τ and RBO
Absolute per-topic effectiveness: RMSE_abs
Statistical approach: p-value of paired t-test
Effect over a baseline: RMSE_Δ, Effect Ratio (ER_repli), and Delta
Relative Improvement (ΔRI_repli)
Reproducibility measures
Statistical approach: p-value of unpaired t-test
Effect over a baseline
Replicability & Reproducibility
Runs
• Target Runs submitted at WWW-2:
• Advanced: THUIR-E-CO-MAN-Base2 (LambdaMART)
• Baseline: THUIR-E-CO-PU-Base4 (BM25)
• Replicability and Reproducibility runs submitted at
WWW-3:
• Advanced: KASYS-E-CO-REP-2 and SLWWW-E-CO-REP-4
• Baseline: KASYS-E-CO-REP-3
• Replicability: WWW-2 qrels and topics;
• Reproducibility: WWW-2 qrels and topics
compared against WWW-3 qrels and topics.
Replicability recap
[Diagram: a grid with columns "WWW-2 topics" and "WWW-3 topics" and rows "WWW-2 runs" and "WWW-3 runs"; each of the two compared cells holds an A-run (advanced), a B-run (baseline), and the Effect between them]
Replicability Results: Ranking of
Documents
• Kendall’s τ and RBO: computed between the original
ranking of documents and the replicated ranking;
• The closer to 1 the better the replicated run;
• Scores close to 0 mean that the original and replicated
runs are not correlated;
• It is extremely hard to obtain the same list of
documents!
• RMSE: the closer to 0 the better;
• p-value: small p-value means that the runs are
significantly different (without specifying whether
they are better or not);
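The ranking and RMSE measures described above can be sketched in plain Python as follows. Two simplifications are assumptions made here: Kendall's τ is restricted to documents that appear in both rankings, and RBO is summed only to the depth of the shorter list without extrapolation. The task's own tooling may handle non-overlapping documents and list depth differently.

```python
import math
from itertools import combinations
from typing import Dict, List


def kendall_tau(original: List[str], replicated: List[str]) -> float:
    """Kendall's tau over the documents that both rankings contain."""
    position = {doc: i for i, doc in enumerate(replicated)}
    common = [doc for doc in original if doc in position]
    pairs = list(combinations(common, 2))
    if not pairs:
        raise ValueError("need at least two shared documents")
    # `common` follows the original order, so a pair is concordant when the
    # replicated ranking places the two documents in the same order.
    concordant = sum(1 for a, b in pairs if position[a] < position[b])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


def rbo_truncated(original: List[str], replicated: List[str], p: float = 0.9) -> float:
    """Rank-biased overlap summed down to the shorter list's depth.

    No extrapolation is applied, so identical lists of depth k score
    1 - p**k rather than exactly 1.
    """
    depth = min(len(original), len(replicated))
    seen_o, seen_r = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_o.add(original[d - 1])
        seen_r.add(replicated[d - 1])
        agreement = len(seen_o & seen_r) / d  # overlap of the two top-d prefixes
        total += p ** (d - 1) * agreement
    return (1 - p) * total


def rmse(original: Dict[str, float], replicated: Dict[str, float]) -> float:
    """Root mean square error between per-topic scores of two runs."""
    squared = [(original[t] - replicated[t]) ** 2 for t in sorted(original)]
    return math.sqrt(sum(squared) / len(squared))
```

For example, `kendall_tau(["a", "b", "c"], ["c", "b", "a"])` gives -1.0 for a fully reversed ranking, while `rmse` returns 0 only when every per-topic score matches.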
Replicability Results: RMSE and p-values
• Large RMSEs
• Very small p-values
Replicability Results: Effect over a
Baseline
• Implications of ER scores:
• ER ≤ 0: Failed replication, A-run failed to outperform the
B-run;
• 0 < ER < 1: Somewhat successful, the replicated effect is
smaller compared to the original effect;
• ER = 1: Perfect replication;
• ER > 1: Successful replication, the replicated effect is
larger compared to the original effect.
• Similar interpretation of ΔRI but 0 is the perfect
replication;
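The ER and ΔRI interpretations above can be made concrete with a small sketch. It assumes ER is the mean per-topic improvement of the replicated A-run over its B-run divided by the same quantity for the original pair, and that ΔRI is the original relative improvement minus the replicated one; the scores are invented for illustration.

```python
import math
from typing import List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def effect_ratio(orig_a: List[float], orig_b: List[float],
                 repl_a: List[float], repl_b: List[float]) -> float:
    """Replicated mean per-topic improvement divided by the original one."""
    orig_effect = mean([a - b for a, b in zip(orig_a, orig_b)])
    repl_effect = mean([a - b for a, b in zip(repl_a, repl_b)])
    return repl_effect / orig_effect  # assumes a non-zero original effect


def delta_relative_improvement(orig_a: List[float], orig_b: List[float],
                               repl_a: List[float], repl_b: List[float]) -> float:
    """Original relative improvement minus the replicated one."""
    ri_orig = (mean(orig_a) - mean(orig_b)) / mean(orig_b)
    ri_repl = (mean(repl_a) - mean(repl_b)) / mean(repl_b)
    return ri_orig - ri_repl


def interpret_er(er: float) -> str:
    """Map an ER score to the categories listed on the slide."""
    if math.isclose(er, 1.0):
        return "perfect replication"
    if er <= 0:
        return "failed replication"
    if er < 1:
        return "somewhat successful (smaller effect)"
    return "successful replication (larger effect)"


# Invented per-topic scores for two topics.
orig_a, orig_b = [0.6, 0.8], [0.4, 0.6]
repl_a, repl_b = [0.5, 0.7], [0.4, 0.6]
er = effect_ratio(orig_a, orig_b, repl_a, repl_b)
dri = delta_relative_improvement(orig_a, orig_b, repl_a, repl_b)
```

Here the replicated effect is half the original one, so ER is about 0.5 and ΔRI is about 0.2, which falls in the "somewhat successful" band.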
Reproducibility recap
[Diagram: a grid with columns "WWW-2 topics" and "WWW-3 topics" and rows "WWW-2 runs" and "WWW-3 runs"; each of the two compared cells holds an A-run (advanced), a B-run (baseline), and the Effect between them]
Reproducibility Results: p-values
and Effects over a Baseline
• Recall that there is no target original run;
• Reproducibility is even harder than replicability!
Summary
• Chinese subtask (only 3 teams)
Best run: RUCIR-C-CD-NEW-4
• English subtask (9 teams)
Best runs: KASYS-E-CO-NEW-{1,4} and mpii-E-CO-NEW-1.
KASYS uses a BERT-based method from
[Yilmaz+ EMNLP2019].
• CENTRE:
We need a community effort since replicability and
reproducibility are very tough problems!
Thank you, participants!
And many thanks to the NTCIR PC chairs, GCs, and staff!
WWW will be back
(IF our task proposal is accepted)
• English subtask only
• New English corpus! (Common Crawl?)
• New target for replicability, reproducibility, and a
baseline for progress:
University of Tsukuba’s BERT-based run from WWW-3
• Topics to be released in October 2021
• Run submission deadline in November 2021
• Please follow @ntcirwww on Twitter!
Selected references
[Breuer+ SIGIR2020] How to Measure the Reproducibility
of System-oriented IR Experiments, ACM SIGIR 2020.
(More about CENTRE)
[Sakai+ TOIS2020] Retrieval Evaluation Measures that
Agree with Users' SERP Preferences: Traditional,
Preference-based, and Diversity Measures, ACM TOIS
39(2), to appear, 2020.
(Evaluation measures including iRBU)
[Yilmaz+ EMNLP2019] Cross-Domain Modeling of
Sentence-Level Evidence for Document Retrieval, EMNLP
2019.
(University of Tsukuba’s top run is based on this)

What's New in Teams Calling, Meetings and Devices March 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 

NTCIR15WWW3overview

  • 1. Overview of the NTCIR-15 We Want Web with CENTRE (WWW-3) Task December 9, 2020@NTCIR-15 (virtual conference)
  • 2. Web Search is not a solved problem! • Are we making progress? (Example: does deep learning-based reranking really outperform a properly-tuned BM25 for any query?) • Can we replicate/reproduce the findings? (same method, same/different data, different research groups)
  • 3. TALK OUTLINE • Chinese subtask • English subtask • CENTRE • Summary • NTCIR-16 WWW-4
  • 4. Chinese subtask definition • Input: 80 WWW-2 topics and 80 new WWW-3 topics (participants had access to the original qrels for WWW-2) • Output: TREC-style run file • Target corpus: SogouT-16 • All runs were pooled and relevance assessments were conducted for the 80 new WWW-3 topics • Runs are likewise scored based on the 80 WWW-3 topics
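As a rough illustration of the "TREC-style run file" output, the sketch below parses the conventional six-column layout (topic ID, the literal `Q0`, document ID, rank, score, run tag). The topic IDs, document IDs, and run tag are hypothetical, and any task-specific header lines are not handled. Ordering by score rather than by the rank column is an assumption borrowed from how `trec_eval` treats runs.

```python
from collections import defaultdict


def parse_trec_run(text):
    """Parse TREC-style run lines: topic_id Q0 doc_id rank score run_tag."""
    scored = defaultdict(list)
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip anything that is not a six-column run line
        topic_id, _q0, doc_id, _rank, score, _tag = parts
        scored[topic_id].append((doc_id, float(score)))
    # Order each topic's documents by descending score (assumption, see lead-in).
    return {
        topic: [doc for doc, _ in sorted(docs, key=lambda pair: -pair[1])]
        for topic, docs in scored.items()
    }


# Hypothetical sample: the IDs and the run tag are made up for illustration.
SAMPLE_RUN = """\
0101 Q0 doc-b 2 11.7 demo-run
0101 Q0 doc-a 1 12.5 demo-run
0102 Q0 doc-c 1 9.3 demo-run
"""

rankings = parse_trec_run(SAMPLE_RUN)
print(rankings)
```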
  • 5. Topics • The 80 queries were sampled from Sogou’s query logs in one day of August 2018, which contain 54 torso queries, 13 tail queries and 13 hot queries.
  • 6. Runs and qrels • 11 runs from 3 teams (including the organisers’ baseline) were submitted and pooled
  • 9. Randomised Tukey HSD test results (nDCG and Q) OUTPERFORMS
  • 10. Randomised Tukey HSD test results (nERR and iRBU) OUTPERFORMS
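The randomised Tukey HSD test named on these slides can be sketched as follows. This is a minimal version under stated assumptions, not the organisers' tool: per-topic scores are shuffled across runs within each topic, and each observed pairwise difference in mean scores is compared against the largest difference between any two permuted means. The function name, trial count, and demo scores are all hypothetical.

```python
import random
from itertools import combinations


def randomised_tukey_hsd(scores, trials=1000, seed=0):
    """scores maps run name -> per-topic scores (same topic order for all runs).

    Returns an estimated p-value for every pair of runs.
    """
    rng = random.Random(seed)
    names = list(scores)
    n_topics = len(scores[names[0]])
    means = {r: sum(scores[r]) / n_topics for r in names}
    observed = {(a, b): abs(means[a] - means[b]) for a, b in combinations(names, 2)}
    counts = dict.fromkeys(observed, 0)
    for _ in range(trials):
        totals = [0.0] * len(names)
        for t in range(n_topics):
            row = [scores[r][t] for r in names]
            rng.shuffle(row)  # permute this topic's scores across runs
            for i, value in enumerate(row):
                totals[i] += value
        # Largest difference between any two permuted mean scores.
        max_diff = (max(totals) - min(totals)) / n_topics
        for pair, diff in observed.items():
            if max_diff >= diff:
                counts[pair] += 1
    return {pair: count / trials for pair, count in counts.items()}


# Hypothetical per-topic scores for three runs over 20 topics.
demo_scores = {"A": [0.9] * 20, "B": [0.1] * 20, "C": [0.1] * 20}
p_values = randomised_tukey_hsd(demo_scores, trials=200)
print(p_values)
```

Because every pair is compared against the same maximum difference, the procedure accounts for the number of runs being compared at once, which is why it suits tables of many submitted runs.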
  • 11. TALK OUTLINE • Chinese subtask • English subtask • CENTRE • Summary • NTCIR-16 WWW-4
  • 12. English subtask definition • Input: 80 WWW-2 topics and 80 new WWW-3 topics (participants had access to the original qrels for WWW-2) • Output: TREC-style run file • Target corpus: clueweb12-B13 • All runs were pooled and relevance assessments were conducted for all 160 topics • Runs are scored based on the 80 WWW-3 topics
  • 13. The original plan with a REV run (a revived system from NTCIR-14) • Replicability: compare a repli run with a REV run on the WWW-2 topics • Reproducibility: compare a repro run effectiveness on the WWW-3 topics with a REV run effectiveness on the WWW-2 topics • Progress: compare new runs and a REV run (SOTA from NTCIR-14) on the WWW-3 topics But unfortunately, we could not obtain a reliable REV run that represents the SOTA from NTCIR-14 on the NTCIR-15 WWW-3 topics.
  • 14. Runs and qrels • 37 runs from 9 teams (including the organisers’ baseline) were submitted and pooled
  • 15. Official top 10 runs (nDCG and Q)
  • 16. Official top 10 runs (nERR and iRBU)
  • 17. Randomised Tukey HSD test results (nDCG and Q) – top runs only OUTPERFORMS
  • 18. Randomised Tukey HSD test results (nERR and iRBU) – top runs only OUTPERFORMS
  • 19. TALK OUTLINE • Chinese subtask • English subtask • CENTRE • Summary • NTCIR-16 WWW-4
  • 20. Replicability and Reproducibility Terminology “An experimental result is not fully established unless it can be independently reproduced.” OLD ACM Terminology (Version 1.0): • Replicability: Different team, same experimental setup • Reproducibility: Different team, different experimental setup With the new ACM terminology (Version 1.1) replicability and reproducibility are swapped! Version 1.0: https://www.acm.org/publications/policies/artifact-review-badging Version 1.1: https://www.acm.org/publications/policies/artifact-review-and-badging-current
  • 21. Replicability Measures • Ranking: Kendall’s τ and RBO • Absolute Per-Topic Effectiveness: RMSEabs • Statistical approach: p-value of paired t-test • Effect over a baseline: RMSEΔ, Effect Ratio (ERrepli) and Delta Relative Improvement (ΔRIrepli) • Reproducibility measures unpaired
  • 22. Replicability & Reproducibility Runs • Target Runs submitted at WWW-2: • Advanced: THUIR-E-CO-MAN-Base2 (LambdaMART) • Baseline: THUIR-E-CO-PU-Base4 (BM25) • Replicability and Reproducibility runs submitted at WWW-3: • Advanced: KASYS-E-CO-REP-2 and SLWWW-E-CO-REP-4 • Baseline: KASYS-E-CO-REP-3 • Replicability: WWW-2 qrels and topics; • Reproducibility: WWW-2 qrels and topics compared against WWW-3 qrels and topics.
  • 23. Replicability recap [diagram: WWW-2 topics and WWW-3 topics against WWW-2 runs and WWW-3 runs; each side shows an A-run (advanced), a B-run (baseline), and the Effect between them]
  • 24. Replicability Results: Ranking of Documents • Kendall’s τ and RBO: computed between the original ranking of documents and the replicated ranking; • The closer to 1 the better the replicated run; • Scores close to 0 mean that the original and replicated runs are not correlated; • It is extremely hard to obtain the same list of documents!
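To make the two ranking measures concrete, here is a minimal sketch for a single topic. Two simplifications are assumptions rather than the task's official definitions: Kendall's τ is computed only over documents that appear in both rankings, and RBO is the truncated form without extrapolation, with a hypothetical persistence value `p=0.9`. Rankings are assumed to contain no duplicate documents.

```python
from itertools import combinations


def kendall_tau(original, replicated):
    """Kendall's tau over the documents present in both rankings."""
    position = {doc: i for i, doc in enumerate(replicated)}
    common = [doc for doc in original if doc in position]
    if len(common) < 2:
        return 0.0  # not enough shared documents to compare any pair
    concordant = discordant = 0
    for first, second in combinations(common, 2):
        # 'first' precedes 'second' in the original ranking by construction.
        if position[first] < position[second]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def rbo_truncated(original, replicated, p=0.9):
    """Truncated rank-biased overlap down to the shorter list's depth."""
    depth = min(len(original), len(replicated))
    seen_original, seen_replicated = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_original.add(original[d - 1])
        seen_replicated.add(replicated[d - 1])
        agreement = len(seen_original & seen_replicated) / d
        total += p ** (d - 1) * agreement
    return (1 - p) * total


# Hypothetical document IDs.
original = ["d1", "d2", "d3", "d4"]
replicated = ["d2", "d1", "d3", "d9"]
print(kendall_tau(original, replicated), rbo_truncated(original, replicated))
```

Note that the truncated form yields 1 − p^k rather than exactly 1 for two identical lists of length k, so it approaches 1 only as the evaluation depth grows.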
  • 25. Replicability Results: RMSE and p-values • RMSE: the closer to 0 the better; • p-value: a small p-value means that the runs are significantly different (without indicating which run is better); • Observed: large RMSEs and very small p-values
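A minimal sketch of the two quantities on this slide, computed from per-topic scores of an original run and its replica on the same topics. The scores are hypothetical. Only the paired t statistic is computed here; turning it into a p-value requires the t distribution with n − 1 degrees of freedom, for example via a statistics library.

```python
import math


def rmse(original_scores, replicated_scores):
    """Root mean square error between per-topic scores of two runs."""
    diffs = [o - r for o, r in zip(original_scores, replicated_scores)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def paired_t_statistic(original_scores, replicated_scores):
    """Paired t statistic over per-topic score differences."""
    diffs = [o - r for o, r in zip(original_scores, replicated_scores)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all per-topic differences are identical")
    return mean_diff / math.sqrt(variance / n)


# Hypothetical per-topic nDCG scores for three topics.
original_ndcg = [0.5, 0.4, 0.6]
replicated_ndcg = [0.4, 0.4, 0.8]
print(rmse(original_ndcg, replicated_ndcg))
print(paired_t_statistic(original_ndcg, replicated_ndcg))
```

The two measures can disagree: positive and negative per-topic differences cancel in the t statistic but not in RMSE, so a replica can show a large RMSE without a significant difference in means.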
  • 26. Replicability Results: Effect over a Baseline • Implications of ER scores: • ER ≤ 0: Failed replication, the A-run failed to outperform the B-run; • 0 < ER < 1: Somewhat successful, the replicated effect is smaller than the original effect; • ER = 1: Perfect replication; • ER > 1: Successful replication, the replicated effect is larger than the original effect. • ΔRI is interpreted similarly, except that 0 indicates perfect replication;
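The ER and ΔRI interpretations above can be checked with a small sketch. The definitions used here are assumptions: ER is taken as the mean per-topic improvement of the replicated A-run over its B-run divided by the same quantity for the original runs, and ΔRI as the original relative improvement minus the replicated one. See [Breuer+ SIGIR2020] for the authoritative formulas. All scores are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def effect_ratio(orig_a, orig_b, repl_a, repl_b):
    """ER: replicated mean per-topic effect divided by the original one."""
    original_effect = mean([a - b for a, b in zip(orig_a, orig_b)])
    replicated_effect = mean([a - b for a, b in zip(repl_a, repl_b)])
    return replicated_effect / original_effect


def delta_relative_improvement(orig_a, orig_b, repl_a, repl_b):
    """Delta RI: original relative improvement minus the replicated one."""
    ri_original = (mean(orig_a) - mean(orig_b)) / mean(orig_b)
    ri_replicated = (mean(repl_a) - mean(repl_b)) / mean(repl_b)
    return ri_original - ri_replicated


# Hypothetical per-topic scores: A = advanced run, B = baseline run.
orig_a, orig_b = [0.6, 0.5], [0.4, 0.3]
repl_a, repl_b = [0.5, 0.4], [0.4, 0.3]

er = effect_ratio(orig_a, orig_b, repl_a, repl_b)
dri = delta_relative_improvement(orig_a, orig_b, repl_a, repl_b)
print(er, dri)
```

In this made-up case the replicated effect is half the original one, so ER falls in the 0 < ER < 1 band and ΔRI is positive, both indicating a weaker effect in the replica.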
  • 27. Reproducibility recap [diagram: WWW-2 topics and WWW-3 topics against WWW-2 runs and WWW-3 runs; each side shows an A-run (advanced), a B-run (baseline), and the Effect between them]
  • 28. Reproducibility Results: p-values and Effects over a Baseline • Recall that there is no target original run; • Reproducibility is even harder than replicability!
  • 29. TALK OUTLINE • Chinese subtask • English subtask • CENTRE • Summary • NTCIR-16 WWW-4
  • 30. Summary • Chinese subtask (only 3 teams) Best run: RUCIR-C-CD-NEW-4 • English subtask (9 teams) Best runs: KASYS-E-CO-NEW-{1,4} and mpii-E-CO-NEW-1. KASYS uses a BERT-based method from [Yilmaz+ EMNLP2019]. • CENTRE: We need a community effort since replicability and reproducibility are very tough problems!
  • 31. Thank you participants! And many thanks to the NTCIR PC chairs, GCs, and staff!
  • 32. TALK OUTLINE • Chinese subtask • English subtask • CENTRE • Summary • NTCIR-16 WWW-4
  • 33. WWW will be back (IF our task proposal is accepted) • English subtask only • New English corpus! (Common Crawl?) • New target for replicability, reproducibility, and a baseline for progress: University of Tsukuba’s BERT-based run from WWW-3 • Topics to be released in October 2021 • Run submission deadline in November 2021 • Please follow @ntcirwww on Twitter!
  • 34. Selected references • [Breuer+ SIGIR2020] How to Measure the Reproducibility of System-oriented IR Experiments, ACM SIGIR 2020. (More about CENTRE) • [Sakai+ TOIS2020] Retrieval Evaluation Measures that Agree with Users' SERP Preferences: Traditional, Preference-based, and Diversity Measures, ACM TOIS 39(2), to appear, 2020. (Evaluation measures including iRBU) • [Yilmaz+ EMNLP2019] Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval, EMNLP 2019. (University of Tsukuba’s top run is based on this)