- The document summarizes the Birch search model proposed by researchers at the University of Tsukuba for the NTCIR WWW-3 ad-hoc search task. Birch applies a BERT model trained on other question answering and tweet search datasets to estimate sentence-level relevance between queries and documents. It then combines the BERT relevance scores with BM25 scores to rank documents.
- The researchers achieved the best performance on the WWW-3 task using variants of the Birch model fine-tuned on the MS MARCO and TREC MB datasets. They also replicated the THUIR runs from the previous NTCIR WWW-2 task to check consistency of results across models and experiments. The replication was successful against the original WWW-2 qrels, but the THUIR run was not reproduced under the WWW-3 official results.
NTCIR-15 WWW-3 KASYS poster
KASYS at the NTCIR-15 WWW-3 Task
Kohei Shinden, Atsuki Maruta and Makoto P. Kato (University of Tsukuba)
• NTCIR WWW-3 Task
‒ Ad-hoc search tasks for Web documents
• Proposed search model using BERT (Birch)
‒ Yilmaz et al: Cross-Domain Modeling of Sentence-level Evidence for
Document Retrieval, EMNLP 2019
‒ BERT has been successfully applied to a broad range of NLP tasks
including document ranking tasks
[Figure: Overview of the NTCIR WWW-3 task. Input queries (e.g., "Shinjuku", "Pet Pig", "Halloween Pictures") are fed to a search system over the ClueWeb12 collection (730 million Web documents); the output is ranked documents (e.g., "THE 10 BEST Cheap Hotels in Tokyo...", "Keeping and Caring for Potbellied Pigs...", "Halloween Picture Collection 30..."), evaluated with nDCG over 80 topics.]
• Applying sentence-level relevance learned from QA and
microblog search datasets to ad-hoc search
‒ Estimate the relevance between the query and each sentence in the document
[Figure: Birch scoring example. A BERT model fine-tuned on microblog (MB) data estimates the relevance of each sentence in a document to the query "Halloween Pictures" (e.g., "Trick or Treat..." 0.7, "Children get candy..." 0.3, "Using pumpkin..." 0.1). The sentence-level BERT scores from the sentence-level relevance judgments model are combined with the document's BM25 score (0.4) into a BERT + BM25 score (0.6).]
• Problem: BERT cannot handle text at the document level
‒ Solution: Use query–sentence relevance
• Problem: Little query–sentence relevance data in ad-hoc search tasks
‒ Solution: Use models learned on other tasks
Test collections used
• MS MARCO: QA task
• TREC CAR: complex QA task
• TREC MB: tweet search task
• A typical web document exceeds
BERT's maximum input
sequence length of 512 tokens.
• Linear sum of the BM25 score and the scores of the highest
BERT-scoring sentences in the document
‒ Assuming that the most relevant sentences in a document are
good indicators of document-level relevance
• BERTi : the BERT score of the sentence with the i-th highest BERT score in the document
• wi : interpolation weight hyperparameter
[1] Yilmaz et al: Cross-Domain Modeling of Sentence-level
Evidence for Document Retrieval, EMNLP 2019
Score = BM25 + \sum_{i=1}^{k} w_i \cdot \mathrm{BERT}_i
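The scoring scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name and the example scores (taken from the figure: BM25 = 0.4, top sentence BERT score = 0.7) are assumptions for demonstration.

```python
# Birch-style score interpolation: document score = BM25 plus a weighted sum
# of the top-k sentence-level BERT scores. The weights w_i are hyperparameters.

def birch_score(bm25_score, sentence_bert_scores, weights):
    """Combine a document's BM25 score with its top-k sentence BERT scores."""
    # Take the k highest-scoring sentences, where k = len(weights)
    top_k = sorted(sentence_bert_scores, reverse=True)[:len(weights)]
    return bm25_score + sum(w * s for w, s in zip(weights, top_k))

# Example with k = 1 and w_1 = 0.5: score = 0.4 + 0.5 * 0.7 = 0.75
print(birch_score(0.4, [0.7, 0.3, 0.1], [0.5]))
```

With k = 1 this reduces to the "score of the highest BERT-scoring sentence" case described above.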
KASYS for NEW Runs
Birch (Yilmaz et al, 2019)
Key Ideas in Birch
Details of Birch
Experimental Results
Achieved the best performance in terms of nDCG, Q, and iRBU
• KASYS-E-CO-NEW-1 (Fine-tuning with MS MARCO & TREC MB)
• KASYS-E-CO-NEW-4 (Fine-tuning with MS MARCO & TREC MB)
• KASYS-E-CO-NEW-5 (Fine-tuning with TREC CAR & TREC MB)
KASYS for REP Runs
Replicating and reproducing the THUIR runs
at the NTCIR-14 WWW-2 Task
Checking whether the results are consistent across models in each experiment
[Figure: THUIR reported BM25 < LambdaMART (learning-to-rank model); does the same relation hold for KASYS (ours)?]
[Figure: Replication pipeline, step 1. Input queries (e.g., "disney", "switch", "Canon") are run against the ClueWeb collection (a dataset of about 730 million English web pages) and ranked by the BM25 algorithm, producing ranked web documents (e.g., 1st "Disney shop", 2nd "Tokyo Disney resort", 3rd "Disney official"). A feature-extracting program then extracts eight features per query–document pair, using tf, idf, document length, BM25, and LMIR scores. LambdaMART takes over from here.]
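The feature-extraction step can be illustrated with a toy sketch. This is not THUIR's or KASYS's actual code: the function name, the BM25 parameters (k1 = 1.2, b = 0.75), and the reduced feature set are illustrative assumptions; the real runs extract eight features including LMIR scores.

```python
# Toy learning-to-rank feature extraction for one query-document pair:
# term frequency, inverse document frequency, document length, and BM25.
import math

def extract_features(query_terms, doc_terms, corpus):
    """Return [tf, idf, doc_length, bm25] summed over query terms."""
    N = len(corpus)                                  # documents in the corpus
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    k1, b = 1.2, 0.75                                # common BM25 parameters
    tf = idf = bm25 = 0.0
    for t in query_terms:
        f = doc_terms.count(t)                       # term frequency in this doc
        df = sum(1 for d in corpus if t in d)        # document frequency
        t_idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf += f
        idf += t_idf
        bm25 += t_idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return [tf, idf, float(len(doc_terms)), bm25]
```

Each ranked document from the BM25 stage would be passed through a function like this to build the learning-to-rank training data.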
[Figure: Replication pipeline, step 2. For the WWW-2 and WWW-3 topics (e.g., "honda", "Pokemon", "ice age"), the feature extraction program outputs query–document features (e.g., "qid:001 1:0.2 ...", "qid:001 1:0.5 ..."). LambdaMART, trained and validated on the MQ Track and WWW-1 test collections, re-ranks the web documents (e.g., 1st "Disney official", 2nd "Disney shop", 3rd "Tokyo Disney resort"), and the re-ranking is evaluated.]
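The feature lines in the figure follow the common SVMrank-style input format used by learning-to-rank toolkits such as LambdaMART implementations. A small sketch of serializing one query–document pair into that format (the label and feature values here are made up for illustration):

```python
# Serialize a query-document pair as "<label> qid:<id> 1:<f1> 2:<f2> ...",
# the line format consumed by SVMrank-style learning-to-rank tools.

def to_svmrank_line(label, qid, features):
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats}"

print(to_svmrank_line(1, "001", [0.2, 0.5]))  # 1 qid:001 1:0.2 2:0.5
```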
Replication Procedure
Implementation Details
Experimental Results
• Features for learning to rank
• TF, IDF, TF-IDF, document length, BM25 score, and language-model-based IR scores
• Differences from the original run (THUIR)
• While THUIR extracted the features from four fields (whole
document, anchor text, title, URL),
we extracted them from the whole document only
• Features were normalized by their maximum and minimum values, because
feature normalization was not described in the THUIR paper
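The min-max normalization used in place of THUIR's unspecified scheme can be sketched as follows; scaling each feature column to [0, 1] per query is an assumption of this illustration.

```python
# Min-max normalization of one feature column: scale to [0, 1] using the
# column's minimum and maximum values.

def min_max_normalize(column):
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant feature: map everything to 0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

print(min_max_normalize([0.2, 0.5, 0.1, 0.9]))  # approximately [0.125, 0.5, 0.0, 1.0]
```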
[Figure: nDCG@10, Q@10, and nERR@10 of our runs (Ours) vs. the original THUIR runs (Original) for BM25, AdaRank, LambdaMART, and Coordinate Ascent.]
• Original WWW-2 qrels: successfully replicated the THUIR runs
• WWW-3 official results: KASYS seems to have failed to replicate and reproduce the THUIR run