- The document evaluates how well traditional and document preference-based IR evaluation measures align with users' search engine results page (SERP) preferences.
- The best document preference-based measures, wpref5 and wpref6, had a mean agreement rate of 78% with human judges, comparable to the median judge. However, they performed significantly worse than the best human judge, who achieved an 82% agreement rate.
- The best overall measures were nDCG and iRBU, with a mean agreement rate of 80%, comparable to the best human judge. These measures performed as well as or better than most human judges.
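The agreement rates above can be understood as the fraction of SERP pairs where a measure's preference (the SERP it scores higher) matches the human judge's preference. A minimal sketch of that computation, with made-up scores and preference judgments purely for illustration (the function name, data, and values are assumptions, not from the paper):

```python
def agreement_rate(measure_scores, human_prefs):
    """Fraction of SERP pairs where the measure's preferred SERP
    (the one with the higher score) matches the human-preferred SERP.

    measure_scores: dict mapping SERP id -> measure score (e.g., an nDCG value)
    human_prefs: list of (serp_a, serp_b, human_preferred_serp) tuples
    """
    agree = 0
    for a, b, preferred in human_prefs:
        # The measure "prefers" whichever SERP it assigns the higher score
        measure_pref = a if measure_scores[a] >= measure_scores[b] else b
        agree += (measure_pref == preferred)
    return agree / len(human_prefs)

# Toy example: hypothetical measure scores and human preference judgments
scores = {"serp1": 0.82, "serp2": 0.64, "serp3": 0.91}
prefs = [
    ("serp1", "serp2", "serp1"),  # human prefers the higher-scored SERP
    ("serp1", "serp3", "serp3"),  # human prefers the higher-scored SERP
    ("serp2", "serp3", "serp2"),  # human disagrees with the measure
]
print(agreement_rate(scores, prefs))  # agrees on 2 of 3 pairs
```

A measure with an 80% rate on this metric would match the human-preferred SERP in 4 of every 5 judged pairs, which is the sense in which nDCG and iRBU are said to rival the best individual judge.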