Good Evaluation Measures based on Document Preferences
Tetsuya Sakai and Zhaohao Zeng
Waseda University, Japan
tetsuya@waseda.jp
sakailab.com
https://waseda.box.com/sigir2020preprint
SIGIR 2020 online
#StayHome
RQ: How do traditional and document
preference-based IR measures align with
users’ SERP preferences?
• Agreement Rate (AR) = the proportion of users who agree with
the measure’s decision on which SERP is better
• Best document preference-based measures: wpref5
and wpref6 (Mean AR: 78% - as good as the
“median” human judge). But they statistically
significantly underperform the “best” human judge
(Mean AR: 82%).
• Best overall measures: nDCG and iRBU (Mean AR:
80% - statistically comparable to the “best”
human judge).
TAKEAWAYS
OUTLINE
1. Motivation
2. Data
3. Measures
4. Results
5. Takeaways and Downloads
SPONSORED AD
Some say we should move from absolute
graded relevance assessments to pairwise
preference assessments because...
• It is difficult to pre-define relevance grades.
• Assessor burden increases with #grades.
• Preferences can be used directly for learning-to-rank.
Assessment cost with preferences is quadratic in the number of
documents, but some have explored methods to reduce it.
See [Carterette+08ECIR] etc.
See [Bashir+13SIGIR, Hui/Berberich17ICTIR, Radinsky/Ailon11WSDM] etc.
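For instance (illustrative numbers, not from the slides): exhaustively judging all pairs among n pooled documents requires n(n-1)/2 comparisons, so 100 documents already yield 4,950 pairwise judgements, versus 100 absolute judgements.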
Few studies on preference-based
evaluation measures
[Frei/Schauble91IPM] Set retrieval measures, not
ranked retrieval ones; their measures do not provide
an absolute score to a SERP.
[Carterette/Bennett08SIGIR] wpref (nDCG-like),
appref (Average Precision-like) etc.
But do preference-based measures actually align with users’ SERP
preferences? Measures are used as surrogates of user satisfaction
etc., so we should check that they do!
Research Question in terms of
Agreement Rate
How do preference-based measures compare to
traditional measures and to human judges in terms
of Mean Agreement Rate (MAR)?
(Example: for a pair SERP1 vs. SERP2, the measure says “LEFT is better!”; of the judges shown (5 here, though actually there are 15), 3 say LEFT and 2 say RIGHT, so AR = 3/5 = 60%.)
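Below is a minimal Python sketch (not the authors' code) of how AR and MAR could be computed; the function and variable names are illustrative assumptions.

  def agreement_rate(measure_vote, judge_votes):
      # AR: the proportion of judges who agree with the measure's decision
      # on which SERP of the pair is better ("LEFT" or "RIGHT").
      return sum(v == measure_vote for v in judge_votes) / len(judge_votes)

  def mean_agreement_rate(measure_votes, judge_votes_per_pair):
      # MAR: AR averaged over all SERP pairs (n=894 in this study).
      ars = [agreement_rate(m, votes)
             for m, votes in zip(measure_votes, judge_votes_per_pair)]
      return sum(ars) / len(ars)

  # The example above: the measure says LEFT and 3 of 5 judges agree.
  print(agreement_rate("LEFT", ["LEFT", "LEFT", "LEFT", "RIGHT", "RIGHT"]))  # 0.6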
OUTLINE
1. Motivation
2. Data
3. Measures
4. Results
5. Takeaways and Downloads
SPONSORED AD
NTCIR-9 INTENT Japanese subtask
[Sakai/Song13IRJ]
• 100 diversified search topics, with intent probabilities (not available in TREC diversity topics)
• Qrels (diversity) with intentwise graded relevance (L0-L4)
• 15 runs
• Target corpus: clueweb09-JA
SERP preferences from
[Sakai/Zeng19SIGIR]
(Diagram summary)
• Starting from the NTCIR-9 INTENT task data (100 topics, diversity qrels, 15 runs), 43 topics were used.
• Topicwise graded relevance (L0-L4) was derived from the intentwise qrels to form adhoc qrels.
• SERPs of the 15 runs were paired: 894 SERP pairs x 15 judges, where 9+ judges agreed as to which SERP is more relevant.
• Both relevance and diversity SERP preferences were collected; we (SIGIR 2020) use the SERP relevance preferences.
Document preferences
(3 assessors per doc pair)
(Diagram summary)
• NTCIR-9 INTENT task: 100 topics, Qrels (diversity), 15 runs.
• Sakai/Zeng SIGIR 2019: 43 topics, Qrels (adhoc), SERP pairs from the 15 runs; 894 SERP pairs x 15 judges.
• Sakai/Zeng SIGIR 2020 (this study): document pairs from the same data; 119,646 document preferences.
• Traditional measures are computed from the (adhoc) qrels; preference-based measures are computed from the document preferences.
OUTLINE
1. Motivation
2. Data
3. Measures
4. Results
5. Takeaways and Downloads
SPONSORED AD
Traditional measures based on
absolute graded relevance
• nDCG [Jarvelin/Kekalainen02TOIS]
• Normalised Cumulative Utility [Sakai/Robertson08EVIA][Sakai/Zeng19SIGIR]: combines an abandoning probability distribution over ranks with the SERP utility at the abandoned rank.
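For reference, here is a minimal sketch of nDCG under one common gain/discount formulation (gains taken as relevance levels L0-L4 mapped to 0-4, and a log2(rank+1) discount); the exact variant used in the paper may differ.

  import math

  def dcg(gains, cutoff=10):
      # Gain at rank r discounted by log2(r + 1).
      return sum(g / math.log2(r + 1)
                 for r, g in enumerate(gains[:cutoff], start=1))

  def ndcg(gains, all_relevant_gains, cutoff=10):
      # Normalise by the DCG of an ideal ranking of the relevant documents.
      ideal = dcg(sorted(all_relevant_gains, reverse=True), cutoff)
      return dcg(gains, cutoff) / ideal if ideal > 0 else 0.0

  print(ndcg([4, 0, 2, 1, 0], [4, 4, 2, 2, 1]))  # SERP gains vs. all relevant docs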
Preference-based measures: Pref-measures
Variants of [Carterette+08ECIR]
[Carterette/Bennett08SIGIR]
discount based on ↓          | binary relevance | graded relevance
rank of preferred doc i      | wpref1           | wpref4
rank of unpreferred doc j    | wpref2           | wpref5
average ranks of i and j     | wpref3           | wpref6
More measures, which utilise preference redundancies, are
discussed in the paper.
(Diagram: a SERP containing doc i and doc j.)
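To make the table concrete, here is a rough, hypothetical sketch of a wpref-style score in Python. It assumes a log2 discount based on the rank of the unpreferred doc j (the wpref2/wpref5 row) and normalisation by the total available credit; the actual definitions in [Carterette+08ECIR][Carterette/Bennett08SIGIR] and in our paper differ in their details.

  import math

  def wpref_sketch(serp, prefs, cutoff=10):
      # serp:  ranked list of doc ids (top `cutoff` ranks considered)
      # prefs: (preferred doc i, unpreferred doc j, weight) triples, where the
      #        weight could encode graded preference strength (1.0 = binary).
      rank = {d: r for r, d in enumerate(serp[:cutoff], start=1)}
      credit = total = 0.0
      for i, j, w in prefs:
          if j not in rank:                       # discount uses the rank of doc j
              continue
          disc = w / math.log2(rank[j] + 1)
          total += disc
          if i in rank and rank[i] < rank[j]:     # the SERP satisfies this preference
              credit += disc
      return credit / total if total > 0 else 0.0

  # "a" preferred over "b" (satisfied), "d" preferred over "c" (not satisfied):
  print(wpref_sketch(["a", "b", "c", "d"], [("a", "b", 1.0), ("d", "c", 1.0)]))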
Explicit and implicit preferences
• wEpref: use explicit
preferences only
• wIpref: use implicit
preferences only
• wpref: use both types
(Diagram: a SERP of length L=10 and the collected document preferences, i.e., preferred-doc/unpreferred-doc pairs; 119,646/43 ≈ 2,782 preferences per topic on average. Some documents in the preference pairs are not in the SERP.)
Preference-based measures:
Δ-measures
1. Derive continuous
graded relevance from
preferences
2. Compute existing
graded-relevance
measures
(Diagram: document preference pairs, i.e., preferred-doc/unpreferred-doc pairs, are converted into documents with absolute graded relevance.)
Grade based on
how often a doc is preferred over another
minus
how often another is preferred over it
(Δ-order)
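A minimal sketch of this Δ-order idea (how the resulting scores are then mapped to grades, and how unjudged documents are handled, may differ from the paper):

  from collections import defaultdict

  def delta_scores(prefs):
      # score(d) = (#times d is preferred over some doc)
      #          - (#times some other doc is preferred over d)
      wins, losses = defaultdict(int), defaultdict(int)
      for preferred, unpreferred in prefs:
          wins[preferred] += 1
          losses[unpreferred] += 1
      return {d: wins[d] - losses[d] for d in set(wins) | set(losses)}

  # A preferred over B twice, B preferred over C once:
  print(delta_scores([("A", "B"), ("A", "B"), ("B", "C")]))  # A: 2, B: -1, C: -1
  # These continuous scores then play the role of graded relevance in
  # measures such as nDCG.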
OUTLINE
1. Motivation
2. Data
3. Measures
4. Results
5. Takeaways and Downloads
SPONSORED AD
FYI: this presentation uses an evaluation method that differs
from the one in the paper
In our paper...
1. We decide on a single gold SERP preference for each pair by majority voting (did all 15 judges agree, or only 9? This information is lost).
2. We compute one Kendall’s tau value from the numbers of concordant and discordant preferences (with 95% CIs).
In this presentation...
1. We compute the MAR over the SERP pairs (n=894) for each measure and each “human” measure, i.e., the proportion of users the measure agrees with on average.
2. We discuss statistical significance based on the paired Tukey HSD test [Sakai18book], with a familywise error rate of 5%.
MARs (top measures only): [charts omitted in this text version]
MARs (the rest): [charts omitted]
• The Δ-measures underperform the best wpref measures on average.
• The wEpref measures substantially underperform the wIpref measures.
Main findings (1)
• w(I)pref5 and w(I)pref6 are the most promising
preference-based measures: graded relevance works, and using the
rank of the unpreferred doc j is recommended (though the
differences are not statistically significant).
• wpref5 and wIpref5 statistically significantly outperform
wEpref5, and so on: using implicit preferences is very important
(not surprising, since there are far more implicit preferences
than explicit ones).
Main findings (2)
• BUT even the best preference-based measures underperform the
best traditional measures (nDCG, iRBU, RBP, Q) on average
(78% vs. 79-81%), and statistically significantly underperform
the “best” judge (82%).
• These traditional measures are statistically
indistinguishable from the best assessor. In
particular, nDCG and iRBU statistically
significantly outperform 19 and 16 other
measures, respectively.
Discussion (1)
• It’s remarkable that measures like nDCG and iRBU are comparable
to the “best” judge, and that nDCG, iRBU, wpref5, and wpref6
outperform the “median” judge on average. These measures are at
least as good as an “average person”!
• But the best preference-based measures are not quite as good as
the “best” judge: can we design preference-based measures that
perform as well as nDCG and iRBU?
• Preference-based approaches also need to overcome the
assessment cost problem (see paper).
Discussion (2)
• iRBU from [Sakai/Zeng19SIGIR] (stolen from the diversity
measure RBU [Amigo+18SIGIR]) performs surprisingly well: encoding
graded relevance only in the abandonment probability
distribution, and not in the utility function (which ignores
document relevance in the top r), seems to work...
OUTLINE
1. Motivation
2. Data
3. Measures
4. Results
5. Takeaways and Downloads
SPONSORED AD
RQ: How do traditional and document
preference-based IR measures align with
users’ SERP preferences?
• IR evaluation measures are used for improving user
satisfaction, so we need to check that they align with what users
say.
• Best document preference-based measures: wpref5
and wpref6 (Mean AR: 78% - as good as the
“median” human judge). But they statistically
significantly underperform the “best” human judge
(Mean AR: 82%).
• Best overall measures: nDCG and iRBU (Mean AR:
80% - statistically comparable to the “best”
human judge).
TAKEAWAYS AGAIN
Do your own study with our data
(Diagram summary, repeating the data overview above: NTCIR-9 INTENT task → Sakai/Zeng SIGIR 2019 SERP preferences → Sakai/Zeng SIGIR 2020 document preferences.)
• Sakai/Zeng SIGIR 2019 data: https://waseda.box.com/SIGIR2019PACK
• Sakai/Zeng SIGIR 2020 data (this study): https://waseda.box.com/SIGIR2020PACK
References (selected)
[Amigo+18SIGIR] https://doi.org/10.1145/3209978.3210024
[Bashir+13SIGIR] https://doi.org/10.1145/2484028.2484170
[Carterette+08ECIR] http://www.cs.cmu.edu/~pbennett/papers/HereOrThere-ECIR-2008.pdf
[Carterette/Bennett08SIGIR] https://doi.org/10.1145/1390334.1390451
[Frei/Schauble91IPM] https://doi.org/10.1016/0306-4573(91)90046-O
[Hui/Berberich17ICTIR] https://doi.org/10.1145/3121050.3121095
[Jarvelin/Kekalainen02TOIS] https://doi.org/10.1145/582415.582418
[Radinsky/Ailon11WSDM] https://doi.org/10.1145/1935826.1935850
[Sakai18book] https://link.springer.com/book/10.1007/978-981-13-1199-4
[Sakai/Robertson08EVIA] http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings7/pdf/EVIA2008/07-EVIA2008-SakaiT.pdf
[Sakai/Song13IRJ] https://link.springer.com/article/10.1007/s10791-012-9208-x
[Sakai/Zeng19SIGIR] https://doi.org/10.1145/3331184.3331215
