Click Model-Based Information Retrieval Metrics

Aleksandr Chuklin* (1,2), Pavel Serdyukov (1), Maarten de Rijke (2)
(1) Yandex, Moscow, Russia
(2) ISLA, University of Amsterdam, The Netherlands
SIGIR 2013
Dublin, Ireland
* Now at Google Switzerland

§ IR Metrics Overview
§ Click Model-Based Metrics
§ Analysis of the New Metrics

Classification of IR evaluation techniques
Offline Metrics
§ Traditional: Precision, nDCG, DCG, MAP
§ Click Model-Based: uSDBN, ERR (Chapelle et al., 2009), EBU (Yilmaz et al., 2010), rrDBN, uDCM, rrDCM, uUBM
Online Experiments
§ Absolute Metrics: MaxRR, MinRR, MeanRR, UCTR, QCTR, PLC
§ Interleaving: Team-Draft Interleaving, Balanced Interleaving

Offline metrics
§ Fixed set of queries Q
§ Documents are assessed by human judges using graded relevance $R \in \{0, 1, \dots, R_{\max}\}$
$$\mathrm{SystemQuality} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{Utility}(q)$$
§ Where Utility usually has the following form:
$$\mathrm{Utility}(q) = \sum_{i=1}^{N} \mathrm{decay}_i \cdot R(\mathrm{doc}_i)$$
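As a minimal sketch of the two formulas on this slide, the snippet below computes Utility(q) and SystemQuality for an arbitrary decay function. The function names, the sample relevance grades, and the choice of the DCG discount with linear gain as the decay are assumptions made for illustration; they are not taken from the slides.

```python
from math import log2
from typing import Callable, List


def utility(relevances: List[int], decay: Callable[[int], float]) -> float:
    """Utility(q): sum over ranks i = 1..N of decay_i * R(doc_i)."""
    return sum(decay(i) * r for i, r in enumerate(relevances, start=1))


def system_quality(queries: List[List[int]], decay: Callable[[int], float]) -> float:
    """SystemQuality: mean Utility(q) over the fixed query set Q."""
    return sum(utility(rels, decay) for rels in queries) / len(queries)


def dcg_decay(i: int) -> float:
    """Assumed decay for this example: the DCG discount 1 / log2(i + 1)."""
    return 1.0 / log2(i + 1)


# Made-up relevance grades (one list per query, in ranked order).
judged_runs = [[3, 2, 0], [0, 1, 1]]
print(system_quality(judged_runs, dcg_decay))
```

Swapping `dcg_decay` for another function of the rank changes the metric without touching the aggregation over queries.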

Click metrics: DBN
Example: DBN click model
(Chapelle and Zhang, 2009)
[Figure 1 from Chapelle and Zhang (2009): the DBN used for click modeling, with nodes E_{i-1}, E_i, E_{i+1}, C_i, A_i, S_i and parameters a_u, s_u. C_i is the only observed variable.]
§ $C_i$: user clicked the i-th document
§ $E_i$: user examined the i-th document
§ $A_i$: user was attracted by the i-th document
§ $S_i$: user was satisfied by the i-th document
$$C_k = 1 \Leftrightarrow A_k = 1 \text{ and } E_k = 1$$
$$P(A_k = 1) = a_q(u_k)$$
$$P(S_k = 1 \mid C_k = 0) = 0$$
$$P(S_k = 1 \mid C_k = 1) = s_q(u_k)$$
$$E_{k+1} = 1 \Leftrightarrow E_k = 1 \text{ and } S_k = 0$$
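A short derivation, not shown on the slide, of the marginal probabilities that follow from the model equations above. It assumes that the first result is always examined, $P(E_1 = 1) = 1$, and that attraction $A_k$ is independent of examination $E_k$; both are assumptions of this sketch.

$$P(C_k = 1) = P(E_k = 1) \cdot a_q(u_k)$$
$$P(S_k = 1) = P(C_k = 1) \cdot s_q(u_k)$$
$$P(E_{k+1} = 1) = P(E_k = 1) - P(S_k = 1) = P(E_k = 1) \cdot \bigl(1 - a_q(u_k)\, s_q(u_k)\bigr)$$

The last line uses the fact that satisfaction implies a click, which in turn implies examination, so the event $S_k = 1$ is contained in the event $E_k = 1$.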

Converting click model into metric
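§ $a_q(u_k) \to a_q(R_k)$, $s_q(u_k) \to s_q(R_k)$: attractiveness and satisfaction become functions of the relevance grade $R_k$ of the document at rank k
§ Compute click probability $P(C_k = 1)$ and satisfaction probability $P(S_k = 1)$
§ Use the following equations for utility-based and effort-based (reciprocal rank) metrics (similar to Carterette, 2011):
$$\mathrm{uMetric} = \sum_{k=1}^{N} P(C_k = 1) \cdot R_k \quad \text{(utility-based)}$$
$$\mathrm{rrMetric} = \sum_{k=1}^{N} P(S_k = 1) \cdot \frac{1}{k} \quad \text{(effort-based)}$$
Implementation:
https://github.com/varepsilon/clickmodels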
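The recipe on this slide can be sketched as follows for the DBN equations shown earlier. This is an illustration only and is not the code from the linked repository: the function names, the `attract` and `satisfy` lookup tables, and their values are hypothetical, and the sketch assumes that the first result is always examined.

```python
from typing import Dict, List, Tuple


def dbn_marginals(
    relevances: List[int],
    attract: Dict[int, float],
    satisfy: Dict[int, float],
) -> Tuple[List[float], List[float]]:
    """Return per-rank P(C_k = 1) and P(S_k = 1) for one ranked list."""
    p_exam = 1.0  # assumption: the first result is always examined
    p_click: List[float] = []
    p_sat: List[float] = []
    for r in relevances:
        c = p_exam * attract[r]  # C_k = 1 iff E_k = 1 and A_k = 1
        s = c * satisfy[r]       # satisfaction is only possible after a click
        p_click.append(c)
        p_sat.append(s)
        p_exam -= s              # E_{k+1} = 1 iff E_k = 1 and S_k = 0
    return p_click, p_sat


def u_metric(relevances, attract, satisfy) -> float:
    """Utility-based metric: sum of P(C_k = 1) * R_k."""
    p_click, _ = dbn_marginals(relevances, attract, satisfy)
    return sum(c * r for c, r in zip(p_click, relevances))


def rr_metric(relevances, attract, satisfy) -> float:
    """Effort-based metric: sum of P(S_k = 1) / k."""
    _, p_sat = dbn_marginals(relevances, attract, satisfy)
    return sum(s / k for k, s in enumerate(p_sat, start=1))


# Hypothetical parameters indexed by relevance grade (0, 1, 2).
attract = {0: 0.2, 1: 0.5, 2: 0.9}
satisfy = {0: 0.0, 1: 0.4, 2: 0.8}
ranking = [2, 0, 1]
print(u_metric(ranking, attract, satisfy), rr_metric(ranking, attract, satisfy))
```

In practice the two tables would be estimated from a click log for each relevance grade rather than set by hand.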

Click model-based metrics and their underlying models
                                     Derived metric
Underlying click model               Utility-based   Effort-based
SDBN (Chapelle and Zhang, 2009)      uSDBN           ERR
DBN (Chapelle and Zhang, 2009)       EBU             rrDBN
DCM (Guo et al., 2009)               uDCM            rrDCM
UBM (Dupret and Piwowarski, 2008)    uUBM            –
Previous work:
§ ERR, uSDBN (Chapelle et al., 2009)
§ EBU (Yilmaz et al., 2010)

Evaluating the metrics
§ Correlation with other metrics
§ Correlation with click metrics
§ Correlation with interleaving
Hypothesis
Model-based metrics should be better correlated with online user
metrics.

Aspect one: comparison to other metrics
Table: TREC 2011 runs, Kendall tau correlation. Values higher than 0.9
are marked with an asterisk (boldface in the original slides).
            Precision2  DCG     ERR     uSDBN   EBU     rrDBN   uDCM    rrDCM   uUBM
Precision   0.649       0.841   0.597   0.730   0.568   0.397   0.562   0.442   0.537
Precision2  –           0.785   0.663   0.780   0.675   0.526   0.693   0.551   0.681
DCG         –           –       0.740   0.857   0.711   0.530   0.704   0.592   0.685
ERR         –           –       –       0.807   0.919*  0.754   0.902*  0.826   0.888
uSDBN       –           –       –       –       0.792   0.585   0.794   0.638   0.754
EBU         –           –       –       –       –       0.788   0.970*  0.822   0.930*
rrDBN       –           –       –       –       –       –       0.786   0.917*  0.807
uDCM        –           –       –       –       –       –       –       0.813   0.947*
rrDCM       –           –       –       –       –       –       –       –       0.841
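For readers unfamiliar with the statistic in this table, here is a minimal sketch of Kendall tau between the scores that two metrics assign to the same set of runs. It implements the simple tau-a variant, which does not correct for ties; the slides do not say which variant was used, and the scores below are made up.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both metrics order the two runs the same way
        elif sign < 0:
            discordant += 1   # the metrics disagree on this pair of runs
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up scores of four runs under two hypothetical metrics.
metric_one = [0.1, 0.2, 0.3, 0.4]
metric_two = [0.3, 0.1, 0.5, 0.6]
print(kendall_tau(metric_one, metric_two))
```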

Aspect two: absolute online metrics
Table: Pearson correlation between offline and absolute click metrics.
The two marks after a value show statistically significant difference from ERR (first mark) and EBU (second mark): ▲ higher, ▼ lower, – no significant difference.
            MaxRR      MinRR      MeanRR     UCTR       PLC
Precision   -0.117     -0.163     -0.155     0.042      -0.027
Precision2  0.026      0.093      0.075      0.092      0.094
DCG         0.178      0.243      0.237      0.163      0.245
ERR         0.378      0.471      0.469      0.199      0.399
EBU         0.374      0.467      0.464      0.198      0.397
rrDBN       0.384▲▲    0.475▲▲    0.473▲▲    0.194▼▼    0.399–▲
rrDCM       0.387▲▲    0.478▲▲    0.476▲▲    0.194▼▼    0.400–▲
uSDBN       0.322▼▼    0.412▼▼    0.407▼▼    0.206▲▲    0.370▼▼
uDCM        0.374▼▼    0.466▼▼    0.463▼▼    0.198––    0.396▼▼
uUBM        0.377–▲    0.469▼▲    0.467▼▲    0.198––    0.398–▲

Aspect three: interleaving
Excerpt from "Large Scale Validation and Analysis of Interleaved Search Evaluation", p. A:5:
       Input rankings   Interleaved rankings
                        Balanced            Team-Draft
Rank   A      B         A first   B first   AAA    BAA    ABA    ...
1      a      b         a         b         a^A    b^B    a^A
2      b      e         b         a         b^B    a^A    b^B
3      c      a         e         e         c^A    c^A    e^B
4      d      f         c         c         e^B    e^B    c^A
5      g      g         d         f         d^A    d^A    d^A
6      h      h         f         d         f^B    f^B    f^B
...
Fig. 1. Examples illustrating how Balanced and Team-Draft Interleaving combine input rankings A and B over different randomizations. Superscript for the Team-Draft interleavings (written here as ^A, ^B) indicates team membership.
Interleaving methods address these problems by merging the two rankings A and B into a single interleaved ranking I, which is presented to the user. The retrieval system observes clicks on the documents in I and attributes them to A, B, or both, depending on the origin of the document. The goal is to make the interleaving process and click attribution as "fair" as possible with respect to biases in user behavior (e.g. position bias [Joachims et al. 2007]), so that clicks in the interleaved ranking I can be interpreted as unbiased feedback for a paired comparison between A and B. The precise definition of "fair" varies for different interleaving methods, but all have the goal of equalizing the influence of biases on clicks in I for A and B. This equalization of behavioral biases is ...
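The Team-Draft columns of the figure can be reproduced with a small sketch. In each round the team that picks first adds its highest-ranked document not yet shown, then the other team does the same. In a live system the first pick of each round is decided by a coin flip; here the sequence of first picks is passed in explicitly so the example is reproducible. The function name and signature are assumptions of this sketch.

```python
from typing import List, Tuple


def team_draft(
    ranking_a: List[str], ranking_b: List[str], first_picks: str
) -> List[Tuple[str, str]]:
    """Interleave two rankings; first_picks (e.g. "ABA") names the team
    that picks first in each round. Returns (document, team) pairs."""
    rankings = {"A": ranking_a, "B": ranking_b}
    shown: List[str] = []
    teams: List[str] = []
    for first in first_picks:
        second = "B" if first == "A" else "A"
        for team in (first, second):
            # Highest-ranked document of this team that is not yet shown.
            doc = next((d for d in rankings[team] if d not in shown), None)
            if doc is not None:
                shown.append(doc)
                teams.append(team)
    return list(zip(shown, teams))


ranking_a = list("abcdgh")  # input ranking A from the figure
ranking_b = list("beafgh")  # input ranking B from the figure
for picks in ("AAA", "BAA", "ABA"):
    print(picks, team_draft(ranking_a, ranking_b, picks))
```

Clicks on documents tagged "A" or "B" are then credited to the corresponding system.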

Interleaving vs. offline metrics
§ 10 Team-Draft Interleaving Experiments ∆i AB.
§ For each experiment compute $\mathrm{TdiSignal} = \frac{\mathrm{Win}_B}{\mathrm{Win}_A + \mathrm{Win}_B} - \frac{1}{2}$
§ Judged query-document pairs matched against click log giving set of queries Q ($|Q| \sim 10^2 \dots 10^3$); some documents may be unjudged (up to #unjudged docs per query)
§ For each metric compute:
$$\mathrm{MetricSignal} = \frac{1}{|Q'|} \sum_{q \in Q'} \bigl(\mathrm{Metric}_B(q) - \mathrm{Metric}_A(q)\bigr),$$
where $Q' = \{q \in Q \mid \mathrm{Metric}_B(q) \neq \mathrm{Metric}_A(q)\}$
§ Compare MetricSignal to TdiSignal using Pearson correlation (similar to Radlinski and Craswell, 2010)
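The protocol on this slide can be sketched in a few functions. The names and the sample numbers are made up for illustration, and returning 0.0 when no query separates the two systems is an assumption of this sketch, since the slide does not cover that case.

```python
from math import sqrt
from typing import Sequence


def tdi_signal(wins_a: int, wins_b: int) -> float:
    """TdiSignal = Win_B / (Win_A + Win_B) - 1/2 for one experiment."""
    return wins_b / (wins_a + wins_b) - 0.5


def metric_signal(metric_a: Sequence[float], metric_b: Sequence[float]) -> float:
    """Mean of Metric_B(q) - Metric_A(q) over queries where the two differ."""
    diffs = [b - a for a, b in zip(metric_a, metric_b) if b != a]
    return sum(diffs) / len(diffs) if diffs else 0.0  # assumption: 0.0 if Q' is empty


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# One made-up experiment: per-query metric values for systems A and B.
print(tdi_signal(wins_a=40, wins_b=60))
print(metric_signal([0.5, 0.3, 0.2], [0.7, 0.3, 0.1]))
```

With one (MetricSignal, TdiSignal) pair per experiment, `pearson` over the ten pairs gives the correlation that is compared across metrics.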

Interleaving vs. offline metrics
[Figure: Simple Metrics. Correlation with interleaving (y-axis) against #unjudged (x-axis, 0 to 10) for Precision, Precision2, DCG, uSDBN, ERR, EBU, rrDBN, uDCM, rrDCM, uUBM. Unjudged documents considered irrelevant.]

Making use of unjudged documents
[Figure: Condensed Metrics. Correlation with interleaving (y-axis) against #unjudged (x-axis, 0 to 10) for Precision, Precision2, DCG, uSDBN, ERR, EBU, rrDBN, uDCM, rrDCM, uUBM. Method by Sakai, T. Alternatives to Bpref. SIGIR 2007: unjudged documents skipped (result page is condensed).]
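The two ways of handling unjudged documents compared in these figures can be sketched as follows. The function names, document identifiers, and judgments are made up for illustration.

```python
from typing import Dict, List


def as_irrelevant(ranking: List[str], judgments: Dict[str, int]) -> List[int]:
    """Simple variant: an unjudged document gets relevance grade 0."""
    return [judgments.get(doc, 0) for doc in ranking]


def condensed(ranking: List[str], judgments: Dict[str, int]) -> List[int]:
    """Condensed variant: unjudged documents are skipped, so the judged
    documents below them move up in rank."""
    return [judgments[doc] for doc in ranking if doc in judgments]


ranking = ["d1", "d2", "d3", "d4"]
judgments = {"d1": 2, "d3": 1, "d4": 0}  # d2 has no judgment
print(as_irrelevant(ranking, judgments))
print(condensed(ranking, judgments))
```

Either list of grades can then be fed to any rank-based metric; the only difference is which rank each judged document is scored at.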

Thresholds
§ Modify offline metric usage protocol. Introduce a threshold δ:
$$\mathrm{MetricSignal} = \frac{1}{|Q_\delta|} \sum_{q \in Q_\delta} \bigl(\mathrm{Metric}_B(q) - \mathrm{Metric}_A(q)\bigr),$$
where $Q_\delta = \{q \in Q \mid |\mathrm{Metric}_B(q) - \mathrm{Metric}_A(q)| > \delta\}$
§ Choose a threshold to maximize correlation with interleaving
§ Use 5 experiments to tune thresholds and the other 5 experiments to test. Repeat for each possible 5/5 split (total $C_{10}^{5} = 252$ splits)
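A minimal sketch of the thresholded signal and of the enumeration of 5/5 splits described on this slide. The names and sample differences are made up, and returning 0.0 when no query passes the threshold is an assumption of this sketch. The tuning step itself (picking the δ that maximizes correlation on the 5 tuning experiments) is left out.

```python
from itertools import combinations
from math import comb
from typing import List, Sequence, Tuple


def thresholded_signal(diffs: Sequence[float], delta: float) -> float:
    """Mean of Metric_B(q) - Metric_A(q) over queries with |difference| > delta."""
    kept = [d for d in diffs if abs(d) > delta]
    return sum(kept) / len(kept) if kept else 0.0  # assumption: 0.0 if nothing is kept


def five_five_splits(n_experiments: int = 10) -> List[Tuple[Tuple[int, ...], Tuple[int, ...]]]:
    """All ways to choose 5 tuning experiments; the rest are used for testing."""
    experiments = range(n_experiments)
    return [
        (tune, tuple(e for e in experiments if e not in tune))
        for tune in combinations(experiments, 5)
    ]


diffs = [0.5, -0.01, 0.02, -0.3]  # made-up per-query differences
print(thresholded_signal(diffs, delta=0.05))
print(len(five_five_splits()), comb(10, 5))
```

With `delta=0` the function keeps every query on which the two systems differ, which is the unthresholded MetricSignal.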

Thresholds
[Figure: Thresholded Metrics. Correlation with interleaving (y-axis) against #unjudged (x-axis, 0 to 10) for Precision, Precision2, DCG, uSDBN, ERR, EBU, rrDBN, uDCM, rrDCM, uUBM.]

Thresholds+condensation
[Figure: Thresholded Condensed Metrics. Correlation with interleaving (y-axis) against #unjudged (x-axis, 0 to 10) for Precision, Precision2, DCG, uSDBN, ERR, EBU, rrDBN, uDCM, rrDCM, uUBM.]

All in one
[Figure: four panels (Simple Metrics, Condensed Metrics, Thresholded Metrics, Thresholded Condensed Metrics). Each plots correlation with interleaving (y-axis) against #unjudged (x-axis, 0 to 10) for Precision, Precision2, DCG, uSDBN, ERR, EBU, rrDBN, uDCM, rrDCM, uUBM.]

Summary
§ A recipe for turning a click model into a metric
§ Two families of metrics: utility-based and effort-based
§ Multi-aspect analysis of the metrics

Key results
§ Effort-based metrics are substantially different from
utility-based ones, even when based on the same user model
§ Model-based metrics show better agreement with
interleaving and deal better with unjudged documents
§ Using techniques such as condensation and thresholding we
can improve agreement with interleaving

What’s next?
§ Judging snippets. Drop the assumption, made by the click
model-based metrics, that snippet attractiveness is a function
of document relevance
§ Good abandonments. Modify any evaluation metric by
adding gain from the snippets that contain an
answer to the user's information need

Bibliography
B. Carterette. System effectiveness, user models, and user utility:
a conceptual framework for investigation. In SIGIR, 2011.
O. Chapelle and Y. Zhang. A dynamic Bayesian network click
model for web search ranking. In WWW. ACM, 2009.
O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected
reciprocal rank for graded relevance. In CIKM. ACM, 2009.
G. Dupret and B. Piwowarski. A user browsing model to predict
search engine click data from past observations. In SIGIR. ACM,
2008.
F. Guo, C. Liu, and Y. Wang. Efficient multiple-click models in
web search. In WSDM. ACM, 2009.
F. Radlinski and N. Craswell. Comparing the sensitivity of
information retrieval metrics. In SIGIR. ACM, 2010.
E. Yilmaz, M. Shokouhi, N. Craswell, and S. Robertson. Expected
browsing utility for web search evaluation. In CIKM. ACM, 2010.