Click Model-Based Information Retrieval Metrics

Slides from SIGIR 2013 talk. The full paper can be found here: http://staff.science.uva.nl/~mdr/Publications/sigir2013-metrics.pdf

ABSTRACT: In recent years many models have been proposed that are aimed at predicting clicks of web search users. In addition, some information retrieval evaluation metrics have been built on top of a user model. In this paper we bring these two directions together and propose a common approach to converting any click model into an evaluation metric. We then put the resulting model-based metrics as well as traditional metrics (like DCG or Precision) into a common evaluation framework and compare them along a number of dimensions.

One of the dimensions we are particularly interested in is the agreement between offline and online experimental outcomes. It is widely believed, especially in an industrial setting, that online A/B-testing and interleaving experiments are generally better at capturing system quality than offline measurements. We show that offline metrics that are based on click models are more strongly correlated with online experimental outcomes than traditional offline metrics, especially in situations when we have incomplete relevance judgements.

  1. Click Model-Based Information Retrieval Metrics
     Aleksandr Chuklin* (1,2), Pavel Serdyukov (1), Maarten de Rijke (2)
     1: Yandex, Moscow, Russia
     2: ISLA, University of Amsterdam, The Netherlands
     SIGIR 2013, Dublin, Ireland
     * Now at Google Switzerland
  2. § IR Metrics Overview
     § Click Model-Based Metrics
     § Analysis of the New Metrics
  5. Classification of IR evaluation techniques
     Offline metrics
       § Traditional: Precision, nDCG, DCG, MAP
       § Click model-based: uSDBN, ERR (Chapelle et al., 2009); EBU (Yilmaz et al., 2010); rrDBN; uDCM; rrDCM; uUBM
     Online experiments
       § Absolute metrics: MaxRR, MinRR, MeanRR, UCTR, QCTR, PLC
       § Interleaving: Team-Draft Interleaving, Balanced Interleaving
  6. Offline metrics
     § Fixed set of queries Q
     § Documents are assessed by human judges using graded relevance R \in \{0, 1, \ldots, R_{\max}\}
     § System quality:
       SystemQuality = \frac{1}{|Q|} \sum_{q \in Q} Utility(q)
     § where Utility usually has the following form (sketch below):
       Utility(q) = \sum_{i=1}^{N} decay_i \cdot R(doc_i)
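As a quick illustration of the utility template on the slide above, here is a minimal Python sketch (ours, not from the talk) in which the decay 1/log2(i+1) recovers a DCG-style metric; the function names and toy relevance grades are purely illustrative.

```python
import math

def utility(relevances, decay):
    """Generic offline utility: sum over ranks of decay(i) * R(doc_i)."""
    return sum(decay(i) * r for i, r in enumerate(relevances, start=1))

def dcg_decay(i):
    # DCG is one instance of the template: decay_i = 1 / log2(i + 1)
    return 1.0 / math.log2(i + 1)

def system_quality(judged_serps, decay):
    """Average the per-query utility over a fixed set of queries Q."""
    return sum(utility(rels, decay) for rels in judged_serps) / len(judged_serps)

# Toy example: graded relevance labels in {0, 1, 2} for two queries
serps = [[2, 1, 0, 0, 1], [0, 2, 2, 1, 0]]
print(system_quality(serps, dcg_decay))
```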
  7. Click metrics: DBN
     Example: DBN click model (Chapelle and Zhang, 2009)
     [Figure 1 from Chapelle and Zhang (2009): the DBN used for click modeling; C_i is the only observed variable]
     § C_i — user clicked i-th document
     § E_i — user examined i-th document
     § A_i — user was attracted by i-th document
     § S_i — user was satisfied by i-th document
     The following equations describe the model:
       C_k = 1 \iff A_k = 1 \text{ and } E_k = 1
       P(A_k = 1) = a_q(u_k)
       P(S_k = 1 \mid C_k = 0) = 0
       P(S_k = 1 \mid C_k = 1) = s_q(u_k)
       E_{k+1} = 1 \iff E_k = 1 \text{ and } S_k = 0
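The DBN equations above can be propagated down a ranked list to obtain per-rank examination, click, and satisfaction probabilities. A minimal sketch, assuming the attractiveness a_q(u_k) and satisfaction s_q(u_k) parameters are already given as plain lists (all names are ours):

```python
def dbn_probabilities(attractiveness, satisfaction):
    """P(E_k=1), P(C_k=1), P(S_k=1) at each rank under the DBN equations above.

    attractiveness[k] plays the role of a_q(u_k), satisfaction[k] of s_q(u_k).
    """
    p_exam, p_click, p_sat = [], [], []
    exam = 1.0                          # the top document is always examined
    for a, s in zip(attractiveness, satisfaction):
        click = exam * a                # C_k = 1 iff E_k = 1 and A_k = 1
        sat = click * s                 # P(S_k = 1 | C_k = 1) = s_q(u_k)
        p_exam.append(exam)
        p_click.append(click)
        p_sat.append(sat)
        exam *= 1.0 - a * s             # E_{k+1} = 1 iff E_k = 1 and S_k = 0
    return p_exam, p_click, p_sat

print(dbn_probabilities([0.9, 0.5, 0.3], [0.6, 0.4, 0.2]))
```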
  8. Converting a click model into a metric
     § Replace document-dependent parameters with relevance-dependent ones: a_q(u_k) → a_q(R_k), s_q(u_k) → s_q(R_k)
     § Compute the click probability C_i and satisfaction probability S_i
     § Use the following equations for utility-based and effort-based (reciprocal rank) metrics (similar to Carterette, 2011); see the sketch below:
       uMetric = \sum_{k=1}^{N} P(C_k = 1) \cdot R_k   (utility-based)
       rrMetric = \sum_{k=1}^{N} P(S_k = 1) \cdot \frac{1}{k}   (effort-based)
     Implementation: https://github.com/varepsilon/clickmodels
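Combining the DBN recursion with the two formulas above gives an end-to-end sketch of a model-based metric pair. This is not the authors' implementation (that is the clickmodels repository linked on the slide); the ERR-style mapping from grades to probabilities, (2^R - 1) / 2^{R_max}, is used here for both attractiveness and satisfaction purely for illustration.

```python
def grade_to_prob(r, r_max=4):
    # ERR-style mapping from a relevance grade to a probability (an illustrative choice)
    return (2 ** r - 1) / 2 ** r_max

def model_based_metrics(relevances, r_max=4):
    """uMetric and rrMetric from DBN-style probabilities, given relevance grades."""
    u_metric, rr_metric, exam = 0.0, 0.0, 1.0
    for k, r in enumerate(relevances, start=1):
        a = s = grade_to_prob(r, r_max)   # a_q(R_k) and s_q(R_k)
        p_click = exam * a                # P(C_k = 1)
        p_sat = p_click * s               # P(S_k = 1)
        u_metric += p_click * r           # utility-based: sum of P(C_k=1) * R_k
        rr_metric += p_sat / k            # effort-based: sum of P(S_k=1) / k
        exam *= 1.0 - a * s               # P(E_{k+1} = 1)
    return u_metric, rr_metric

print(model_based_metrics([2, 1, 0, 3, 0]))
```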
  9. Click model-based metrics and their underlying models

     Underlying click model               | Derived metric (utility-based) | Derived metric (effort-based)
     -------------------------------------|--------------------------------|------------------------------
     SDBN (Chapelle and Zhang, 2009)      | uSDBN                          | ERR
     DBN (Chapelle and Zhang, 2009)       | EBU                            | rrDBN
     DCM (Guo et al., 2009)               | uDCM                           | rrDCM
     UBM (Dupret and Piwowarski, 2008)    | uUBM                           | –

     Previous work:
     § ERR, uSDBN (Chapelle et al., 2009)
     § EBU (Yilmaz et al., 2010)
  10. Evaluating the metrics
      § Correlation with other metrics
      § Correlation with click metrics
      § Correlation with interleaving
      Hypothesis: model-based metrics should be better correlated with online user metrics.
  11. Aspect one: comparison to other metrics
      Table: TREC 2011 runs, Kendall tau correlation (values higher than 0.9 are marked with *).

                  Precision2  DCG    ERR    uSDBN  EBU     rrDBN  uDCM    rrDCM   uUBM
      Precision   0.649       0.841  0.597  0.730  0.568   0.397  0.562   0.442   0.537
      Precision2  –           0.785  0.663  0.780  0.675   0.526  0.693   0.551   0.681
      DCG         –           –      0.740  0.857  0.711   0.530  0.704   0.592   0.685
      ERR         –           –      –      0.807  0.919*  0.754  0.902*  0.826   0.888
      uSDBN       –           –      –      –      0.792   0.585  0.794   0.638   0.754
      EBU         –           –      –      –      –       0.788  0.970*  0.822   0.930*
      rrDBN       –           –      –      –      –       –      0.786   0.917*  0.807
      uDCM        –           –      –      –      –       –      –       0.813   0.947*
      rrDCM       –           –      –      –      –       –      –       –       0.841
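For reference, a Kendall tau value like the ones in the table can be computed from two metrics' per-run scores with scipy; the scores below are made up.

```python
from scipy.stats import kendalltau

# Hypothetical per-run scores of two metrics over the same set of TREC runs
scores_metric_a = [0.42, 0.37, 0.55, 0.29, 0.61]   # e.g. DCG per run
scores_metric_b = [0.35, 0.33, 0.50, 0.31, 0.58]   # e.g. uSDBN per run

tau, p_value = kendalltau(scores_metric_a, scores_metric_b)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```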
  12. Model-based metrics
      Hypothesis: model-based metrics should be better correlated with online user metrics.
  13. Aspect two: absolute online metrics
      Table: Pearson correlation between offline and absolute click metrics. Paired marks show statistically
      significant differences from ERR and EBU, respectively: ↑ significantly higher, ↓ significantly lower,
      ∘ no significant difference.

                  MaxRR      MinRR      MeanRR     UCTR       PLC
      Precision   -0.117     -0.163     -0.155      0.042     -0.027
      Precision2   0.026      0.093      0.075      0.092      0.094
      DCG          0.178      0.243      0.237      0.163      0.245
      ERR          0.378      0.471      0.469      0.199      0.399
      EBU          0.374      0.467      0.464      0.198      0.397
      rrDBN        0.384 ↑↑   0.475 ↑↑   0.473 ↑↑   0.194 ↓↓   0.399 ∘↑
      rrDCM        0.387 ↑↑   0.478 ↑↑   0.476 ↑↑   0.194 ↓↓   0.400 ∘↑
      uSDBN        0.322 ↓↓   0.412 ↓↓   0.407 ↓↓   0.206 ↑↑   0.370 ↓↓
      uDCM         0.374 ↓↓   0.466 ↓↓   0.463 ↓↓   0.198 ∘∘   0.396 ↓↓
      uUBM         0.377 ∘↑   0.469 ↓↑   0.467 ↓↑   0.198 ∘∘   0.398 ∘↑
  14. Aspect three: interleaving
      [Figure from "Large Scale Validation and Analysis of Interleaved Search Evaluation": examples
      illustrating how Balanced and Team-Draft Interleaving combine input rankings A and B over different
      randomizations; superscripts in the Team-Draft interleavings indicate team membership.]
      Interleaving methods merge the two rankings A and B into a single interleaved ranking I, which is
      presented to the user. The retrieval system observes clicks on the documents in I and attributes them
      to A, B, or both, depending on the origin of the document, so that clicks in the interleaved ranking
      can be interpreted as unbiased feedback for a paired comparison between A and B.
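A simplified sketch of Team-Draft Interleaving (our reconstruction of the standard algorithm, not code from the paper or the talk): the two rankings take turns contributing their highest-ranked document not yet shown, with ties broken by a coin flip, and each contributed document remembers its team.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Simplified Team-Draft Interleaving: merge rankings A and B, remembering team membership."""
    interleaved = []
    teams = []                       # teams[i] in {"A", "B"}: which ranking contributed interleaved[i]
    team_a, team_b = [], []
    while True:
        rest_a = [d for d in ranking_a if d not in interleaved]
        rest_b = [d for d in ranking_b if d not in interleaved]
        if not rest_a and not rest_b:
            break
        # The team with fewer picks goes next; ties are broken by a coin flip.
        a_turn = rest_a and (not rest_b or len(team_a) < len(team_b)
                             or (len(team_a) == len(team_b) and rng.random() < 0.5))
        if a_turn:
            doc = rest_a[0]
            team_a.append(doc)
            teams.append("A")
        else:
            doc = rest_b[0]
            team_b.append(doc)
            teams.append("B")
        interleaved.append(doc)
    return interleaved, teams

# Clicks on the interleaved list are credited to the team of the clicked document;
# the ranker whose team collects more clicks wins the comparison for that query.
print(team_draft_interleave(["a", "b", "c", "d"], ["b", "e", "a", "f"]))
```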
  15. Interleaving vs. offline metrics
      § 10 Team-Draft Interleaving experiments \Delta_i^{AB}
      § For each experiment compute
        TdiSignal = \frac{Win_B}{Win_A + Win_B} - \frac{1}{2}
      § Judged query-document pairs are matched against the click log, giving a set of queries Q
        (|Q| \sim 10^2 \ldots 10^3); some documents may be unjudged (up to #unjudged docs per query)
      § For each metric compute
        MetricSignal = \frac{1}{|Q'|} \sum_{q \in Q'} (Metric_B(q) - Metric_A(q)),
        where Q' = \{q \in Q \mid Metric_B(q) \neq Metric_A(q)\}
      § Compare MetricSignal to TdiSignal using Pearson correlation (similar to Radlinski and Craswell, 2010); see the sketch below
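A small sketch of the comparison protocol above, with entirely made-up numbers: compute TdiSignal per interleaving experiment, MetricSignal per experiment from per-query metric values, and correlate the two across experiments.

```python
from scipy.stats import pearsonr

def tdi_signal(win_a, win_b):
    # Interleaving signal for one experiment: > 0 means ranker B wins more often than A
    return win_b / (win_a + win_b) - 0.5

def metric_signal(metric_a, metric_b):
    """Average per-query metric difference over queries where the metric distinguishes A and B."""
    diffs = [b - a for a, b in zip(metric_a, metric_b) if a != b]
    return sum(diffs) / len(diffs) if diffs else 0.0

# Hypothetical experiments: (wins for A, wins for B, per-query metric values for A and for B)
experiments = [
    (120, 150, [0.4, 0.5, 0.3], [0.5, 0.5, 0.4]),
    (200, 180, [0.6, 0.2, 0.7], [0.5, 0.2, 0.6]),
    (90, 140, [0.3, 0.3, 0.5], [0.4, 0.6, 0.5]),
]
tdi = [tdi_signal(wa, wb) for wa, wb, _, _ in experiments]
met = [metric_signal(ma, mb) for _, _, ma, mb in experiments]
print(pearsonr(tdi, met))
```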
  16. Interleaving vs. offline metrics
      [Plot: correlation with interleaving vs. #unjudged documents (0–10) for the simple metrics:
      Precision, Precision2, DCG, uSDBN, ERR, EBU, rrDBN, uDCM, rrDCM, uUBM.]
      Figure: unjudged documents considered irrelevant
  17. Making use of unjudged documents
      [Plot: correlation with interleaving vs. #unjudged documents (0–10) for the condensed metrics
      (same metrics as above).]
      Figure: method by Sakai, T., "Alternatives to Bpref", SIGIR 2007: unjudged documents are skipped
      (the result page is condensed)
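Condensation itself is a one-liner; a hypothetical sketch where None marks an unjudged document:

```python
def condense(relevances):
    """Condensation (Sakai, SIGIR 2007): drop unjudged documents before scoring the ranking."""
    return [r for r in relevances if r is not None]

# Hypothetical ranking with judged grades and unjudged (None) documents
ranking = [2, None, 1, None, 0, 3]
print(condense(ranking))   # -> [2, 1, 0, 3], which is then fed to any offline metric
```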
  18. Thresholds
      § Modify the offline metric usage protocol by introducing a threshold δ:
        MetricSignal = \frac{1}{|Q_\delta|} \sum_{q \in Q_\delta} (Metric_B(q) - Metric_A(q)),
        where Q_\delta = \{q \in Q \mid |Metric_B(q) - Metric_A(q)| > \delta\}
      § Choose the threshold to maximize correlation with interleaving
      § Use 5 experiments to tune the threshold and 5 experiments to test; repeat for each possible 5/5 split
        (\binom{10}{5} = 252 splits in total); see the sketch below
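A sketch of the thresholded signal and of enumerating the 5/5 splits (names and data are ours):

```python
from itertools import combinations

def thresholded_metric_signal(metric_a, metric_b, delta):
    """MetricSignal restricted to queries where |Metric_B - Metric_A| exceeds delta."""
    diffs = [b - a for a, b in zip(metric_a, metric_b) if abs(b - a) > delta]
    return sum(diffs) / len(diffs) if diffs else 0.0

print(thresholded_metric_signal([0.40, 0.55, 0.30], [0.42, 0.50, 0.45], delta=0.03))

# All C(10, 5) = 252 ways to split ten experiments into a tuning half and a test half
splits = list(combinations(range(10), 5))
print(len(splits))   # -> 252
```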
  19. Thresholds
      [Plot: correlation with interleaving vs. #unjudged documents (0–10) for the thresholded metrics
      (same metrics as above).]
  20. Thresholds + condensation
      [Plot: correlation with interleaving vs. #unjudged documents (0–10) for the thresholded condensed
      metrics (same metrics as above).]
  21. All in one
      [Four plots side by side: correlation with interleaving vs. #unjudged documents (0–10) for the
      simple, condensed, thresholded, and thresholded condensed variants of each metric.]
  22. Summary
      § A recipe for turning a click model into a metric
      § Two families of metrics: utility-based and effort-based
      § A multi-aspect analysis of the metrics
  23. Key results
      § Effort-based metrics are substantially different from utility-based ones, even when based on the same user model
      § Model-based metrics show better agreement with interleaving and deal better with unjudged documents
      § Techniques such as condensation and thresholding further improve agreement with interleaving
  24. What's next?
      § Judging snippets: drop the assumption, made by the click model-based metrics above, that snippet attractiveness is a function of document relevance
      § Good abandonments: modify any evaluation metric by adding extra gain from snippets that already contain an answer to the user's information need
  26. Bibliography
      B. Carterette. System effectiveness, user models, and user utility: a conceptual framework for investigation. In SIGIR, 2011.
      O. Chapelle and Y. Zhang. A dynamic Bayesian network click model for web search ranking. In WWW. ACM, 2009.
      O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In CIKM. ACM, 2009.
      G. Dupret and B. Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR. ACM, 2008.
      F. Guo, C. Liu, and Y. Wang. Efficient multiple-click models in web search. In WSDM. ACM, 2009.
      F. Radlinski and N. Craswell. Comparing the sensitivity of information retrieval metrics. In SIGIR. ACM, 2010.
      E. Yilmaz, M. Shokouhi, N. Craswell, and S. Robertson. Expected browsing utility for web search evaluation. In CIKM. ACM, 2010.
