On the Reliability and Intuitiveness of Aggregated Search Metrics

1.
On the Reliability and Intuitiveness of Aggregated Search Metrics
Ke Zhou¹, Mounia Lalmas², Tetsuya Sakai³, Ronan Cummins⁴, Joemon M. Jose¹
¹University of Glasgow, ²Yahoo Labs London, ³Waseda University, ⁴University of Greenwich
CIKM 2013, San Francisco

2.
Background: Aggregated Search
• Diverse search verticals (image, video, news, etc.) are available on the web.
• Aggregating (embedding) vertical results into “general web” results has become the de facto standard in commercial web search engines.
[Figure: vertical search engines alongside general web search]

3.
Background: Aggregated Search
• Diverse search verticals (image, video, news, etc.) are available on the web.
• Aggregating (embedding) vertical results into “general web” results has become the de facto standard in commercial web search engines.
[Figure: vertical selection; vertical search engines alongside general web search]

4.
Background: Architecture of Aggregated Search
[Figure: a query enters the aggregated search system, whose components are (VS) Vertical Selection, (IS) Item Selection and (RP) Result Presentation. The query is issued to the Image vertical, Blog vertical, Wiki (Encyclopedia) vertical, ……, Shopping vertical and General Web vertical.]

5.
Motivation: Evaluating the Evaluation (Meta-evaluation)
• Aggregated Search (AS) metrics
  – model four AS compounding factors
  – differ in the way they model each factor and combine them
  – how well the metrics capture and combine those factors remains poorly understood
• Focus: we meta-evaluate AS metrics
  – Reliability: ability to detect “actual” performance differences
  – Intuitiveness: ability to capture any property deemed important (AS component)

6.

7.

8.

9.

10.
Overview

11.
Overview

12.
Factors: Compounding Factors
• (VS) Vertical Selection
• (IS) Item Selection
• (RP) Result Presentation
• (VD) Vertical Diversity
Example (A, B, C):
• VS (A>B,C): image preference
• IS (C>A,B): more relevant items
• RP (B>A,C): relevant items at top
• VD (C>A,B): diverse information

13.

14.

15.


17.
Overview

18.
Metrics
• Traditional IR
  – homogeneous ranked list
• Adapted diversity-based IR
  – treat vertical as intent
  – adapt ranked list to block-based
  – normalize by “ideal” AS page
• Aggregated Search
  – utility-effort aware framework
• Single AS component
  – VS: vertical precision
  – VD: vertical (intent) recall
  – IS: mean precision of vertical items
  – RP: Spearman’s correlation with the “ideal” AS page
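
The four single-component metrics named on this slide can be illustrated with a minimal Python sketch. The function names, the input shapes (lists of vertical names, 0/1 relevance labels per vertical, block orderings without ties) and the handling of empty inputs are assumptions made for illustration; the exact definitions used in the study may differ.

```python
from statistics import mean

def vertical_precision(selected, relevant):
    """VS: fraction of the selected verticals that are relevant to the query."""
    selected, relevant = set(selected), set(relevant)
    return len(selected & relevant) / len(selected) if selected else 0.0

def vertical_recall(selected, relevant):
    """VD: fraction of the relevant verticals (intents) that appear on the page."""
    selected, relevant = set(selected), set(relevant)
    return len(selected & relevant) / len(relevant) if relevant else 0.0

def mean_item_precision(labels_by_vertical):
    """IS: precision of the shown items per vertical, averaged over verticals.

    labels_by_vertical maps a vertical name to 0/1 relevance labels (assumed shape).
    """
    per_vertical = [sum(v) / len(v) for v in labels_by_vertical.values() if v]
    return mean(per_vertical) if per_vertical else 0.0

def spearman(page_order, ideal_order):
    """RP: Spearman's rho between block ranks on the page and on the ideal page.

    Assumes both orderings contain the same blocks and no ties.
    """
    n = len(page_order)
    if n < 2:
        raise ValueError("need at least two blocks to correlate")
    ideal_rank = {block: rank for rank, block in enumerate(ideal_order)}
    d2 = sum((rank - ideal_rank[block]) ** 2 for rank, block in enumerate(page_order))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

For example, a page showing the image and news verticals when image, video and wiki are relevant has vertical precision 1/2 and vertical recall 1/3.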

19.

20.

21.
Metrics
• Traditional IR
  – homogeneous ranked list
• Adapted diversity-based IR
  – treat vertical as intent
  – adapt ranked list to block-based
  – normalize by “ideal” AS page
• Aggregated Search
  – utility-effort aware framework
• Single AS component
  – VS: vertical precision
  – VD: vertical (intent) recall
  – IS: mean precision of vertical items
  – RP: Spearman’s correlation with the “ideal” AS page
Callout: position-discounted vs. set-based

22.
Metrics
• Traditional IR
  – homogeneous ranked list
• Adapted diversity-based IR
  – treat vertical as intent
  – adapt ranked list to block-based
  – normalize by “ideal” AS page
• Aggregated Search
  – utility-effort aware framework
• Single AS component
  – VS: vertical precision
  – VD: vertical (intent) recall
  – IS: mean precision of vertical items
  – RP: Spearman’s correlation with the “ideal” AS page
Callout: novelty vs. orientation vs. diversity

23.
Metrics
• Traditional IR
  – homogeneous ranked list
• Adapted diversity-based IR
  – treat vertical as intent
  – adapt ranked list to block-based
  – normalize by “ideal” AS page
• Aggregated Search
  – utility-effort aware framework
• Single AS component
  – VS: vertical precision
  – VD: vertical (intent) recall
  – IS: mean precision of vertical items
  – RP: Spearman’s correlation with the “ideal” AS page
Callout: position vs. user tolerance vs. cascade
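
The contrast between position-based, tolerance-based and cascade user models can be sketched with the textbook ranked-list forms of DCG, RBP and ERR. This is only an illustration of the three discounting ideas: the AS metrics in this deck are block-based and utility-effort aware, so they are not reproduced here, and the function names and the default persistence value are assumptions.

```python
import math

def dcg(gains):
    """Position-based: the gain at rank r is discounted by log2(r + 1)."""
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))

def rbp(gains, persistence=0.8):
    """Tolerance-based: the user moves on to the next rank with a fixed probability."""
    return (1 - persistence) * sum(
        g * persistence ** (r - 1) for r, g in enumerate(gains, start=1)
    )

def err(stop_probs):
    """Cascade: the user stops at the first result that satisfies them.

    stop_probs[r-1] is the probability that the result at rank r satisfies the user.
    """
    score, p_reach = 0.0, 1.0
    for r, p_stop in enumerate(stop_probs, start=1):
        score += p_reach * p_stop / r   # user reaches rank r and stops there
        p_reach *= 1 - p_stop           # user continues past rank r
    return score
```

Under the cascade model a fully satisfying first result hides everything below it, which the position-based and tolerance-based discounts do not do.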

24.
Metrics
• Traditional IR
  – homogeneous ranked list
• Adapted diversity-based IR
  – treat vertical as intent
  – adapt ranked list to block-based
  – normalize by “ideal” AS page
• Aggregated Search
  – utility-effort aware framework
• Single AS component
  – VS: vertical precision
  – VD: vertical (intent) recall
  – IS: mean precision of vertical items
  – RP: Spearman’s correlation with the “ideal” AS page
Callout: key components: VS vs. IS vs. RP vs. VD

25.
Metrics
• Traditional IR
  – homogeneous ranked list
• Adapted diversity-based IR
  – treat vertical as intent
  – adapt ranked list to block-based
  – normalize by “ideal” AS page
• Aggregated Search
  – utility-effort aware framework
• Single AS component
  – VS: vertical precision
  – VD: vertical (intent) recall
  – IS: mean precision of vertical items
  – RP: Spearman’s correlation with the “ideal” AS page
Callout: standard parameter settings [Zhou et al. SIGIR’12]
K. Zhou, R. Cummins, M. Lalmas and J.M. Jose. Evaluating aggregated search pages. In SIGIR, 115-124, 2012.

26.
Overview

27.
Experimental Setup
• Two Aggregated Search test collections
  – VertWeb’11 (classifying ClueWeb09 collection)
  – FedWeb’13 (TREC)
• Verticals
  – cover a variety of 11 verticals employed by three major commercial search engines (e.g. News, Image, etc.)
• Topics and assessments
  – reusing topics from the TREC Web and Million Query tracks
  – vertical orientation assessments (type of information)
  – topical relevance assessments of items (traditional document relevance)
• Simulated AS systems
  – implement state-of-the-art AS components
  – vary the combination of component systems to form the final AS system
  – 36 AS systems in total

28.

29.

30.

31.
Experimental Setup
• Two Aggregated Search test collections
  – VertWeb’11 (classifying ClueWeb09 collection)
  – FedWeb’13 (TREC) -> the one that we will report our experiments on
• Verticals
  – cover a variety of 11 verticals employed by three major commercial search engines (e.g. News, Image, etc.)
• Topics and assessments
  – reusing topics from the TREC Web and Million Query tracks -> 50 topics
  – vertical orientation assessments (type of information)
  – topical relevance assessments of items (traditional document relevance)
• Simulated AS systems
  – implement state-of-the-art AS components
  – vary the combination of component systems to form the final AS system
  – 36 AS systems in total

32.
Overview

33.
Methodology: Discriminative Power (Reliability)
• Discriminative power
  – reflects a metric’s robustness to variation across topics
  – measured by conducting a statistical significance test for different pairs of systems and counting the number of significantly different pairs
• Randomized Tukey’s Honestly Significant Difference (HSD) test [Carterette TOIS’12]
  – uses the observed data and computational power to estimate the distributions
  – conservative in nature
B. Carterette. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. TOIS, 30(1), 2012.


35.
Methodology: Discriminative Power (Reliability)
• Discriminative power
  – reflects a metric’s robustness to variation across topics
  – measured by conducting a statistical significance test for different pairs of systems and counting the number of significantly different pairs
• Randomized Tukey’s Honestly Significant Difference (HSD) test [Carterette TOIS’12]
  – uses the observed data and computational power to estimate the distributions
  – conservative in nature
Main idea: if the largest observed mean difference between systems is not significant, then none of the other differences should be significant either.
B. Carterette. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. TOIS, 30(1), 2012.
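
The main idea of the randomized Tukey HSD test can be sketched as a permutation procedure: scores are shuffled across systems within each topic, and every observed pairwise difference is compared against the null distribution of the largest difference between system means. The sketch below is a minimal illustration using only the standard library; the function name, the topics-by-systems list layout, the number of trials and the seed are assumptions, not details taken from the study.

```python
import random
from itertools import combinations

def randomized_tukey_hsd(scores, trials=1000, seed=0):
    """Return {(i, j): ASL} for every system pair.

    scores[t][s] is the score of system s on topic t (assumed layout).
    """
    rng = random.Random(seed)
    n_topics, n_systems = len(scores), len(scores[0])

    def system_means(matrix):
        return [sum(row[s] for row in matrix) / n_topics for s in range(n_systems)]

    observed = system_means(scores)
    pairs = list(combinations(range(n_systems), 2))
    counts = dict.fromkeys(pairs, 0)
    for _ in range(trials):
        # Under the null hypothesis the system labels are exchangeable per topic.
        shuffled = [rng.sample(row, n_systems) for row in scores]
        means = system_means(shuffled)
        largest = max(means) - min(means)
        # Every pair is judged against the largest difference, hence the conservatism.
        for i, j in pairs:
            if largest >= abs(observed[i] - observed[j]):
                counts[(i, j)] += 1
    return {pair: counts[pair] / trials for pair in pairs}
```

Because each pair is compared against the largest permuted difference rather than its own, no pair can be declared significant unless the largest observed difference is, which matches the main idea stated on the slide.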

36.

37.

38.
Results: Discriminative Power Results
• The most discriminative metrics are those closer to the origin in the figures.
• Traditional & single-component << Adapted diversity & Aggregated search
[Figure: ASL curves. Y-axis: ASL (p-value: 0 to 0.10). X-axis: run pairs sorted by ASL. ASL: Achieved Significance Level.]
Let “M1 << M2” denote “M2 outperforms M1 in terms of discriminative power.”

39.
Results: Discriminative Power Results
• The most discriminative metrics are those closer to the origin in the figures.
• Traditional & single-component << Adapted diversity & Aggregated search
[Figure: ASL curves, each curve one metric. Y-axis: ASL (p-value: 0 to 0.10). X-axis: run pairs sorted by ASL. ASL: Achieved Significance Level.]
Let “M1 << M2” denote “M2 outperforms M1 in terms of discriminative power.”

40.
Results: Discriminative Power Results
• The most discriminative metrics are those closer to the origin in the figures.
• Traditional & single-component << Adapted diversity & Aggregated search
[Figure: ASL curves in two groups: traditional IR and single-component metrics; adapted diversity and aggregated search metrics. Y-axis: ASL (p-value: 0 to 0.10). X-axis: run pairs sorted by ASL. ASL: Achieved Significance Level.]
Let “M1 << M2” denote “M2 outperforms M1 in terms of discriminative power.”
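
Given one ASL per run pair, the curve described above and a single discriminative-power number can be derived as in the following sketch. The function names, the dictionary input and the significance level of 0.05 are assumptions for illustration; the 0.10 cutoff mirrors the Y-axis range given on the slide.

```python
import math

def asl_curve(asl_by_pair, cutoff=0.10):
    """Sorted ASL values up to the cutoff, i.e. the points of one metric's curve."""
    return sorted(p for p in asl_by_pair.values() if p <= cutoff)

def discriminative_power(asl_by_pair, alpha=0.05):
    """Fraction of run pairs whose difference is significant at level alpha (assumed)."""
    return sum(p < alpha for p in asl_by_pair.values()) / len(asl_by_pair)

def number_of_run_pairs(n_systems):
    """Number of unordered system pairs compared by the test."""
    return math.comb(n_systems, 2)
```

With the 36 simulated AS systems from the experimental setup, `number_of_run_pairs(36)` gives 630 run pairs per metric.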

41.

42.

43.
Results: Discriminative Power Results (Single-component & Traditional)
[Figure: ASL curves. Y-axis: ASL (p-value). X-axis: run pairs sorted by ASL.]
VS << VD << (IS, P@10) << (nDCG, RP)
• Single-component metrics perform comparatively well.
• The RP metric is the most discriminative single-component metric.
• The VS metric is the least discriminative single-component metric.
• nDCG performs better than P@10 and other single-component metrics.

44.

45.

46.

47.

48.
Results: Discriminative Power Results (Adapted diversity & Aggregated search)
[Figure: ASL curves. Y-axis: ASL (p-value). X-axis: run pairs sorted by ASL.]
IA-nDCG << D#-nDCG << (ASRBP, α-nDCG) << ASDCG << ASERR
• AS metrics (utility-effort) are generally more discriminative than other adapted diversity metrics.
• ASERR (cascade model) outperforms ASDCG (position-based) and ASRBP (tolerance-based).
• IA-nDCG (orientation emphasized) and D#-nDCG (diversity emphasized) are the least discriminative metrics.

49.

50.

51.

52.
Overview

53.
Methodology: Concordance Test (Intuitiveness)
• Highly discriminative metrics, while desirable, may not necessarily measure everything that we may want measured.
• Understanding how each key component is captured by the metric
  – Context of AS
    • VS, VD, IS, RP

54.
Methodology: Concordance Test (Intuitiveness)
• Highly discriminative metrics, while desirable, may not necessarily measure everything that we may want measured.
• Understanding how each key component is captured by the metric
  – Context of AS
    • VS, VD, IS, RP
Key components:
(VS) Vertical Selection: select correct verticals
(VD) Vertical Diversity: promote multiple vertical results
(IS) Item Selection: select relevant items
(RP) Result Presentation: embed verticals correctly
……

55.
Methodology: Concordance Test [Sakai, WWW’12]
• Concordance test
  – computes relative concordance scores for a given pair of metrics and a gold-standard metric
  – the gold-standard metric should represent a basic property that we want the candidate metrics to satisfy
  – four simple gold-standard metrics
    • VS, VD, IS, RP
    • simple and therefore agnostic to metric differences (e.g. different position-based discounting)
[Figure: where Metric 1 and Metric 2 disagree, their concordance with the gold-standard simple metric is compared (illustrative values: 60% and 40%).]
T. Sakai. Evaluation with informational and navigational intents. In WWW, 499-508, 2012.


57.
Methodology: Concordance Test [Sakai, WWW’12]
• Concordance test
  – computes relative concordance scores for a given pair of metrics and a gold-standard metric
  – the gold-standard metric should represent a basic property that we want the candidate metrics to satisfy
  – four simple gold-standard metrics
    • VS, VD, IS, RP
    • simple and therefore agnostic to metric differences (e.g. different position-based discounting)
[Figure: where Metric 1 and Metric 2 disagree, their concordance with the gold-standard single-component simple metric is compared (illustrative values: 60% and 40%).]
T. Sakai. Evaluation with informational and navigational intents. In WWW, 499-508, 2012.
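
A concordance test of this kind can be sketched as follows: for every pair of systems and every topic on which the two candidate metrics prefer different systems, check which candidate agrees with the gold-standard metric(s). This is a minimal sketch under assumptions: the function name, the `(system, topic)`-keyed dictionaries and the treatment of gold-standard ties as agreement are illustrative choices, not confirmed details of the procedure used in the study.

```python
from itertools import combinations

def concordance(m1, m2, golds):
    """Return (concordance of m1, concordance of m2, number of disagreements).

    m1, m2 and each entry of golds map (system, topic) to a score (assumed layout).
    With several gold metrics, a candidate must agree with all of them.
    """
    systems = sorted({s for s, _ in m1})
    topics = sorted({t for _, t in m1})
    disagreements = agree1 = agree2 = 0
    for x, y in combinations(systems, 2):
        for t in topics:
            d1 = m1[(x, t)] - m1[(y, t)]
            d2 = m2[(x, t)] - m2[(y, t)]
            if d1 * d2 >= 0:
                continue  # the candidates prefer the same system, or one is tied
            disagreements += 1
            gold_deltas = [g[(x, t)] - g[(y, t)] for g in golds]
            agree1 += all(d1 * dg >= 0 for dg in gold_deltas)
            agree2 += all(d2 * dg >= 0 for dg in gold_deltas)
    if disagreements == 0:
        return 0.0, 0.0, 0
    return agree1 / disagreements, agree2 / disagreements, disagreements
```

Passing a single gold metric such as VS measures how well each candidate captures that one component; passing several (for example VS and IS) measures how well it captures them jointly.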

58.
Results: Concordance Test Results (capturing each individual key AS component)
• Concordance with VS:
  - IA-nDCG > ASRBP > ASDCG > D#-nDCG > ASERR, α-nDCG
  - The intent-aware (IA) metric (orientation emphasized) and AS metrics (utility-effort) perform best.
• Concordance with VD:
  - D#-nDCG > IA-nDCG > ASDCG, ASRBP, ASERR > α-nDCG
  - The D# (diversity emphasized) and IA (orientation emphasized) frameworks work best.
Let “M1 > M2” denote “M1 statistically significantly outperforms M2 in terms of concordance with a given gold-standard metric.”

59.

60.

61.
Results: Concordance Test Results (capturing each individual key AS component)
• Concordance with IS:
  - ASRBP, D#-nDCG > ASDCG > IA-nDCG > ASERR > α-nDCG
  - ASRBP (tolerance-based AS metric) and D# (diversity emphasized) metrics perform best.
• Concordance with RP:
  - α-nDCG > ASERR > ASDCG > ASRBP > D#-nDCG > IA-nDCG
  - α-nDCG (novelty emphasized) and ASERR (cascade AS metric) work best.
• However, α-nDCG (novelty emphasized) and ASERR (cascade AS metric) consistently perform worst with respect to VS, VD and IS.

62.

63.

64.
Results: Concordance Test Results (capturing multiple key AS components)
• Concordance with VS and IS:
  - ASRBP > D#-nDCG > ASDCG, IA-nDCG > ASERR > α-nDCG
• Concordance with VS, VD and IS:
  - D#-nDCG > ASRBP, IA-nDCG > ASDCG > ASERR > α-nDCG
• Concordance with all (VS, VD, IS and RP):
  - ASRBP > D#-nDCG > ASDCG, IA-nDCG > ASERR > α-nDCG
• ASRBP (tolerance-based AS metric) and D#-nDCG (diversity emphasized) perform best when combining all components.
• Metrics that capture key components of AS (e.g. VS) have advantages over those that do not (e.g. α-nDCG).

65.

66.

67.
Conclusions: Final take-out
• In terms of discriminative power,
  – RP is the most discriminative feature (metric) for evaluation among the four AS components.
  – AS and novelty-emphasized metrics are superior to diversity- and orientation-emphasized metrics.
• In terms of intuitiveness,
  – the tolerance-based AS metric and the diversity-emphasized metric are the most intuitive metrics with respect to all AS components.
• Overall, the tolerance-based AS metric is the most discriminative and intuitive metric.
• We propose a comprehensive approach for evaluating the intuitiveness of metrics that takes special aspects of aggregated search into account.

68.

69.

70.

71.
Future Work
• Compare with meta-evaluation results from human subjects to test the reliability of our approach and results.
• Propose a more principled evaluation framework to incorporate and combine key AS factors (VS, VD, IS, RP).
• You are welcome to participate in the TREC FedWeb 2014 task (continuation of FedWeb 2013: https://sites.google.com/site/trecfedweb/)!

72.

73.
