On the Reliability and Intuitiveness of Aggregated Search Metrics

Aggregating search results from a variety of diverse verticals such as news, images, videos and Wikipedia into a single interface is a popular web search presentation paradigm. Although several aggregated search (AS) metrics have been proposed to evaluate AS result pages, their properties remain poorly understood. In this paper, we compare the properties of existing AS metrics under the assumptions that (1) queries may have multiple preferred verticals; (2) the likelihood of each vertical preference is available; and (3) topical relevance assessments of the results returned from each vertical are available. We compare a wide range of AS metrics on two test collections. Our main criteria of comparison are (1) discriminative power, which represents the reliability of a metric in comparing the performance of systems, and (2) intuitiveness, which represents how well a metric captures the various key aspects to be measured (i.e. various aspects of a user's perception of AS result pages). Our study shows that AS metrics that capture key AS components (e.g., vertical selection) have several advantages over other metrics. This work sheds new light on the further development and application of AS metrics.


  1. On the Reliability and Intuitiveness of Aggregated Search Metrics. Ke Zhou (1), Mounia Lalmas (2), Tetsuya Sakai (3), Ronan Cummins (4), Joemon M. Jose (1). (1) University of Glasgow, (2) Yahoo Labs London, (3) Waseda University, (4) University of Greenwich. CIKM 2013, San Francisco.
  2. Background: Aggregated Search.
     • Diverse search verticals (image, video, news, etc.) are available on the web.
     • Aggregating (embedding) vertical results into "general web" results has become the de facto standard in commercial web search engines.
     (Slide figure: vertical search engines and general web search, combined via vertical selection.)
  3. Background: Architecture of Aggregated Search.
     (Slide figure: the query is issued to several vertical search engines (Image, Blog, Wiki/Encyclopedia, Shopping, ..., General Web); the aggregated search system then applies three components: (VS) Vertical Selection, (IS) Item Selection and (RP) Result Presentation.)
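To make the three-stage architecture concrete, here is a minimal sketch of such a pipeline. All names (Vertical, relevant_to, presentation_score, the example document strings) are hypothetical illustrations, not the components evaluated in the paper; the point is only the control flow: VS decides which blocks exist, IS fills them, RP orders them on the page.

    from dataclasses import dataclass

    @dataclass
    class Vertical:
        name: str                      # e.g. "image", "news", "general_web"

        def search(self, query: str) -> list[str]:
            # placeholder: a real vertical would query its own index
            return [f"{self.name}-doc-{i}" for i in range(3)]

    def relevant_to(query: str, v: Vertical) -> bool:
        return True        # hypothetical stand-in for a vertical-selection model

    def presentation_score(query: str, vertical_name: str) -> float:
        return 0.0         # hypothetical stand-in for a block-placement model

    def aggregate(query: str, verticals: list[Vertical]) -> list[tuple[str, str]]:
        # (VS) Vertical Selection: decide which verticals to present.
        selected = [v for v in verticals if relevant_to(query, v)]
        # (IS) Item Selection: pick the top items from each selected vertical.
        blocks = {v.name: v.search(query)[:2] for v in selected}
        # (RP) Result Presentation: order the vertical blocks on the final page.
        page = []
        for name in sorted(blocks, key=lambda n: presentation_score(query, n),
                           reverse=True):
            page.extend((name, item) for item in blocks[name])
        return page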
  4. Motivation: Evaluating the Evaluation (Meta-evaluation).
     • Aggregated Search (AS) metrics:
       – model four AS compounding factors;
       – differ in the way they model each factor and combine them;
       – how well the metrics capture and combine those factors remains poorly understood.
     • Focus: we meta-evaluate AS metrics on two criteria:
       – Reliability: the ability to detect "actual" performance differences;
       – Intuitiveness: the ability to capture any property deemed important (an AS component).
  5. Overview.
  6. Factors: Compounding Factors.
     • (VS) Vertical Selection
     • (IS) Item Selection
     • (RP) Result Presentation
     • (VD) Vertical Diversity
     Example preferences between three AS pages A, B and C:
     VS (A > B, C): image preference; IS (C > A, B): more relevant items; RP (B > A, C): relevant items at top; VD (C > A, B): diverse information.
  7. Overview.
  8. Metrics.
     • Traditional IR:
       – homogeneous ranked list.
     • Adapted diversity-based IR (novelty vs. orientation vs. diversity emphasized; position-discounted vs. set-based):
       – treat each vertical as an intent;
       – adapt the ranked list to a block-based one;
       – normalize by the "ideal" AS page.
     • Aggregated search (position vs. user-tolerance vs. cascade based):
       – utility-effort aware framework.
     • Single AS component (key components: VS vs. IS vs. RP vs. VD):
       – VS: vertical precision;
       – VD: vertical (intent) recall;
       – IS: mean precision of vertical items;
       – RP: Spearman's correlation with the "ideal" AS page.
     Standard parameter settings are used [Zhou et al. SIGIR'12].
     K. Zhou, R. Cummins, M. Lalmas and J.M. Jose. Evaluating aggregated search pages. In SIGIR, 115-124, 2012.
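The four single-component gold standards are simple enough to sketch directly. The following is a hedged reading of the slide's one-line definitions (vertical precision, vertical/intent recall, mean item precision per vertical, and Spearman's correlation with the ideal page ordering); the exact formulations used in the paper may differ in detail.

    def vs_precision(shown, preferred):
        """VS: fraction of presented verticals that are preferred ones."""
        shown, preferred = set(shown), set(preferred)
        return len(shown & preferred) / len(shown) if shown else 0.0

    def vd_recall(shown, preferred):
        """VD: fraction of preferred verticals (intents) that are presented."""
        shown, preferred = set(shown), set(preferred)
        return len(shown & preferred) / len(preferred) if preferred else 0.0

    def is_mean_precision(items_by_vertical, relevant):
        """IS: mean, over presented verticals, of the precision of their
        items; relevant is a set of relevant item ids."""
        precisions = [sum(i in relevant for i in items) / len(items)
                      for items in items_by_vertical.values() if items]
        return sum(precisions) / len(precisions) if precisions else 0.0

    def rp_spearman(page, ideal):
        """RP: Spearman's rank correlation between a page's block ordering
        and the 'ideal' ordering (both lists of the same vertical names)."""
        n = len(page)
        ideal_rank = {v: r for r, v in enumerate(ideal)}
        d2 = sum((r - ideal_rank[v]) ** 2 for r, v in enumerate(page))
        return 1 - 6 * d2 / (n * (n * n - 1)) if n > 1 else 1.0

For instance, rp_spearman(["news", "image", "web"], ["image", "news", "web"]) evaluates to 0.5.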
  9. Overview.
  10. Experiment Setup.
     • Two aggregated search test collections:
       – VertWeb'11 (built by classifying the ClueWeb09 collection);
       – FedWeb'13 (TREC) -> the collection we report our experiments on.
     • Verticals:
       – cover a variety of 11 verticals employed by three major commercial search engines (e.g. News, Image, etc.).
     • Topics and assessments:
       – topics reused from the TREC Web and Million Query tracks -> 50 topics;
       – vertical orientation assessments (type of information);
       – topical relevance assessments of items (traditional document relevance).
     • Simulated AS systems:
       – implement state-of-the-art AS components;
       – vary the combination of component systems to form the final AS system;
       – 36 AS systems in total.
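The slide does not spell out how the 36 simulated systems factor into component choices, so the module names and the 4 x 3 x 3 split below are purely hypothetical; the sketch only illustrates the mechanism of crossing alternative VS, IS and RP implementations to obtain the pool of systems.

    from itertools import product

    # Hypothetical alternative implementations per component; any split
    # whose product is 36 would do (the paper does not state the split).
    VS_MODULES = ["vs_oracle", "vs_clf_a", "vs_clf_b", "vs_random"]   # 4
    IS_MODULES = ["is_bm25", "is_lm", "is_random"]                    # 3
    RP_MODULES = ["rp_oracle", "rp_learned", "rp_random"]             # 3

    systems = [{"VS": vs, "IS": isel, "RP": rp}
               for vs, isel, rp in product(VS_MODULES, IS_MODULES, RP_MODULES)]
    assert len(systems) == 36   # matches the total reported on the slide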
  11. Overview.
  12. Methodology: Discriminative Power (Reliability).
     • Discriminative power:
       – reflects a metric's robustness to variation across topics;
       – measured by conducting a statistical significance test over all pairs of systems and counting the number of significantly different pairs.
     • Randomized Tukey's Honestly Significant Difference (HSD) test [Carterette TOIS'12]:
       – uses the observed data and computational power to estimate the distributions;
       – conservative in nature.
     Main idea: if the largest mean difference between systems observed is not significant, then none of the other differences should be significant either.
     B. Carterette. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. TOIS, 30-1, 2012.
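Below is a minimal sketch of the randomized Tukey HSD idea summarized above, assuming a topic-by-system matrix of per-topic metric scores; it is an illustrative reading of Carterette's procedure, not the paper's exact implementation. Each trial permutes the scores within every topic and records the largest mean difference between systems; a pair's achieved significance level (ASL) is the share of trials in which that largest difference reaches the pair's observed difference, which is what makes the test conservative.

    import numpy as np

    def randomized_tukey_hsd(scores: np.ndarray, n_trials: int = 1000,
                             seed: int = 0) -> np.ndarray:
        """scores: (n_topics, n_systems) matrix of per-topic metric scores.
        Returns an (n_systems, n_systems) matrix of ASL values: each pair's
        observed mean difference is compared against the distribution of
        the LARGEST mean difference under within-topic permutations."""
        rng = np.random.default_rng(seed)
        observed = np.abs(scores.mean(0)[:, None] - scores.mean(0)[None, :])
        max_diffs = np.empty(n_trials)
        for t in range(n_trials):
            # permute the system scores independently within each topic
            permuted = np.array([rng.permutation(row) for row in scores])
            means = permuted.mean(0)
            max_diffs[t] = np.abs(means[:, None] - means[None, :]).max()
        # ASL(i, j) = fraction of trials whose largest difference is at
        # least as big as the observed difference between systems i and j.
        return (max_diffs[None, None, :] >= observed[:, :, None]).mean(2)

Counting the pairs whose ASL falls below a threshold such as 0.05 gives a metric's discriminative power; the sorted ASL values per metric are what the curves on the next slides plot.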
  13. Results: Discriminative Power Results.
     (Slide figure: ASL curves, one per metric. Y-axis: ASL (p-value, 0 to 0.10); X-axis: run pairs sorted by ASL. Left panel: traditional IR and single-component metrics; right panel: adapted diversity and aggregated search metrics.)
     • The most discriminative metrics are those closer to the origin in the figures.
     • Traditional & single component << adapted diversity & aggregated search.
     ASL: Achieved Significance Level. Let "M1 << M2" denote "M2 outperforms M1 in terms of discriminative power."
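The curves themselves are straightforward to reproduce from the pairwise ASL matrix; below is a sketch with the axes described above (asl_ndcg and the other inputs are hypothetical outputs of randomized_tukey_hsd, one per metric).

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_asl_curve(asl_matrix: np.ndarray, label: str) -> None:
        """One curve per metric: its run-pair ASLs in increasing order."""
        iu = np.triu_indices_from(asl_matrix, k=1)   # each system pair once
        plt.plot(sorted(asl_matrix[iu]), label=label)

    # plot_asl_curve(asl_ndcg, "nDCG"); plot_asl_curve(asl_vs, "VS"); ...
    # plt.ylim(0, 0.10); plt.xlabel("run pairs sorted by ASL")
    # plt.ylabel("ASL (p-value)"); plt.legend(); plt.show()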
  14. Results: Discriminative Power Results (Single component & Traditional).
     (Slide figure: ASL curves. Y-axis: ASL (p-value); X-axis: run pairs sorted by ASL.)
     VS << VD << (IS, P@10) << (nDCG, RP).
     • Single-component metrics perform comparatively well.
     • RP is the most discriminative single-component metric.
     • VS is the least discriminative single-component metric.
     • nDCG performs better than P@10 and the other single-component metrics.
  15. Results: Discriminative Power Results (Adapted diversity & Aggregated search).
     (Slide figure: ASL curves. Y-axis: ASL (p-value); X-axis: run pairs sorted by ASL.)
     IA-nDCG << D#-nDCG << (ASRBP, α-nDCG) << ASDCG << ASERR.
     • AS metrics (utility-effort based) are generally more discriminative than the adapted diversity metrics.
     • ASERR (cascade model) outperforms ASDCG (position-based) and ASRBP (tolerance-based).
     • IA-nDCG (orientation emphasized) and D#-nDCG (diversity emphasized) are the least discriminative metrics.
  16. Overview.
  17. Methodology: Concordance Test (Intuitiveness).
     • Highly discriminative metrics, while desirable, may not necessarily measure everything that we may want measured.
     • Understanding how each key component is captured by a metric, in the context of AS:
       – (VS) Vertical Selection: select correct verticals;
       – (VD) Vertical Diversity: promote multiple vertical results;
       – (IS) Item Selection: select relevant items;
       – (RP) Result Presentation: embed verticals correctly.
  18. Methodology: Concordance Test [Sakai, WWW'12].
     • Concordance test:
       – computes relative concordance scores for a given pair of metrics and a gold-standard metric;
       – the gold-standard metric should represent a basic property that we want the candidate metrics to satisfy;
       – four simple gold-standard (single-component) metrics: VS, VD, IS, RP; simple and therefore agnostic to metric differences (e.g. different position-based discounting).
     (Slide figure: of the run pairs on which Metric 1 and Metric 2 disagree, the gold-standard simple metric sides with Metric 1 on 60% and with Metric 2 on 40%.)
     T. Sakai. Evaluation with informational and navigational intents. In WWW, 499-508, 2012.
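A hedged sketch of the concordance computation: following the slide, only run pairs on which the two candidate metrics disagree count, and a candidate scores a point when its preference matches every supplied gold-standard metric, so passing one gold metric gives the single-component test and passing several gives the multi-component test reported later. Names and tie-handling details are illustrative rather than the paper's exact protocol.

    def sign(x: float) -> int:
        return (x > 0) - (x < 0)

    def concordance(m1_deltas, m2_deltas, gold_deltas_list):
        """m1_deltas / m2_deltas: per-run-pair score differences (system A
        minus system B) under the two candidate metrics.
        gold_deltas_list: one such list per gold-standard metric (VS, ...).
        Returns the relative concordance of each candidate over the pairs
        where the candidates disagree, in the spirit of [Sakai, WWW'12]."""
        c1 = c2 = disagreements = 0
        for i, (d1, d2) in enumerate(zip(m1_deltas, m2_deltas)):
            s1, s2 = sign(d1), sign(d2)
            if s1 == 0 or s2 == 0 or s1 == s2:
                continue                  # keep only strict disagreements
            disagreements += 1
            gold_signs = [sign(g[i]) for g in gold_deltas_list]
            c1 += all(s1 == gs for gs in gold_signs)
            c2 += all(s2 == gs for gs in gold_signs)
        if disagreements == 0:
            return float("nan"), float("nan")
        return c1 / disagreements, c2 / disagreements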
  19. Results: Concordance Test Results (capturing each individual key AS component).
     • Concordance with VS:
       – IA-nDCG > ASRBP > ASDCG > D#-nDCG > (ASERR, α-nDCG);
       – the intent-aware (IA) metric (orientation emphasized) and the AS metrics (utility-effort based) perform best.
     • Concordance with VD:
       – D#-nDCG > IA-nDCG > (ASDCG, ASRBP, ASERR) > α-nDCG;
       – the D# (diversity emphasized) and IA (orientation emphasized) frameworks work best.
     Let "M1 > M2" denote "M1 statistically significantly outperforms M2 in terms of concordance with a given gold-standard metric."
  20. Results: Concordance Test Results (capturing each individual key AS component).
     • Concordance with IS:
       – (ASRBP, D#-nDCG) > ASDCG > IA-nDCG > ASERR > α-nDCG;
       – ASRBP (tolerance-based AS metric) and D# (diversity emphasized) metrics perform best.
     • Concordance with RP:
       – α-nDCG > ASERR > ASDCG > ASRBP > D#-nDCG > IA-nDCG;
       – α-nDCG (novelty emphasized) and ASERR (cascade AS metric) work best.
     • However, α-nDCG (novelty emphasized) and ASERR (cascade AS metric) consistently perform worst with respect to VS, VD and IS.
  21. Results: Concordance Test Results (capturing multiple key AS components).
     • Concordance with VS and IS:
       – ASRBP > D#-nDCG > (ASDCG, IA-nDCG) > ASERR > α-nDCG.
     • Concordance with VS, VD and IS:
       – D#-nDCG > (ASRBP, IA-nDCG) > ASDCG > ASERR > α-nDCG.
     • Concordance with all (VS, VD, IS and RP):
       – ASRBP > D#-nDCG > (ASDCG, IA-nDCG) > ASERR > α-nDCG.
     • ASRBP (tolerance-based AS metric) and D#-nDCG (diversity emphasized) perform best when combining all components.
     • Metrics that capture key AS components (e.g. VS) have advantages over those that do not (e.g. α-nDCG).
  22. Conclusions: Final takeaway.
     • In terms of discriminative power:
       – RP is the most discriminative metric for evaluation among the four AS components;
       – AS and novelty-emphasized metrics are superior to diversity- and orientation-emphasized metrics.
     • In terms of intuitiveness:
       – the tolerance-based AS metric (ASRBP) and the diversity-emphasized metric (D#-nDCG) are the most intuitive metrics with respect to all AS components.
     • Overall, the tolerance-based AS metric is the most discriminative and intuitive metric.
     • We propose a comprehensive approach for evaluating the intuitiveness of metrics that takes the special aspects of aggregated search into account.
  23. Future Work.
     • Compare with meta-evaluation results from human subjects to test the reliability of our approach and results.
     • Propose a more principled evaluation framework to incorporate and combine the key AS factors (VS, VD, IS, RP).
     • You are welcome to participate in the TREC FedWeb 2014 task (a continuation of FedWeb 2013: https://sites.google.com/site/trecfedweb/)!