Which Vertical Search Engines are Relevant? Understanding Vertical Relevance Assessments for Web Queries

Aggregating search results from a variety of heterogeneous
sources, so-called verticals, such as news, image and video,
into a single interface is a popular paradigm in web search.
Current approaches that evaluate the effectiveness of aggregated
search systems are based on rewarding systems that
return highly relevant verticals for a given query, where this
relevance is assessed under different assumptions. It is difficult
to evaluate or compare those systems without fully
understanding the relationship between those underlying assumptions.
To address this, we present a formal analysis and
a set of extensive user studies to investigate the effects of various
assumptions made for assessing query vertical relevance.
A total of more than 20,000 assessments on 44 search tasks
across 11 verticals are collected through Amazon Mechanical
Turk and subsequently analysed. Our results provide
insights into various aspects of query vertical relevance and
allow us to explain in more depth, as well as to question, the
evaluation results published in the literature.

Work with Ke (Adam) Zhou, Ronan Cummins and Joemon Jose.
Presented at WWW 2013, Rio de Janeiro.

Which Vertical Search Engines are Relevant? Understanding Vertical Relevance Assessments for Web Queries

  1. Which Vertical Search Engines are Relevant? Understanding Vertical Relevance Assessments for Web Queries
     Ke Zhou (University of Glasgow), Ronan Cummins (University of Greenwich), Mounia Lalmas (Yahoo! Labs Barcelona), Joemon M. Jose (University of Glasgow)
     WWC 2013, Rio de Janeiro

  2. Aggregated Search [Motivation]
     • Diverse search verticals (image, video, news, etc.) are available on the web.
     • Aggregating (embedding) vertical results into the "general web" results has become the de facto approach in commercial web search engines.
     [Figure: results from vertical search engines embedded into general web search results (vertical selection).]

  4. Evaluation of Aggregated Search [Motivation]
     • Evaluation is based solely on vertical selection.
     • Compare the system prediction set against a user annotation set.
     • Annotations are gathered
       – explicitly (by assessing)
       – implicitly (by deriving them from search logs)
     [Figure: for the topic "yoga poses", the assessor's annotations rank System C > System B > System A.]

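To make the comparison of a system's predicted verticals against a user annotation set concrete, here is a minimal sketch in Python. The query, the vertical names and the set-based precision/recall/F1 scoring are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: scoring a system's predicted relevant verticals against
# a user annotation set. Vertical names and the set-based precision/recall/F1
# scoring are illustrative assumptions.

def score_vertical_selection(predicted: set[str], annotated: set[str]) -> dict[str, float]:
    """Set-based precision, recall and F1 for one query."""
    tp = len(predicted & annotated)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(annotated) if annotated else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical query "yoga poses": assessors marked Image and Video as relevant.
annotation = {"image", "video"}
for system, prediction in {"A": {"news"}, "B": {"image"}, "C": {"image", "video"}}.items():
    print(system, score_vertical_selection(prediction, annotation))
```

Under these toy annotations the systems order as C > B > A by F1, matching the example ranking on the slide.
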
  5. Assessor: which vertical search engines are relevant? [Motivation]
     • The definition of relevance of a vertical, given a query, remains complex.
       – Different work makes different assumptions.
       – The underlying assumptions made may have a major effect on the evaluation of a SERP.
     • We want to understand different vertical assessment processes and investigate the impact of these.

  7. (RQ1) Assumptions: user perspective [Problem and Previous Work]
     • Pre-retrieval:
       – Vertical orientation: before issuing the query, the user thinks about which verticals might provide better results.
     • Post-retrieval:
       – After viewing the search results, the user considers which vertical provides better results.
     • Influencing factors
       – Vertical orientation (type preference)
       – Within-vertical ranking
       – Serendipity
       – Visual attractiveness
     [Figure: pre-retrieval user need vs. post-retrieval user perspective.]

  11. (RQ2) Assumptions: dependency of relevance [Problem and Previous Work]
      • Inter-dependent approach:
        – the quality of verticals is relative and dependent on each other.
      • Web-anchor approach:
        – the quality of the "general web" results serves as a reference criterion for deciding relevance.
      • Context
        – Does the context (results returned from other verticals) affect a user's perception of the relevance of the vertical of interest?
      • Utility vs. effort

  16. (RQ3) Assumptions: assessment grade [Problem and Previous Work]
      • Binary (pairwise) preference (ToP or NS)
      • Multi-grade preference (ToP, MoP, BoP or NS)
      • SERP slots:
        – ToP: top of the page
        – MoP: middle of the page
        – BoP: bottom of the page
        – NS: not shown
      • Is the binary (pairwise) preference information provided by a population of users able to predict the "perfect" embedding position of a vertical?

  20. Experimental Design Overview
      • Manipulation (independent) variables
        – Search tasks
        – Verticals of interest
        – User perspective (Study 1: RQ1)
        – Dependency of relevance (Study 2: RQ2)
        – Assessment grade (Study 3: RQ3)
      • Dependent variables
        – Inter-assessor agreement, measured by Fleiss' kappa (KF)
        – Vertical relevance correlation, measured by Spearman correlation

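As a concrete illustration of the two dependent variables, the following minimal sketch computes Fleiss' kappa over per-item category counts and the Spearman correlation between two vertical relevance score lists. The toy data and the hand-rolled kappa implementation are assumptions for illustration, not the study's analysis code; SciPy is used only for the correlation.

```python
import numpy as np
from scipy.stats import spearmanr

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) matrix of rating counts.
    Each row must sum to the same number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]
    p_j = counts.sum(axis=0) / (n_items * n_raters)                # category proportions
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()
    return (p_bar - p_e) / (1.0 - p_e)

# Toy example: 4 query-vertical pairs, 5 assessors each, categories [relevant, not relevant].
ratings = np.array([[5, 0], [3, 2], [1, 4], [4, 1]])
print("Fleiss' kappa:", round(fleiss_kappa(ratings), 3))

# Toy example: Spearman correlation between two vertical relevance score lists for one query.
rho, p = spearmanr([0.9, 0.7, 0.4, 0.2, 0.1], [0.8, 0.5, 0.6, 0.1, 0.2])
print("Spearman rho:", round(rho, 3), "p =", round(p, 3))
```
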
  22. Experiment Design Details
      • Crowd-sourced data collection
        – We hire crowd-sourced workers on Amazon Mechanical Turk to make the assessments.
      • Verticals
        – Cover a variety of 11 verticals employed by three major commercial search engines.
        – Use existing commercial vertical search engines.
      • Search tasks
        – 44 tasks covering a variety of (vertical) intents
        – Taken from an existing aggregated search collection (TREC)
      • Quality control
        – 4 assessment points per manipulation
        – Trap HITs (assessment page with results of other queries)
        – Trap search tasks (assessment page with an explicit vertical request)

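The trap-based quality control can be pictured with a small sketch; the field names, worker IDs and the rule of discarding every assessment from a worker who fails any trap are assumptions for illustration, not the actual filtering pipeline.

```python
# Illustrative sketch of crowdsourcing quality control: discard workers who fail
# trap HITs (results from a different query) or trap tasks (explicit vertical request).
# Field names and the "any failure rejects the worker" rule are assumptions.

from dataclasses import dataclass

@dataclass
class Assessment:
    worker_id: str
    is_trap: bool          # trap HIT or trap search task
    passed_trap: bool      # did the worker give the expected answer on the trap?

def reliable_workers(assessments: list[Assessment]) -> set[str]:
    failed = {a.worker_id for a in assessments if a.is_trap and not a.passed_trap}
    return {a.worker_id for a in assessments} - failed

data = [
    Assessment("w1", is_trap=True, passed_trap=True),
    Assessment("w2", is_trap=True, passed_trap=False),   # w2's work would be discarded
    Assessment("w3", is_trap=False, passed_trap=True),
]
print(reliable_workers(data))   # {'w1', 'w3'}
```
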
  26. Experimental Design: Study 1 (RQ1)
      • Manipulated variable: user perspective.
      • Dependent variables: inter-assessor agreement (Fleiss' kappa) and vertical relevance correlation (Spearman correlation).

  27. Study 1 Results: pre-retrieval vs. post-retrieval [Experimental Results]
      • Both pre-retrieval and post-retrieval inter-assessor agreement is moderate, and assessors have a similar level of difficulty in assessing for both.
      • Vertical relevance is moderately (but significantly) correlated (0.53) between pre-retrieval and post-retrieval.
      • Highly relevant verticals derived pre-retrieval and post-retrieval overlap significantly.
        – Almost 60% overlap on at least 2 of the 3 top verticals.
      • There is a bias towards visually salient verticals for post-retrieval search utility.

  31. Study 1 Results: topical relevance vs. pre-retrieval orientation [Experimental Results]
      • Vertical relevance between pre-retrieval orientation and post-retrieval is moderately correlated (0.53).
      • Vertical relevance between topical relevance and post-retrieval is weakly correlated (0.36).
      • The impact of pre-retrieval orientation on post-retrieval search utility is greater than that of post-retrieval topical relevance.
      [Figure: verticals partitioned by pre-retrieval orientation and topical relevance (nDCG(vi) vs. nDCG(w)), compared against post-retrieval search utility.]

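One plausible reading of the nDCG(vi) vs. nDCG(w) comparison in the figure is that a vertical counts as topically relevant when its own ranking scores at least as well as the general web ranking. The sketch below illustrates that reading; the gain values and the ">=" decision rule are assumptions for illustration, not the paper's exact definition.

```python
# Sketch: treat a vertical as topically relevant for a query when its nDCG is
# at least that of the general web ranking. Gains and the ">=" rule are assumptions.

import math

def dcg(gains: list[float]) -> float:
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains: list[float]) -> float:
    ideal = dcg(sorted(gains, reverse=True))   # ideal DCG over the same gains
    return dcg(gains) / ideal if ideal > 0 else 0.0

web_gains = [3, 2, 0, 1, 0]                                    # graded relevance of general web results
vertical_gains = {"image": [3, 3, 2, 1, 0], "news": [0, 1, 0, 0, 0]}

ndcg_w = ndcg(web_gains)
for vertical, gains in vertical_gains.items():
    topically_relevant = ndcg(gains) >= ndcg_w
    print(f"{vertical}: nDCG={ndcg(gains):.3f} vs web {ndcg_w:.3f} -> relevant={topically_relevant}")
```
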
  34. Experimental Design: Study 2 (RQ2)
      • Manipulated variable: dependency of relevance.
      • Dependent variables: inter-assessor agreement (Fleiss' kappa) and vertical relevance correlation (Spearman correlation).

  35. Study 2 (Dependency of Relevance) Results [Experimental Results]
      • Inter-assessor agreement is moderate for both approaches, with little difference between them.
      • The vertical relevance correlation between the inter-dependent and web-anchor approaches is moderate (0.573).
      • The overlap of the top-three relevant verticals between the two approaches is quite high.
        – More than 70% overlap on 2 of the 3 top verticals.
      • The web-anchor approach provides a better trade-off between utility and effort.

  39. Study 2 (Dependency of Relevance) Results [Experimental Results]
      • Not much difference is observed when using different anchors (different observed topical relevance levels).
      • Context matters
        – The context of other verticals can diminish the utility of a vertical.
        – Examples: ("Answer", "Wiki"), ("Books", "Scholar"), etc.

  41. Experimental Design: Study 3 (RQ3)
      • Manipulated variable: assessment grade.
      • Dependent variables: inter-assessor agreement (Fleiss' kappa) and vertical relevance correlation (Spearman correlation).

  42. Study 3 (Assessment Grade) Results [Experimental Results]
      • Deriving the "perfect" embedding position from multi-graded assessments.
      • Thresholding for binary assessment (user type simulation)
        – Risk-seeking
        – Risk-medium
        – Risk-averse
      [Figure: majority preference over verticals, with graded slots (ToP, MoP, BoP, NS) thresholded into binary relevance at different risk levels.]

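The thresholding step can be sketched as follows; the numeric grade scores and the per-risk-level cut-offs are illustrative assumptions rather than the paper's exact values.

```python
# Sketch of thresholding multi-graded slot assessments (ToP/MoP/BoP/NS) into a
# binary "relevant vertical" decision for simulated user types. The grade scores
# and the per-risk-level cut-offs are illustrative assumptions.

GRADE_SCORE = {"ToP": 1.0, "MoP": 0.66, "BoP": 0.33, "NS": 0.0}
RISK_CUTOFF = {"risk-seeking": 0.33, "risk-medium": 0.66, "risk-averse": 1.0}

def binary_relevance(grades: list[str], risk: str) -> bool:
    """Average the assessors' graded slots, then threshold by the simulated user type."""
    mean_score = sum(GRADE_SCORE[g] for g in grades) / len(grades)
    return mean_score >= RISK_CUTOFF[risk]

assessments = {"image": ["ToP", "ToP", "MoP"], "news": ["BoP", "NS", "MoP"]}
for vertical, grades in assessments.items():
    decisions = {risk: binary_relevance(grades, risk) for risk in RISK_CUTOFF}
    print(vertical, decisions)
```

Under this sketch a risk-averse user only accepts a vertical that every assessor would place at the top of the page, while a risk-seeking user accepts any vertical that assessors would show somewhere on the SERP.
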
  43. Study 3 (Assessment Grade) Results [Experimental Results]
      • Inter-assessor agreement is moderate, and different users have different risk levels.
      • Vertical relevance correlation
        – Most binary approaches correlate significantly with the multi-graded ground truth, although mostly modestly.
        – The risk-medium thresholding approach performs best.

  45. Final take-away
      • Study 1
        – Assessing for aggregated search is difficult.
        – Highly relevant verticals overlap significantly between the pre-retrieval and post-retrieval user perspectives.
        – Vertical (type) orientation is more important than topical relevance.
      • Study 2
        – The anchor-based approach might be better than the inter-dependent approach, with respect to the utility-effort trade-off.
        – Context matters.
      • Study 3
        – A binary approach can be used to determine the "perfect" embedding position of the verticals, and it performs relatively well with relatively few assessments.

  48. Conclusions
      • We compared different vertical relevance assessment processes and analysed their impact.
      • Our work has implications for "how" and "what" evaluation design decisions affect the actual evaluation.
      • This work also creates a need to re-interpret previous evaluation efforts in this area.

  49. Questions? Thanks!
