Which Vertical Search Engines are Relevant? Understanding Vertical Relevance Assessments for Web Queries
Aggregating search results from a variety of heterogeneous
sources, so-called verticals, such as news, image and video,
into a single interface is a popular paradigm in web search.
Current approaches that evaluate the effectiveness of aggregated
search systems are based on rewarding systems that
return highly relevant verticals for a given query, where this
relevance is assessed under different assumptions. It is difficult
to evaluate or compare those systems without fully
understanding the relationship between those underlying assumptions.
To address this, we present a formal analysis and
a set of extensive user studies to investigate the effects of various
assumptions made for assessing query vertical relevance.
A total of more than 20,000 assessments on 44 search tasks
across 11 verticals are collected through Amazon Mechanical
Turk and subsequently analysed. Our results provide
insights into various aspects of query vertical relevance and
allow us both to explain in more depth and to question the
evaluation results published in the literature.

Work with Ke (Adam) Zhou, Ronan Cummins and Joemon Jose.
Presented at WWW 2013, Rio de Janeiro.

Presentation Transcript

  • Which Vertical Search Engines are Relevant? Understanding Vertical Relevance Assessments for Web Queries
    Ke Zhou (University of Glasgow), Ronan Cummins (University of Greenwich), Mounia Lalmas (Yahoo! Labs Barcelona), Joemon M. Jose (University of Glasgow)
    WWW 2013, Rio de Janeiro
  • Motivation: Aggregated Search
    – Diverse search verticals (image, video, news, etc.) are available on the web.
    – Aggregating (embedding) vertical results into "general web" results has become the de facto standard in commercial web search engines.
    – [Slide figure: vertical search engines, general web search, vertical selection.]
  • Motivation: Evaluation of Aggregated Search
    – Evaluation is solely based on vertical selection.
    – Compare the system prediction set against a user annotation set (see the sketch after this slide).
    – Annotation is gathered explicitly (assessing) or implicitly (derived from search logs).
    – [Slide figure: an assessor judges the topic "yoga poses" for Systems A, B and C, yielding a preference ordering over the systems, e.g. System C > System B > System A.]
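As a loose illustration of "compare the system prediction set against a user annotation set", the sketch below scores a predicted vertical set against an annotated one with set precision, recall and F1. This is only one plausible set-based measure; the function name, the systems and the verticals are illustrative assumptions, not taken from the paper.

    # Hedged sketch: evaluating vertical selection by comparing a system's
    # predicted vertical set against an assessor annotation set.
    def set_prf(predicted: set, annotated: set):
        """Precision / recall / F1 of a predicted vertical set vs. annotations."""
        if not predicted or not annotated:
            return 0.0, 0.0, 0.0
        tp = len(predicted & annotated)          # verticals both predicted and annotated
        precision = tp / len(predicted)
        recall = tp / len(annotated)
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        return precision, recall, f1

    # Toy example for the topic "yoga poses": assessors marked these verticals as relevant.
    annotation = {"image", "video", "wiki"}
    for name, prediction in {"System A": {"news"},
                             "System B": {"image", "news"},
                             "System C": {"image", "video"}}.items():
        print(name, set_prf(prediction, annotation))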
  • Motivation: Assessor view (which vertical search engines are relevant?)
    – The definition of the relevance of a vertical, given a query, remains complex.
    – Different work makes different assumptions, and the underlying assumptions made may have a major effect on the evaluation of a SERP.
    – We want to understand different vertical assessment processes and investigate their impact.
  • (RQ1) Assumptions: user perspective (Problem and Previous Work)
    – Pre-retrieval (vertical orientation): before issuing the query, the user thinks about which verticals might provide better results.
    – Post-retrieval: after viewing the search results, the user considers which vertical provides better results.
    – Influencing factors: vertical orientation (type preference), within-vertical ranking, serendipity, visual attractiveness.
  • (RQ2) Assumptions: dependency of relevance (Problem and Previous Work)
    – Inter-dependent approach: the quality of the verticals is relative, and they are assessed against each other.
    – Web-anchor approach: the quality of the "general web" results serves as a reference criterion for deciding relevance.
    – Context: does the context (results returned from other verticals) affect a user's perception of the relevance of the vertical of interest?
    – Utility vs. effort.
  • (RQ3) Assumptions: assessment grade (Problem and Previous Work)
    – Binary (pairwise) preference: ToP (top of the page) or NS (not shown).
    – Multi-grade preference over three possible SERP slots plus NS: ToP (top of the page), MoP (middle of the page), BoP (bottom of the page), or NS (not shown).
    – Is the binary (pairwise) preference information provided by a population of users able to predict the "perfect" embedding position of a vertical?
  • Experimental Design Overview
    – Manipulated (independent) variables: search tasks; verticals of interest; user perspective (Study 1: RQ1); dependency of relevance (Study 2: RQ2); assessment grade (Study 3: RQ3).
    – Dependent variables: inter-assessor agreement, measured by Fleiss' Kappa (KF); vertical relevance correlation, measured by Spearman correlation (see the sketch after this slide).
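For concreteness, the sketch below shows one way the two dependent variables could be computed: Fleiss' Kappa over a matrix of per-item category counts, and Spearman correlation between two vertical-relevance score vectors. The toy data and variable names are illustrative assumptions; the paper's exact aggregation of assessments is not reproduced here.

    # Hedged sketch: Fleiss' kappa (inter-assessor agreement) and Spearman
    # correlation (between two vertical-relevance rankings for the same query).
    import numpy as np
    from scipy.stats import spearmanr

    def fleiss_kappa(counts: np.ndarray) -> float:
        """counts[i, j] = number of assessors who gave item i category j."""
        n = counts.sum(axis=1)[0]                  # assessors per item (assumed constant)
        p_j = counts.sum(axis=0) / counts.sum()    # overall category proportions
        P_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
        P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
        return (P_bar - P_e) / (1 - P_e)

    # Toy example: 5 queries x 3 relevance grades, 4 assessors per query.
    counts = np.array([[4, 0, 0], [2, 2, 0], [1, 2, 1], [0, 1, 3], [0, 0, 4]])
    print("Fleiss' kappa:", round(fleiss_kappa(counts), 3))

    # Spearman correlation between two per-vertical relevance scores
    # (e.g., pre-retrieval vs. post-retrieval) for the same query.
    pre_retrieval  = [0.9, 0.2, 0.7, 0.1, 0.5]
    post_retrieval = [0.8, 0.3, 0.6, 0.2, 0.4]
    rho, p = spearmanr(pre_retrieval, post_retrieval)
    print("Spearman rho:", round(rho, 3))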
  • Experiment Design Details
    – Crowd-sourced data collection: workers are hired on Amazon Mechanical Turk to make assessments.
    – Verticals: 11 verticals covering those employed by three major commercial search engines; existing commercial vertical search engines are used.
    – Search tasks: 44 tasks covering a variety of (vertical) intents, drawn from an existing aggregated search collection (TREC).
    – Quality control: 4 assessment points per manipulation; trap HITs (assessment pages with results of other queries); trap search tasks (assessment pages with an explicit vertical request).
  • Experimental Design: Study 1
    – Same design as the overview, with the user-perspective manipulation (RQ1) as the variable under study; dependent variables are inter-assessor agreement (Fleiss' Kappa) and vertical relevance correlation (Spearman).
  • Study 1 Results: pre-retrieval vs. post-retrieval (Experimental Results)
    – Both pre-retrieval and post-retrieval inter-assessor agreements are moderate, and assessors have a similar level of difficulty in assessing for both.
    – Vertical relevance is moderately (but significantly) correlated (0.53) between pre-retrieval and post-retrieval.
    – Highly relevant verticals derived pre-retrieval and post-retrieval overlap significantly: almost 60% overlap on at least 2 out of the 3 top verticals.
    – There is a bias towards visually salient verticals in post-retrieval search utility.
  • Study 1 Results: topical relevance vs. pre-retrieval orientation (Experimental Results)
    – Vertical relevance between pre-retrieval orientation and post-retrieval is moderately correlated (0.53).
    – Vertical relevance between topical relevance and post-retrieval is weakly correlated (0.36).
    – The impact of pre-retrieval orientation on post-retrieval search utility is greater than that of post-retrieval topical relevance (a rough sketch of the nDCG comparison follows this slide).
    – [Slide figure: orientation and topical relevance (nDCG(vi) vs. nDCG(w)) contributing to post-retrieval search utility.]
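One reading of the figure's nDCG(vi) vs. nDCG(w) labels is that a vertical's topical relevance is judged by comparing the nDCG of its top results against the nDCG of the general-web results for the same query. The sketch below shows that comparison on toy graded judgements; the gain values, the cutoff and the simple comparison rule are illustrative assumptions, not the paper's exact procedure.

    # Hedged sketch: comparing nDCG of a vertical's results against the
    # general-web results as a rough proxy for topical relevance.
    import math

    def dcg(gains, k=10):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

    def ndcg(gains, k=10):
        ideal = dcg(sorted(gains, reverse=True), k)
        return dcg(gains, k) / ideal if ideal > 0 else 0.0

    # Graded topical-relevance judgements of the top documents (toy data).
    vertical_gains = [3, 2, 2, 0, 1]   # e.g. the image vertical for this query
    web_gains      = [3, 3, 2, 1, 1]   # the general-web results

    ndcg_v, ndcg_w = ndcg(vertical_gains), ndcg(web_gains)
    print(f"nDCG(vertical)={ndcg_v:.3f}  nDCG(web)={ndcg_w:.3f}")
    print("vertical topically competitive with web:", ndcg_v >= ndcg_w)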
  • Experimental Design: Study 2
    – Same design as the overview, with the dependency-of-relevance manipulation (RQ2) as the variable under study; dependent variables are inter-assessor agreement (Fleiss' Kappa) and vertical relevance correlation (Spearman).
  • Study 2 (Dependency of Relevance) Results (Experimental Results)
    – Both inter-assessor agreements are moderate, and there is little difference in user agreement between the two approaches.
    – The vertical relevance correlation between the inter-dependent and web-anchor approaches is moderate (0.573).
    – The overlap of the top three relevant verticals between the two approaches is quite high: more than 70% overlap on 2 out of the 3 top verticals.
    – The web-anchor approach provides a better trade-off between utility and effort.
  • Study 2 (Dependency of Relevance) Results, continued (Experimental Results)
    – Not much difference is observed when using different anchors (different observed topical relevance levels).
    – Context matters: the context of other verticals can diminish the utility of a vertical. Examples: ("Answer", "Wiki"), ("Books", "Scholar"), etc.
  • Experimental Design: Study 3
    – Same design as the overview, with the assessment-grade manipulation (RQ3) as the variable under study; dependent variables are inter-assessor agreement (Fleiss' Kappa) and vertical relevance correlation (Spearman).
  • Study 3 (Assessment Grade) Results (Experimental Results)
    – Deriving the "perfect" embedding position from multi-graded assessments.
    – Thresholding for binary assessment (user-type simulation): risk-seeking, risk-medium, risk-averse (see the sketch after this slide).
    – [Slide figure: majority preference over the verticals mapped to the ToP / MoP / BoP / NS slots under the three risk levels.]
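To make the thresholding idea concrete, the sketch below maps the fraction of assessors voting to show a vertical onto one of the SERP slots, using a different set of cut points per simulated user type. The cut points, vertical names and vote fractions are illustrative assumptions, not the thresholds used in the study.

    # Hedged sketch of the Study 3 thresholding idea: a population of binary
    # (show / don't show) votes per vertical is reduced to one of the slots
    # ToP, MoP, BoP, NS by cutting the fraction of positive votes.
    from typing import List

    SLOTS = ["ToP", "MoP", "BoP", "NS"]

    # Example cut points (fraction of assessors voting "show"), one set per
    # simulated user type; higher cut points demand more evidence to promote.
    RISK_PROFILES = {
        "risk-seeking": [0.75, 0.50, 0.25],   # promotes verticals readily
        "risk-medium":  [0.85, 0.65, 0.40],
        "risk-averse":  [0.95, 0.80, 0.60],   # demands near-unanimous support
    }

    def slot_for(vote_fraction: float, cuts: List[float]) -> str:
        """Map the fraction of positive binary votes to a SERP slot."""
        for cut, slot in zip(cuts, SLOTS):
            if vote_fraction >= cut:
                return slot
        return "NS"

    votes = {"image": 0.9, "news": 0.6, "video": 0.3, "blog": 0.1}
    for profile, cuts in RISK_PROFILES.items():
        print(profile, {v: slot_for(f, cuts) for v, f in votes.items()})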
  • Study 3 (Assessment Grade) Results, continued (Experimental Results)
    – Inter-assessor agreement is moderate, and different users have different risk levels.
    – Vertical relevance correlation: most of the binary approaches correlate significantly with the multi-graded ground truth, although the correlations are mostly modest; the risk-medium thresholding approach performs best.
  • Final take-aways
    – Study 1: assessing for aggregated search is difficult; highly relevant verticals overlap significantly between the pre-retrieval and post-retrieval user perspectives; vertical (type) orientation is more important than topical relevance.
    – Study 2: the anchor-based approach might be preferable to the inter-dependent approach with respect to the utility-effort trade-off; context matters.
    – Study 3: a binary approach can be used to determine the "perfect" embedding position of the verticals, and it performs relatively well with relatively few assessments.
  • Conclusions
    – We compared different vertical relevance assessment processes and analysed their impact.
    – Our work has implications for "how" and "what" evaluation design decisions affect the actual evaluation.
    – This work also creates a need to re-interpret previous evaluation efforts in this area.
  • Questions? Thanks!