Evangelos Kanoulas "Advances in Information Retrieval Evaluation"


Published on August 22, at the seminar "RUSSIR Summer School Best Practices"

There is great interest in producing effectiveness measures that model user behavior in order to better model the utility of a system to its users. These measures are often formulated as a sum over the product of a discount function of ranks and a gain function mapping relevance assessments to numeric utility values. We develop a conceptual framework for analyzing such effectiveness measures based on classifying members of this broad family of measures into four distinct families, each of which reflects a different notion of system utility. This is a theory of model-based measures within which we can hypothesize about the properties that such a measure should have and test those hypotheses against user and system data.



  1. 1. Evaluating Multi-Query Sessions. Evangelos Kanoulas*, Ben Carterette+, Paul Clough*, Mark Sanderson$ (* University of Sheffield, UK; + University of Delaware, USA; $ RMIT University, Australia)
  2. 2. Why sessions? • Current evaluation framework – Assesses the effectiveness of systems over one-shot queries • Users reformulate their initial query • Still fine if … – optimizing the system for one-shot queries led to optimal performance over an entire session
  3. 3. Why sessions? When was the DuPont Science Essay Contest created? Initial query: DuPont Science Essay Contest. Reformulation: When was the DSEC created? • e.g. retrieval systems should accumulate information along a session
  4. 4. Extend the evaluation framework: from one-query evaluation to multi-query session evaluation
  5. 5. Construct appropriate test collections. Rethink evaluation measures.
  6. 6. What is the appropriate collection?
  7. 7. Test collections we built… • Text REtrieval Conference (TREC) – sponsored by NIST – many competitions; among them Session Track 2010, 2011, …
  8. 8. Test collection we built in 2010… • Corpus: ClueWeb09 – 1 billion web pages (5TB compressed) • Queries and reformulations – 150 query pairs: initial query, reformulation – 3 types of reformulations (not disclosed to participants) • Specification (52 query pairs) • Generalization (48 query pairs) • Drifting / Parallel Reformulation (50 query pairs)
  9. 9. Some criticism… • Artificial reformulations • Short reformulations – just 2 queries • No other user interaction data – clicks, dwell times, etc. • Reformulations are static (do not depend on the SE’s response) – The collection does not allow early abandonment – The reformulation itself does not change based on the SE’s response
  10. 10. Test collection in 2011 • Corpus: ClueWeb09 – 1 billion web pages (5TB compressed) • Queries and reformulations – Real users searching ClueWeb09 – 76 sessions of 2 up to 10 reformulations • Other interactions – Clicks, dwell times, mouse movements, relevance judgments • But… reformulations are still static
  11. 11. Basic test collection • A set of information needs, e.g. "What do we know about black powder ammunition?" – A static sequence of m queries: Initial query: black powder ammunition; 1st reformulation: black powder wiki; 2nd reformulation: gun powder wiki; …; (m-1)th reformulation: history of gunpowder
  12. 12. Experiment [Figure: three ranked lists (ranks 1 to 10, …), one per query: "black powder ammunition", "black powder wiki", "gun powder wiki"]
  13. 13. Evaluation over a single ranked list [Figure: the same three ranked lists (ranks 1 to 10, …) for "black powder ammunition", "black powder wiki", "gun powder wiki"]
  14. 14. Construct appropriate test collections. Rethink evaluation measures.
  15. 15. What is a good system?
  16. 16. How can we measure “goodness”?
  17. 17. Measuring “goodness”. The user steps down a ranked list of documents and observes each one of them until a decision point and either a) abandons the search, or b) reformulates. While stepping down or sideways, the user accumulates utility.
  18. 18. What are the challenges?
  19. 19. Evaluation over multiple ranked lists [Figure: three ranked lists (ranks 1 to 10, …) for "black powder ammunition", "black powder wiki", "gun powder wiki"]
  20. 20. Existing measures • Session DCG [Järvelin et al ECIR 2008]: the user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment] • Expected session utility [Yang and Lad ICTIR 2009]: the user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]
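
The deterministic session DCG idea on this slide can be illustrated with a short Python sketch. It assumes the commonly cited form in which the DCG of the q-th query is multiplied by 1 / (1 + log_bq q); the function names, the log_b(rank + 1) rank discount, and the default values of `b`, `bq`, and `k` are assumptions made for illustration, not details taken from the slides.

```python
import math

def dcg(gains, b=2):
    """Discounted cumulative gain of one ranked list (gains[0] is rank 1)."""
    # Assumption: rank discount 1 / log_b(rank + 1); other DCG variants exist.
    return sum(g / math.log(rank + 1, b) for rank, g in enumerate(gains, start=1))

def session_dcg(session_gains, k=10, b=2, bq=4):
    """Sum per-query DCG@k, discounting the q-th query by 1 / (1 + log_bq q)."""
    total = 0.0
    for q, gains in enumerate(session_gains, start=1):
        # Later reformulations contribute less: the user had to work harder.
        query_discount = 1.0 / (1.0 + math.log(q, bq))
        # The user is assumed to view exactly the top k of every ranked list.
        total += query_discount * dcg(gains[:k], b)
    return total
```

Because the cut-off `k` is fixed for every query, this sketch has no notion of early abandonment, which is the limitation the slide points out.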
  21. 21. Evaluating over paths • Optimize: model-free measures • Integrate out: model-based measures
  22. 22. Evaluation measures • Evaluating over paths • Model-free measures • Model-based measures
  23. 23. Model-free measures. The user is an oracle that knows when to reformulate. Ω(k,j): paths of length k, ending at reformulation j. Count the number of relevant docs on the optimal path ω of length k ending at query j
  24. 24. Model-free measures [Figure: relevance grid for rankings Q1, Q2, Q3 over ranks 1 to 10, …; Q1: N at every rank; Q2: R at ranks 1 to 5, N at ranks 6 to 10; Q3: R at every rank] ω(10,3): length 10, ending at 3rd query. Define: Precision@k,j; Recall@k,j; Precision@recall,j
  25. 25. Model-free measures [Figure: relevance grid for rankings Q1, Q2, Q3, with a 3-D plot; axes: precision, recall, reformulation]
  26. 26. Model-free measures [Figure: relevance grid for rankings Q1, Q2, Q3, with three precision-recall plots titled "ranking 1", "ranking 2", "ranking 3"; x-axis: recall (0.0 to 1.0), y-axis: precision (0.0 to 1.0)]
  27. 27. Model-free measures [Figure: relevance grid for rankings Q1, Q2, Q3, with a 3-D plot; axes: precision, recall, reformulation]
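
A minimal Python sketch of the oracle idea behind Precision@k,j follows: it searches all paths of length k that end at query j and keeps the one with the most relevant documents. The function names are hypothetical, and the sketch assumes binary relevance (1 for R, 0 for N) and that the user views at least one document of every query on the path; the slides do not state either detail.

```python
from itertools import accumulate

def best_path_relevant(rankings, k, j):
    """Most relevant docs on any path of length k ending at query j (1-based)."""
    # prefix[q][r] = number of relevant documents in the top r of ranking q
    prefix = [[0] + list(accumulate(r)) for r in rankings[:j]]
    # best[t] = most relevant docs seen after viewing t documents so far
    best = {0: 0}
    for pre in prefix:
        nxt = {}
        for seen, rel in best.items():
            # Assumption: at least one document is viewed per query.
            for depth in range(1, len(pre)):
                t = seen + depth
                if t <= k:
                    nxt[t] = max(nxt.get(t, -1), rel + pre[depth])
        best = nxt
    return best.get(k)  # None if no path of exactly length k exists

def precision_at(rankings, k, j):
    """Precision@k,j: relevant docs on the optimal path divided by k."""
    return best_path_relevant(rankings, k, j) / k

# Relevance columns from the slide example: Q1 all N, Q2 R at ranks 1-5, Q3 all R.
q1 = [0] * 10
q2 = [1] * 5 + [0] * 5
q3 = [1] * 10
print(precision_at([q1, q2, q3], k=10, j=3))
```

Recall@k,j would follow the same pattern, dividing by the total number of relevant documents for the information need instead of by k.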
  28. 28. Evaluation measures • Evaluating over paths • Model-free measures • Model-based measures
  29. 29. Model-based measures. Probabilistic space of users following different paths • Ω is the space of all paths • P(ω) is the probability of a user following a path ω in Ω • M_ω is a measure over a path ω. $esM = \sum_{\omega \in \Omega} P(\omega)\, M_\omega$ [Yang and Lad ICTIR 2009]
  30. 30. Model Browsing Behavior. Position-based models: the chance of observing a document depends on the position of the document in the ranked list. [Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
  31. 31. Rank Biased Precision [Moffat and Zobel, TOIS08] [Figure: flowchart over the ranked list for "black powder ammunition": Query, View Next Item, Stop]
  32. 32. Model Browsing Behavior. Cascade-based models: the chance of observing a document depends on the position of the document in the ranked list and the relevance of documents/snippets already viewed. [Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
  33. 33. Expected Reciprocal Rank [Chapelle et al CIKM09] [Figure: flowchart over the ranked list for "black powder ammunition": Query, View Next Item, Relevant? (highly / somewhat / no), Stop]
  34. 34. Expected Browsing Utility [Yilmaz et al CIKM10] $D_{EBU}(r) = P(E_r) \cdot P(C \mid R_r)$, $EBU = \sum_{r=1}^{n} D_{EBU}(r) \cdot R_r$
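
The difference between the position-based and cascade-based browsing models on the preceding slides can be made concrete with a small Python sketch of Rank Biased Precision and Expected Reciprocal Rank in their usual published forms. The function names, the persistence value `p=0.8`, and the three-level grade scale (`max_grade=2`) are assumptions chosen for illustration.

```python
def rbp(rels, p=0.8):
    """Rank Biased Precision: position-based model with persistence p."""
    # The user views rank i + 1 with probability p**i, whatever was seen before.
    return (1 - p) * sum(r * p**i for i, r in enumerate(rels))

def err(grades, max_grade=2):
    """Expected Reciprocal Rank: cascade model over graded relevance."""
    score, p_continue = 0.0, 1.0
    for rank, g in enumerate(grades, start=1):
        # Probability that the document at this rank satisfies the user.
        p_satisfied = (2**g - 1) / 2**max_grade
        score += p_continue * p_satisfied / rank
        # The user moves on only if this document did not satisfy them.
        p_continue *= 1 - p_satisfied
    return score
```

In `rbp` the chance of reaching a rank depends only on the rank, while in `err` it also shrinks with every relevant document already seen, which is the distinction the two "Model Browsing Behavior" slides draw.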
  35. 35. Probability of a path [Figure: relevance grid for rankings Q1, Q2, Q3] Joint probability of (1) abandoning at reformulation 2 and (2) reformulating at rank 3 of the first query
  36. 36. Probability of a path [Figure: relevance grid for rankings Q1, Q2, Q3] (1) Probability of abandoning at reformulation 2 × (2) Probability of reformulating at rank 3 of the first query
  37. 37. Geometric w/ parameter $p_{reform}$ [Figure: relevance grid for rankings Q1, Q2, Q3] (1) Probability of abandoning the session at reformulation i
  38. 38. Truncated geometric w/ parameter $p_{reform}$ [Figure: relevance grid for rankings Q1, Q2, Q3] (1) Probability of abandoning the session at reformulation i
  39. 39. Truncated geometric w/ parameter $p_{reform}$; geometric w/ parameter $p_{down}$ [Figure: relevance grid for rankings Q1, Q2, Q3] (2) Probability of reformulating at rank j (of reformulations 1 to i-1)
  40. 40. Model-based measures. Probabilistic space of users following different paths • Ω is the space of all paths • P(ω) is the probability of a user following a path ω in Ω • M_ω is a measure over a path ω. $esM = \sum_{\omega \in \Omega} P(\omega)\, M_\omega$
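
The expectation over paths can be sketched in Python by brute-force enumeration for small sessions. This is only an illustration under stated assumptions: the number of queries issued follows a truncated geometric distribution with parameter `p_reform`, the viewing depth in every query (including the last) follows a geometric distribution with parameter `p_down` truncated at the list length, and the two are independent. The function names and the treatment of the last query are hypothetical choices, not taken from the slides.

```python
from itertools import product

def truncated_geometric(p, n):
    """P(X = i) for i = 1..n: continue with probability p, forced stop at n."""
    probs = [p ** (i - 1) * (1 - p) for i in range(1, n)]
    probs.append(p ** (n - 1))  # the remaining mass goes to the last outcome
    return probs

def expected_session_measure(rankings, p_reform, p_down, measure):
    """esM = sum over paths of P(path) * M(path), by explicit enumeration."""
    m = len(rankings)
    p_queries = truncated_geometric(p_reform, m)
    total = 0.0
    for i in range(1, m + 1):  # the user issues i queries in total
        depth_probs = [truncated_geometric(p_down, len(r)) for r in rankings[:i]]
        depth_choices = [range(1, len(r) + 1) for r in rankings[:i]]
        for depths in product(*depth_choices):
            # Assumption: session length and viewing depths are independent.
            p_path = p_queries[i - 1]
            for probs, d in zip(depth_probs, depths):
                p_path *= probs[d - 1]
            # The path is the concatenation of the viewed prefix of each list.
            path = [doc for r, d in zip(rankings, depths) for doc in r[:d]]
            total += p_path * measure(path)
    return total
```

For example, `expected_session_measure(rankings, 0.6, 0.7, sum)` gives the expected number of relevant documents seen when `rankings` holds binary relevance lists. Enumeration grows exponentially with the number of queries, so this form is only practical for toy sessions.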
  41. 41. Evaluation measures • Evaluating over paths • Model-free measures • Model-based measures
  42. 42. Evaluation measures • Properties – How do the new measures correlate with previously introduced ones? – Do they behave as expected, i.e. do they reward early retrieval of relevant documents?
  43. 43. Correlations • TREC 2010 Session track [Figure: two scatter plots with nsDCG on the x-axis. nsDCG vs. esNDCG, Kendall's tau: 0.7972; nsDCG vs. esAP, Kendall's tau: 0.5247]
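
Kendall's tau, the rank correlation reported on this slide, compares how two measures order the same set of systems. The Python sketch below computes the simple tau-a variant from two lists of per-system scores; the function name and the choice of tau-a (ties count as neither concordant nor discordant) are assumptions, since the slides do not say which variant was used.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a between two score lists over the same systems."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1  # both measures order this pair the same way
        elif s < 0:
            discordant += 1  # the measures disagree on this pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value of 1 means the two measures rank all systems identically, and -1 means one ranking is the exact reverse of the other.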
  44. 44. Reward early retrieval • TREC9 Query track – 50 topics and 23 query sets (formulations) • Simulate sessions
      Simulated session   esMPC@20   esMRC@20   esMAP
      “good”-“good”       0.378      0.036      0.122
      “good”-“bad”        0.363      0.034      0.112
      “bad”-“good”        0.271      0.023      0.083
      “bad”-“bad”         0.254      0.022      0.073
  45. 45. Conclusions • Extend the evaluation framework to sessions – Build the appropriate test collection – Rethink evaluation measures • Basic test collection • Model-free and model-based measures • Did not talk about: – Duplicate documents – Efficient computation of the measures
