Evangelos Kanoulas — Advances in Information Retrieval Evaluation
  1. Evaluating Multi-Query Sessions
     Evangelos Kanoulas*, Ben Carterette+, Paul Clough*, Mark Sanderson$
     * University of Sheffield, UK; + University of Delaware, USA; $ RMIT University, Australia
  2. Why sessions?
     • The current evaluation framework
       – assesses the effectiveness of systems over one-shot queries
     • Users reformulate their initial query
     • Still fine if …
       – optimizing a system for one-shot queries led to optimal performance over an entire session
  3. Why sessions?
     When was the DuPont Science Essay Contest created?
     Initial query: DuPont Science Essay Contest
     Reformulation: When was the DSEC created?
     • e.g. retrieval systems should accumulate information along a session
  4. Extend the evaluation framework: from one-query evaluation to multi-query session evaluation
  5. Construct appropriate test collections. Rethink evaluation measures.
  6. What is the appropriate collection?
  7. Test collections we built…
     • Text REtrieval Conference (TREC)
       – sponsored by NIST
       – many competitions; among them the Session Track 2010, 2011, …
  8. Test collection we built in 2010…
     • Corpus: ClueWeb09
       – 1 billion web pages (5TB compressed)
     • Queries and reformulations
       – 150 query pairs: initial query, reformulation
       – 3 types of reformulations (not disclosed to participants)
         • Specification (52 query pairs)
         • Generalization (48 query pairs)
         • Drifting / Parallel Reformulation (50 query pairs)
  9. Some criticism…
     • Artificial reformulations
     • Short reformulations
       – just 2 queries
     • No other user interaction data
       – clicks, dwell times, etc.
     • Reformulations are static (do not depend on the SE's response)
       – The collection does not allow early abandonment
       – The reformulation itself does not change based on the SE's response
  10. Test collection in 2011
     • Corpus: ClueWeb09
       – 1 billion web pages (5TB compressed)
     • Queries and reformulations
       – Real users searching ClueWeb09
       – 76 sessions of 2 up to 10 reformulations
     • Other interactions
       – Clicks, dwell times, mouse movements, relevance judgments
     • But… reformulations are still static
  11. Basic test collection
     • A set of information needs: What do we know about black powder ammunition?
       – A static sequence of m queries
         Initial query: black powder ammunition
         1st reformulation: black powder wiki
         2nd reformulation: gun powder wiki
         …
         (m-1)th reformulation: history of gunpowder
  12. Experiment
     [Figure: three ranked lists (ranks 1 to 10, …), one per query: "black powder ammunition", "black powder wiki", "gun powder wiki"]
  13. Evaluation over a single ranked list
     [Figure: the same three ranked lists (ranks 1 to 10, …) for "black powder ammunition", "black powder wiki", "gun powder wiki"]
  14. Construct appropriate test collections. Rethink evaluation measures.
  15. What is a good system?
  16. How can we measure "goodness"?
  17. Measuring "goodness"
     The user steps down a ranked list of documents and observes each one of them until a decision point, and then either
       a) abandons the search, or
       b) reformulates.
     While stepping down or sideways, the user accumulates utility.
  18. What are the challenges?
  19. Evaluation over multiple ranked lists
     [Figure: three ranked lists (ranks 1 to 10, …), one per query: "black powder ammunition", "black powder wiki", "gun powder wiki"]
  20. Existing measures
     • Session DCG [Järvelin et al., ECIR 2008]
       The user steps down the ranked list until rank k and reformulates [deterministic; no early abandonment]
     • Expected session utility [Yang and Lad, ICTIR 2009]
       The user steps down a ranked list of documents until a decision point and reformulates [stochastic; no early abandonment]
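The deterministic user model behind Session DCG can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: it assumes the commonly cited form in which each query's DCG at rank k is divided by 1 + log_bq(q), where q is the query's position in the session. The function names, the log bases b and bq, and the toy gains are assumptions made for illustration.

```python
import math


def dcg(gains, b=2.0):
    """DCG of one ranked list; ranks below the log base b are not discounted."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain / max(1.0, math.log(rank, b))
    return total


def session_dcg(session_gains, k, b=2.0, bq=4.0):
    """Sum of per-query DCG@k, each discounted by the query's position q.

    Assumed discount: 1 + log_bq(q), so the first query is not discounted.
    """
    total = 0.0
    for q, gains in enumerate(session_gains, start=1):
        query_discount = 1.0 + math.log(q, bq)
        total += dcg(gains[:k], b) / query_discount
    return total


# Hypothetical two-query session with binary gains (1 = relevant).
toy_session = [[0, 1, 1], [1, 0, 0]]
score = session_dcg(toy_session, k=3)
```

Because every query is cut at the same rank k and all queries are always issued, the sketch has no notion of early abandonment, which is the limitation the slide points out.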
  21. Evaluating over paths
     • Optimize: model-free measures
     • Integrate out: model-based measures
  22. Evaluation measures
     • Evaluating over paths
     • Model-free measures
     • Model-based measures
  23. Model-free measures
     The user is an oracle that knows when to reformulate.
     • Ω(k,j): paths of length k, ending at reformulation j
     • Count the number of relevant documents on the optimal path ω of length k ending at query j
  24. Model-free measures
     Example session, ranks 1 to 10 of each query (R = relevant, N = non-relevant):
       Q1: N N N N N N N N N N …
       Q2: R R R R R N N N N N …
       Q3: R R R R R R R R R R …
     ω(10,3): length 10, ending at the 3rd query
     Define: Precision@k,j; Recall@k,j; Precision@recall,j
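One way to make the oracle concrete is a small dynamic program over paths. The sketch below rests on assumptions the slides do not spell out: a path of length k ending at query j views the top n_i documents of each query i = 1..j, with every n_i at least 1 and the n_i summing to k, and duplicate documents are ignored. The function names are hypothetical. The data is the slide's example session: Q1 has no relevant documents in its top 10, Q2 has relevant documents at ranks 1 to 5, and Q3 has relevant documents at all ten ranks.

```python
def optimal_relevant_count(session_rels, k, j):
    """Max number of relevant documents on any path of length k ending at query j.

    Returns None when no such path exists (for example k < j).
    """
    neg = float("-inf")
    best = [neg] * (k + 1)  # best[t]: max relevant docs after viewing t docs
    best[0] = 0
    for ranking in session_rels[:j]:
        # prefix[n] = number of relevant docs in the top n of this ranking
        prefix = [0]
        for rel in ranking:
            prefix.append(prefix[-1] + rel)
        new_best = [neg] * (k + 1)
        for used in range(k + 1):
            if best[used] == neg:
                continue
            # View at least one document of this query before moving on.
            for n in range(1, min(len(ranking), k - used) + 1):
                candidate = best[used] + prefix[n]
                if candidate > new_best[used + n]:
                    new_best[used + n] = candidate
        best = new_best
    return None if best[k] == neg else best[k]


def precision_at(session_rels, k, j):
    """Precision@k,j: relevant documents on the optimal path divided by k."""
    count = optimal_relevant_count(session_rels, k, j)
    return None if count is None else count / k


# The example session from the slide (1 = R, 0 = N).
slide_session = [
    [0] * 10,           # Q1
    [1] * 5 + [0] * 5,  # Q2
    [1] * 10,           # Q3
]
p_10_3 = precision_at(slide_session, k=10, j=3)
```

Under these assumptions the best path of length 10 ending at the third query looks at one document from each of Q1 and Q2 and spends the remaining eight on Q3, which yields nine relevant documents.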
  25. Model-free measures
     [Figure: the example session (Q1, Q2, Q3) beside a 3-D plot with axes precision, recall and reformulation]
  26. Model-free measures
     [Figure: the example session beside three precision-recall curves (ranking 1, ranking 2, ranking 3), both axes from 0.0 to 1.0]
  27. Model-free measures
     [Figure: the example session (Q1, Q2, Q3) beside a 3-D plot with axes precision, recall and reformulation]
  28. Evaluation measures
     • Evaluating over paths
     • Model-free measures
     • Model-based measures
  29. Model-based measures
     Probabilistic space of users following different paths
     • Ω is the space of all paths
     • P(ω) is the probability of a user following a path ω in Ω
     • M_ω is a measure over a path ω
     esM = \sum_{\omega \in \Omega} P(\omega) \, M_\omega   [Yang and Lad, ICTIR 2009]
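The expectation over paths translates directly into code once the paths can be enumerated. The sketch below is illustrative only: it represents a path as a tuple of viewing depths (one per issued query), uses precision over the viewed documents as the per-path measure M_ω, and takes the path probabilities from a hand-written table. All names and numbers are assumptions, not values from the talk.

```python
def path_precision(path, session_rels):
    """Per-path measure M_w: fraction of viewed documents that are relevant."""
    viewed = [
        rel
        for depth, ranking in zip(path, session_rels)
        for rel in ranking[:depth]
    ]
    return sum(viewed) / len(viewed)


def expected_session_measure(path_probs, session_rels, measure=path_precision):
    """esM: sum over paths of P(path) * M(path)."""
    return sum(
        prob * measure(path, session_rels) for path, prob in path_probs.items()
    )


# Hypothetical two-query session (1 = relevant) and a made-up path distribution.
toy_session = [[0, 1, 1], [1, 1, 0]]
toy_path_probs = {
    (2,): 0.5,    # view 2 docs of query 1, then abandon
    (1, 2): 0.3,  # view 1 doc of query 1, reformulate, view 2 docs of query 2
    (3, 1): 0.2,  # view 3 docs of query 1, reformulate, view 1 doc of query 2
}
es_precision = expected_session_measure(toy_path_probs, toy_session)
```

Swapping `path_precision` for another per-path measure gives the corresponding expected session measure without changing the outer sum.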
  30. Model browsing behavior: position-based models
     The chance of observing a document depends on the position of the document in the ranked list.
     [Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
  31. Rank Biased Precision [Moffat and Zobel, TOIS 2008]
     [Figure: flow chart over the ranked list for "black powder ammunition" with the states Query, View Next Item and Stop]
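As a concrete instance of a position-based model, Rank Biased Precision assumes the user moves from one rank to the next with a fixed persistence probability p and stops otherwise. A minimal sketch follows; the function name and the example values are chosen here for illustration.

```python
def rank_biased_precision(rels, p=0.8):
    """RBP = (1 - p) * sum_i rel_i * p**(i - 1), with ranks i starting at 1."""
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(rels))


# Hypothetical ranked list with relevant documents at ranks 1 and 3.
rbp = rank_biased_precision([1, 0, 1], p=0.5)
```

The weight of a document depends only on its rank, never on what was seen above it, which is what distinguishes this family from the cascade-based models.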
  32. Model browsing behavior: cascade-based models
     The chance of observing a document depends on the position of the document in the ranked list and the relevance of documents/snippets already viewed.
     [Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
  33. Expected Reciprocal Rank [Chapelle et al., CIKM 2009]
     [Figure: flow chart over the ranked list for "black powder ammunition" with the states Query, View Next Item, Relevant? (highly / somewhat / no) and Stop]
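The cascade idea can be illustrated with Expected Reciprocal Rank, where the chance of stopping at a document grows with its relevance grade and the chance of reaching it shrinks with the relevance of everything above it. The sketch below uses the usual grade-to-probability mapping (2^g - 1) / 2^g_max and assumes grades 0 (no), 1 (somewhat) and 2 (highly), matching the three branches on the slide. The function name and example list are illustrative.

```python
def expected_reciprocal_rank(grades, max_grade=2):
    """ERR = sum_r (1 / r) * R_r * prod_{i < r} (1 - R_i)."""
    err = 0.0
    p_reach = 1.0  # probability that the user gets as far as the current rank
    for rank, grade in enumerate(grades, start=1):
        p_stop = (2 ** grade - 1) / 2 ** max_grade  # R_r: user is satisfied here
        err += p_reach * p_stop / rank
        p_reach *= 1 - p_stop
    return err


# Hypothetical list: highly relevant, not relevant, somewhat relevant.
err = expected_reciprocal_rank([2, 0, 1])
```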
  34. Expected Browsing Utility [Yilmaz et al., CIKM 2010]
     D_{EBU}(r) = P(E_r) \cdot P(C \mid R_r)
     EBU = \sum_{r=1}^{n} D_{EBU}(r) \cdot R_r
  35. Probability of a path
     Joint probability of
       (1) abandoning at reformulation 2, and
       (2) reformulating at rank 3 of the first query
     [Figure: the example session (Q1, Q2, Q3) with the path marked]
  36. Probability of a path
     (1) Probability of abandoning at reformulation 2
       ×
     (2) Probability of reformulating at rank 3 of the first query
     [Figure: the example session (Q1, Q2, Q3) with the path marked]
  37. (1) Probability of abandoning the session at reformulation i: geometric with parameter p_reform
     [Figure: the example session (Q1, Q2, Q3)]
  38. (1) Probability of abandoning the session at reformulation i: truncated geometric with parameter p_reform
     [Figure: the example session (Q1, Q2, Q3)]
  39. (1) Truncated geometric with parameter p_reform
     (2) Probability of reformulating at rank j (of reformulations 1 to i-1): geometric with parameter p_down
     [Figure: the example session (Q1, Q2, Q3)]
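The two distributions combine into a path probability roughly as sketched below. The exact parametrization is an assumption, since the slides only name the distributions: here the user moves on to the next query with probability p_reform, the session is truncated at the m queries available in the collection, and within a query the user steps down one more rank with probability p_down before reformulating. The index i is read as the position of the last query issued. All function names are hypothetical.

```python
def p_abandon_at(i, m, p_reform):
    """Truncated geometric: probability that query i (1-based) is the last one issued."""
    if i < m:
        return p_reform ** (i - 1) * (1 - p_reform)
    return p_reform ** (m - 1)  # all remaining mass falls on the final query


def p_reformulate_at_rank(j, p_down):
    """Geometric: probability of reformulating after viewing exactly j documents."""
    return p_down ** (j - 1) * (1 - p_down)


def path_probability(reform_ranks, m, p_reform, p_down):
    """P(path) = P(abandon at query i) * product of P(reformulate at rank j).

    reform_ranks holds the rank at which the user left each of the
    first i - 1 queries, so i = len(reform_ranks) + 1.
    """
    i = len(reform_ranks) + 1
    prob = p_abandon_at(i, m, p_reform)
    for j in reform_ranks:
        prob *= p_reformulate_at_rank(j, p_down)
    return prob


# The path from the slides: leave the first query at rank 3, abandon at the second.
p = path_probability([3], m=3, p_reform=0.5, p_down=0.8)
```

Truncating the geometric keeps the abandonment probabilities summing to one over the m queries that the static collection actually contains.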
  40. Model-based measures
     Probabilistic space of users following different paths
     • Ω is the space of all paths
     • P(ω) is the probability of a user following a path ω in Ω
     • M_ω is a measure over a path ω
     esM = \sum_{\omega \in \Omega} P(\omega) \, M_\omega
  41. Evaluation measures
     • Evaluating over paths
     • Model-free measures
     • Model-based measures
  42. Evaluation measures
     • Properties
       – How do the new measures correlate with previously introduced ones?
       – Do they behave as expected, i.e. do they reward early retrieval of relevant documents?
  43. Correlations
     • TREC 2010 Session track
       – nsDCG vs. esNDCG: Kendall's tau = 0.7972
       – nsDCG vs. esAP: Kendall's tau = 0.5247
     [Figure: two scatter plots, esNDCG against nsDCG and esAP against nsDCG]
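Kendall's tau compares how two measures order the same set of systems. The sketch below computes the simple tau-a variant (no tie correction) from paired per-system scores. The function name is hypothetical and the scores are made-up illustration data, not the TREC 2010 results, and the slides do not say which tau variant was used.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs of systems."""
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(scores_a, scores_b), 2):
        sign = (a1 - a2) * (b1 - b2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(scores_a)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores of four systems under two measures.
measure_a = [0.10, 0.15, 0.20, 0.12]
measure_b = [0.05, 0.07, 0.08, 0.04]
tau = kendall_tau(measure_a, measure_b)
```

A tau near 1 means the two measures rank systems almost identically, while lower values indicate that the second measure captures something the first does not.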
  44. Reward early retrieval
     • TREC9 Query track
       – 50 topics and 23 query sets (formulations)
     • Simulate sessions

                        esMPC@20   esMRC@20   esMAP
       "good"-"good"    0.378      0.036      0.122
       "good"-"bad"     0.363      0.034      0.112
       "bad"-"good"     0.271      0.023      0.083
       "bad"-"bad"      0.254      0.022      0.073
  45. Conclusions
     • Extended the evaluation framework to sessions
       – Built the appropriate test collection
       – Rethought the evaluation measures
     • Basic test collection
     • Model-free and model-based measures
     • Did not talk about:
       – Duplicate documents
       – Efficient computation of the measures