Ben Carterette — Advances in Information Retrieval Evaluation

Transcript

  • 1. System Effectiveness, User Models, and User Utility: A Conceptual Framework for Investigation. Ben Carterette, University of Delaware. carteret@cis.udel.edu
  • 2. Effectiveness Evaluation
    • Determine how good the system is at finding and ranking relevant documents
    • An effectiveness measure should be correlated with the user's experience
      – Value increases when the user experience gets better; decreases when it gets worse
    • Thus the interest in effectiveness measures based on explicit models of user interaction
      – RBP [Moffat & Zobel], DCG [Järvelin & Kekäläinen], ERR [Chapelle et al.], EBU [Yilmaz et al.], sessions [Kanoulas et al.], etc.
  • 3. Discounted Gain Model
    • Simple model of user interaction:
      – User steps down ranked results one by one
      – Gains something from relevant documents
      – Increasingly less likely to see documents deeper in the ranking
    • Implementation of the model:
      – Gain is a function of relevance at rank k
      – Ranks k are increasingly discounted
      – Effectiveness = sum over ranks of gain times discount
    • Most measures can be made to fit this framework
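
A minimal sketch of the gain-times-discount pattern described on this slide. The function names and the particular gain/discount choices are illustrative, not from the talk; the example discount happens to be the DCG one introduced later.

```python
import math
from typing import Callable, Sequence

def discounted_gain(rels: Sequence[int],
                    gain: Callable[[int], float],
                    discount: Callable[[int], float]) -> float:
    """Effectiveness = sum over ranks k of gain(rel_k) * discount(k)."""
    return sum(gain(rel) * discount(k) for k, rel in enumerate(rels, start=1))

# Example: binary gain with a 1/log2(k+1) discount (the DCG choice).
rels = [1, 1, 0, 0, 1]
score = discounted_gain(rels, gain=lambda r: r,
                        discount=lambda k: 1.0 / math.log2(k + 1))
print(score)
```
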
  • 4. Rank-Biased Precision [Moffat and Zobel, TOIS08]
    (Figure: a ranked result list, ranks 1–10, for the query "black powder ammunition".)
    • Toss a biased coin (θ)
    • If HEADS, observe the next document
    • If TAILS, stop
  • 5. Rank-Biased Precision
    (Figure: the same ranked list for "black powder ammunition" with θ = 0.8; a draw of 0.532 < θ means the user continues to the next document, while a draw of 0.933 ≥ θ means the user stops.)
  • 6. Rank-Biased Precision
    (Figure: browsing model over the ranked list for "black powder ammunition": Query → View Next Item → Stop.)
  • 7. Rank-Biased Precision
    RBP = (1 - \theta) \sum_{k=1}^{\infty} rel_k \theta^{k-1} = \sum_{k=1}^{\infty} rel_k \theta^{k-1} (1 - \theta)
    • Relevance discounted by a geometric distribution
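
A minimal sketch of the RBP formula on this slide with binary relevance. The value θ = 0.8 follows the example slide; the relevance vector reuses the R/N pattern from the DCG example later in the deck.

```python
def rbp(rels, theta=0.8):
    """Rank-biased precision: (1 - theta) * sum_k rel_k * theta**(k-1)."""
    return (1.0 - theta) * sum(rel * theta ** (k - 1)
                               for k, rel in enumerate(rels, start=1))

print(rbp([1, 1, 0, 0, 1, 1, 0, 1, 0, 0]))
```
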
  • 8. Discounted Cumulative Gain [Järvelin and Kekäläinen, SIGIR00]
    Query: black powder ammunition. Discount by rank: 1/log2(r+1).
      Rank  Relevance  Gain  Discounted score
       1       R        1       1.00
       2       R        1       0.63
       3       N        0       0
       4       N        0       0
       5       R        1       0.38
       6       R        1       0.35
       7       N        0       0
       8       R        1       0.31
       9       N        0       0
      10       N        0       0
    DCG = 2.689;  NDCG = DCG / optDCG = 0.91
  • 9. Discounted Cumulative Gain
    DCG = \sum_{i=1}^{\infty} rel_i \cdot \frac{1}{\log_2(1 + i)}
    (Figure: the 1/log2(1+i) discount plotted against rank for the example ranking.)
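
A minimal sketch of binary-relevance DCG and NDCG as defined on these two slides, reproducing the worked example (DCG ≈ 2.689, NDCG ≈ 0.91). The ideal ordering used for normalization is simply the relevance vector sorted in decreasing order.

```python
import math

def dcg(rels):
    """Binary-relevance DCG with the 1/log2(i+1) discount from the slide."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]      # R R N N R R N R N N
ideal = sorted(rels, reverse=True)          # best possible ordering
print(dcg(rels))                            # ~2.689
print(dcg(rels) / dcg(ideal))               # NDCG ~0.91
```
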
  • 10. Expected Reciprocal Rank [Chapelle et al., CIKM09]
    (Figure: browsing model over the ranked list for "black powder ammunition": Query → View Next Item → Stop.)
  • 11. Expected Reciprocal Rank
    (Figure: the same browsing model with a relevance decision at each item: Query → View Next Item → Relevant? (highly / somewhat / no) → Stop.)
  • 12. Models of Browsing Behavior
    • Position-based models: the chance of observing a document depends on the position of the document in the ranked list.
    • Cascade models: the chance of observing a document depends on its position as well as the relevance of documents ranked above it.
  • 13. A More Formal Model
    • My claim: this implementation conflates at least four distinct models of user interaction
    • Formalize it a bit:
      – Change the rank discount to a stopping probability density P(k)
      – Change the gain function to either a utility function or a cost function
    • Then effectiveness = expected utility or cost over stopping points:
      M = \sum_{k=1}^{\infty} f(k) P(k)
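
A minimal sketch of the general form M = Σ_k f(k) P(k) from this slide. The function names and the finite truncation are my own; any f(k) and stopping density P(k) can be plugged in. The example instantiates RBP as expected relevance at the stopping rank under a geometric stopping density, matching the Model 1 reading that follows.

```python
def expected_over_stopping_points(utility, stop_prob, max_rank=1000):
    """Approximate M = sum_{k=1..inf} f(k) P(k) with a finite truncation."""
    return sum(utility(k) * stop_prob(k) for k in range(1, max_rank + 1))

# Example: RBP as expected relevance at the stopping rank.
theta, rels = 0.8, [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
rel = lambda k: rels[k - 1] if k <= len(rels) else 0
geom = lambda k: (1 - theta) * theta ** (k - 1)
print(expected_over_stopping_points(rel, geom))
```
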
  • 14. Our Framework
    • The components of a measure are:
      – a stopping rank probability P(k)
        • position-based vs. cascade is a feature of this distribution
      – a document utility model (binary relevance)
      – a utility accumulation model or cost model
    • We can test hypotheses about general properties of the stopping distribution and the utility/cost model
      – Instead of trying to evaluate every possible measure on its own, evaluate properties of the measure
  • 15. Model Families
    • Depending on these choices, we get four distinct families of user models
      – Each family is characterized by its utility/cost model
      – Within a family, we are free to choose P(k) and the document utility model
    • Model 1: expected utility at the stopping point
    • Model 2: expected total utility
    • Model 3: expected cost
    • Model 4: expected total utility per unit cost
  • 16. Model 1: Expected Utility at Stopping Point
    • Exemplar: Rank-Biased Precision (RBP)
      RBP = (1 - \theta) \sum_{k=1}^{\infty} rel_k \theta^{k-1} = \sum_{k=1}^{\infty} rel_k \theta^{k-1} (1 - \theta)
    • Interpretation:
      – P(k) = geometric density function
      – f(k) = relevance of the document at the stopping rank
      – Effectiveness = expected relevance at the stopping rank
  • 17. Model 2: Expected Total Utility
    • Instead of a stopping probability, think about a viewing probability:
      P(view doc at k) = \sum_{i=k}^{\infty} P(i) = F(k)
    • This fits in the discounted gain model framework:
      M = \sum_{k=1}^{\infty} rel_k F(k)
    • Does it fit in the expected utility framework?
      – Yes, and Discounted Cumulative Gain (DCG; Järvelin et al.) is the exemplar for this class
  • 18. Model 2: Expected Total Utility
    M = \sum_{k=1}^{\infty} rel_k F(k) = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} P(i)
      = \sum_{k=1}^{\infty} P(k) \sum_{i=1}^{k} rel_i = \sum_{k=1}^{\infty} R_k P(k)
    • f(k) = R_k (total summed relevance)
    • Let F_DCG(k) = 1/log2(k+1)
      – Then P_DCG(k) = F_DCG(k) - F_DCG(k+1) = 1/log2(k+1) - 1/log2(k+2)
    • Work the algebra backwards to show that you get binary-relevance DCG (if summing to infinity)
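
A small numerical check, my own rather than code from the talk, that writing DCG as an expectation of accumulated utility R_k over the stopping density P_DCG(k) = 1/log2(k+1) - 1/log2(k+2) recovers binary-relevance DCG. Because the slide's identity holds only when summing to infinity, the sketch truncates the sum and adds back the tail mass explicitly.

```python
import math

F = lambda k: 1.0 / math.log2(k + 1)       # viewing probability F_DCG(k)
P = lambda k: F(k) - F(k + 1)              # stopping density P_DCG(k)

def dcg_direct(rels):
    return sum(r * F(k) for k, r in enumerate(rels, start=1))

def dcg_as_expectation(rels, max_rank=10000):
    total, R_k = 0.0, 0
    for k in range(1, max_rank + 1):
        R_k += rels[k - 1] if k <= len(rels) else 0
        total += R_k * P(k)
    return total + R_k * F(max_rank + 1)   # tail mass beyond the truncation point

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(dcg_direct(rels), dcg_as_expectation(rels))   # both ~2.689
```
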
  • 19. Model 3: Expected Cost
    • The user stops with probability based on accumulated utility rather than rank alone
      – P(k) = P(R_k) if the document at rank k is relevant, 0 otherwise
    • Then use f(k) to model the cost of going to rank k
    • Exemplar measure: Expected Reciprocal Rank (ERR; Chapelle et al.), with binary relevance
      – P(k) = rel_k \cdot \theta^{R_k - 1} (1 - \theta)
      – 1/cost = f(k) = 1/k
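
A minimal sketch of the Model 3 / ERR form given on this slide, with binary relevance: P(k) = rel_k · θ^(R_k - 1) · (1 - θ) and f(k) = 1/k. Reading θ as the chance of continuing past a relevant document, and setting it to 0.8, are assumptions for the example.

```python
def err_binary(rels, theta=0.8):
    """M = sum_k (1/k) * rel_k * theta**(R_k - 1) * (1 - theta)."""
    score, R_k = 0.0, 0
    for k, rel in enumerate(rels, start=1):
        if rel:
            R_k += 1
            stop_prob = theta ** (R_k - 1) * (1 - theta)
            score += stop_prob / k          # f(k) = 1/k
    return score

print(err_binary([1, 1, 0, 0, 1, 1, 0, 1, 0, 0]))
```
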
  • 20. Model 4: Expected Utility per Unit Cost
    • The user considers the expected effort of further browsing after each relevant document:
      M = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i)
    • Similar to the M2 family, manipulate algebraically:
      \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i) = \sum_{k=1}^{\infty} f(k) P(k) \sum_{i=1}^{k} rel_i = \sum_{k=1}^{\infty} f(k) R_k P(k)
  • 21. Model 4: Expected Utility per Unit Cost
    • When f(k) = 1/k, we get:
      M = \sum_{k=1}^{\infty} prec@k \cdot P(k)
    • Average Precision (AP) is the exemplar for this class
      – P(k) = rel_k / R
      – utility/cost = f(k) = prec@k
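
A minimal sketch of AP in the Model 4 form on this slide: M = Σ_k prec@k · P(k) with P(k) = rel_k / R, i.e. the familiar mean of precision at the relevant ranks.

```python
def average_precision(rels):
    R = sum(rels)                              # total number of relevant documents
    if R == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += (hits / k) * (1.0 / R)    # prec@k times P(k) = rel_k / R
    return score

print(average_precision([1, 1, 0, 0, 1, 1, 0, 1, 0, 0]))
```
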
  • 22. Summary So Far
    • Four ways to turn a sum over gain times discount into an expectation over stopping ranks
      – M1, M2, M3, M4
    • Four exemplar measures from the IR literature
      – RBP, DCG, ERR, AP
    • Four stopping probability distributions
      – P_RBP, P_DCG, P_ERR, P_AP
      – Add two more:
        • P_RR(k) = 1/(k(k+1)),  P_RRR(k) = 1/(R_k(R_k+1))
  • 23. Stopping Probability Densities
    (Figure: the stopping densities P_RBP, P_DCG, P_RR, P_RRR, P_ERR, and P_AP and their cumulative forms, plotted as probability and cumulative probability against rank.)
  • 24. From Models to Measures
    • Six stopping probability distributions, four model families
    • Mix and match to create up to 24 new measures
      – Many of these are uninteresting: isomorphic to precision/recall, or constant-valued
      – 15 turn out to be interesting
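
An illustrative mix-and-match sketch, my own rather than code from the talk: pair the static stopping densities listed on slide 22 with the M1 (expected utility at the stopping point) and M2 (expected total utility) forms. The truncation at max_rank and θ = 0.8 are assumptions for the example; the dynamic densities (P_ERR, P_AP, P_RRR) would additionally depend on the relevance vector.

```python
import math

THETA, MAX_RANK = 0.8, 10000

densities = {
    "P_RBP": lambda k: (1 - THETA) * THETA ** (k - 1),
    "P_DCG": lambda k: 1 / math.log2(k + 1) - 1 / math.log2(k + 2),
    "P_RR":  lambda k: 1.0 / (k * (k + 1)),
}

def m1(rels, P):                      # expected relevance at the stopping rank
    return sum(r * P(k) for k, r in enumerate(rels, start=1))

def m2(rels, P, max_rank=MAX_RANK):   # expected total relevance seen by the stopping rank
    total, R_k = 0.0, 0
    for k in range(1, max_rank + 1):
        R_k += rels[k - 1] if k <= len(rels) else 0
        total += R_k * P(k)
    return total

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
for name, P in densities.items():
    print(name, m1(rels, P), m2(rels, P))
```
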
  • 25. Measures  
  • 26. Some Brief Asides
    • From geometric to reciprocal rank
      – Integrate the geometric density with respect to the parameter θ
      – The result is 1/(k(k+1))
      – The cumulative form is approximately 1/k
    • Normalization
      – Every measure in the M2 family must be normalized by its maximum possible value
      – Other measures may not fall between 0 and 1
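
A short check of the integration step in the first aside, under the natural assumption that θ is averaged uniformly over [0, 1]; the tail sum gives the cumulative (viewing) form.

```latex
\int_0^1 (1-\theta)\,\theta^{k-1}\,d\theta
  = \frac{1}{k} - \frac{1}{k+1}
  = \frac{1}{k(k+1)},
\qquad
\sum_{i=k}^{\infty} \frac{1}{i(i+1)} = \frac{1}{k}.
```
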
  • 27. Some Brief Asides
    • Rank cut-offs
      – The DCG formulation only works for n going to infinity
      – In reality we usually calculate DCG@K for small K
      – This fits our user model if we make a worst-case assumption about the relevance of documents below rank K
  • 28. Analyzing Measures
    • Some questions raised:
      – Are models based on utility better than models based on effort? (Hypothesis: no difference)
      – Are measures based on stopping probabilities better than measures based on viewing probabilities? (Hypothesis: the latter are more robust)
      – What properties should the stopping distribution have? (Hypothesis: fatter tail and static are more robust)
  • 29. How to Analyze Measures
    • Many possible ways, no single widely accepted one:
      – How well they correlate with user satisfaction
      – How robust they are to changes in the underlying data
      – How good they are for optimizing systems
      – How informative they are
  • 30. Fit to Click Logs
    • How well does a stopping distribution fit empirical click probabilities?
      – A click does not mean the end of a search
      – But we need some model of the stopping point, and a click is a decent proxy
    • A good fit may indicate a good stopping model
  • 31. Fit to Logged Clicks
    (Figure: the empirical click distribution together with P_RBP, P_RR, and P_DCG, plotted as probability P(k) against rank k on log-log axes.)
  • 32. Robustness and Stability
    • How robust is the measure to changes in the underlying test collection data?
      – If one of the following changes:
        • topic sample
        • relevance judgments
        • pool depth of judgments
      – how different are the decisions about relative system effectiveness?
  • 33. Data
    • Three test collections plus evaluation data:
      – TREC-6 ad hoc: 50 topics, 72,270 judgments, 550,000-document corpus; 74 runs submitted to TREC
        • Second set of judgments from Waterloo
      – TREC 2006 Terabyte named page: 180 topics, 2,361 judgments, 25M-document corpus; 43 runs submitted to TREC
      – TREC 2009 Web ad hoc: 50 topics, 18,666 judgments, 500M-document corpus; 37 runs submitted to TREC
  • 34. Experimental Methodology
    • Pick some part of the collection to vary
      – e.g. judgments, topic sample size, pool depth
    • Evaluate all submitted systems with TREC's gold-standard data
    • Evaluate all submitted systems with the modified data
    • Compare the first evaluation to the second using Kendall's tau rank correlation
    • Determine which properties are most robust
      – Model family, tail fatness, static/dynamic distribution
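
A minimal sketch of the comparison step described on this slide: rank systems by their scores under the gold-standard and modified judgments, then compute Kendall's tau between the two system orderings. The score dictionaries are illustrative placeholders, not data from the talk.

```python
from scipy.stats import kendalltau

gold_scores     = {"runA": 0.31, "runB": 0.27, "runC": 0.45, "runD": 0.22}
modified_scores = {"runA": 0.29, "runB": 0.30, "runC": 0.41, "runD": 0.20}

systems = sorted(gold_scores)                  # fixed system order
tau, p_value = kendalltau([gold_scores[s] for s in systems],
                          [modified_scores[s] for s in systems])
print(tau)   # 1.0 means the two evaluations rank systems identically
```
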
  • 35. Varying Assessments
    • Compare evaluation with TREC's judgments to evaluation with Waterloo's (Kendall's tau)
        type      P(k)     measures (tau)                                  mean
        static    P_RBP    RBP = 0.813,  RBTR = 0.816,  RBAP = 0.801       0.810
                  P_DCG    CDG = 0.831,  DCG = 0.920,   DAG = 0.819        0.857
                  P_RR     RRG = 0.819,  RR = 0.859,    RAP = 0.812        0.830
        dynamic   P_ERR    ERR = 0.829,  EPR = 0.836                       0.833
                  P_AP     ARR = 0.847,  AP = 0.896                        0.872
                  P_RRR    RRR = 0.826,  RRAP = 0.844                      0.835
        column means (M1–M4): 0.821, 0.865, 0.834, 0.835
    • Tentative conclusions:
      – M2 most robust, followed by M3 (after removing the AP outlier)
      – Fatter-tail distributions more robust
      – Dynamic a bit more robust than static
  • 36. Varying Topic Sample Size
    • Sample a subset of N topics from the original 50; evaluate systems over that set
    (Figure: mean Kendall's tau against number of topics for model families M1–M4, grouping distributions by tail fatness: fat tail P_DCG, P_AP; medium tail P_RR, P_RRR; slim tail P_RBP, P_ERR.)
  • 37. Varying Pool Depth
    • Take only judgments on documents appearing at ranks 1 to depth D in the submitted systems
      – D = 1, 2, 4, 8, 16, 32, 64
    (Figure: mean Kendall's tau against pool depth for model families M1–M4.)
  • 38. Conclusions
    • Fatter-tailed distributions are generally more robust
      – Maybe better for mitigating the risk of not satisfying tail users
    • M2 (expected total utility; DCG) is generally more robust
      – But does it model users better?
    • M3 (expected cost; ERR) is more robust than expected
    • M4 (expected utility per cost; AP) is not as robust as expected
      – AP is an outlier with a very fat tail
    • DCG may be based on a more realistic user model than commonly thought
  • 39. Conclusions
    • The gain-times-discount formulation conflates four distinct models of user behavior
    • Teasing these apart allows us to test hypotheses about general properties of measures
    • This is a conceptual framework: it organizes and describes measures in order to provide structure for reasoning about general properties
    • Hopefully it will provide directions for future research on evaluation measures
