Ben Carterette — Advances in Information Retrieval Evaluation
 

Presentation Transcript

• System Effectiveness, User Models, and User Utility: A Conceptual Framework for Investigation
  Ben Carterette, University of Delaware
  carteret@cis.udel.edu

• Effectiveness Evaluation
  • Determine how good the system is at finding and ranking relevant documents
  • An effectiveness measure should be correlated with the user's experience
    – Value increases when the user experience gets better; decreases when it gets worse
  • Hence the interest in effectiveness measures based on explicit models of user interaction
    – RBP [Moffat & Zobel], DCG [Järvelin & Kekäläinen], ERR [Chapelle et al.], EBU [Yilmaz et al.], sessions [Kanoulas et al.], etc.

• Discounted Gain Model
  • Simple model of user interaction:
    – The user steps down the ranked results one by one
    – Gains something from relevant documents
    – Is increasingly less likely to see documents deeper in the ranking
  • Implementation of the model:
    – Gain is a function of relevance at rank k
    – Ranks k are increasingly discounted
    – Effectiveness = sum over ranks of gain times discount (sketched in code below)
  • Most measures can be made to fit this framework

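A minimal sketch of the gain-times-discount sum the slide describes. The particular gain and discount functions here (binary gain, a 1/log2(k+1) discount) are illustrative choices, not prescribed by the slide; the example ranking is the one used in the DCG example later in the deck.

```python
import math

def discounted_gain(relevances, gain, discount):
    """Effectiveness = sum over ranks k of gain(rel_k) * discount(k)."""
    return sum(gain(rel) * discount(k)
               for k, rel in enumerate(relevances, start=1))

# Illustrative choices: binary gain and a DCG-style log discount.
ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]   # relevance at ranks 1..10
score = discounted_gain(ranking,
                        gain=lambda rel: rel,
                        discount=lambda k: 1 / math.log2(k + 1))
print(round(score, 3))  # 2.689, matching the DCG example later in the deck
```
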
• Rank-Biased Precision [Moffat and Zobel, TOIS 2008]
  • Toss a biased coin (θ)
  • If HEADS, observe the next document
  • If TAILS, stop
  [Figure: a ranked result list (ranks 1-10, …) for the query "black powder ammunition"]

• Rank-Biased Precision
  • Example with θ = 0.8: draw a random number at each rank; continue while the draw is below θ (e.g. 0.532 < θ) and stop when it is not (e.g. 0.933 ≥ θ)
  [Figure: the ranked list for "black powder ammunition" annotated with the draws at each rank]

• Rank-Biased Precision
  [Figure: the RBP browsing model as a state diagram: Query → View Next Item (repeat) → Stop]

• Rank-Biased Precision

  RBP = (1 - \theta) \sum_{k=1}^{\infty} rel_k \, \theta^{k-1}
      = \sum_{k=1}^{\infty} rel_k \, \theta^{k-1} (1 - \theta)

  • Relevance is discounted by a geometric distribution

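A small sketch of RBP exactly as defined above; the ranking and θ = 0.8 (the value from the earlier example) are illustrative inputs.

```python
def rbp(relevances, theta):
    """RBP = (1 - theta) * sum_k rel_k * theta^(k-1)."""
    return (1 - theta) * sum(rel * theta ** (k - 1)
                             for k, rel in enumerate(relevances, start=1))

ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(rbp(ranking, theta=0.8))  # deeper relevant documents contribute less
```
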
• Discounted Cumulative Gain [Järvelin and Kekäläinen, SIGIR 2000]
  Query: "black powder ammunition"

  Rank   Relevance   Gain   Discounted score
    1    R           1      1
    2    R           1      0.63
    3    N           0      0
    4    N           0      0
    5    R           1      0.38
    6    R           1      0.35
    7    N           0      0
    8    R           1      0.31
    9    N           0      0
   10    N           0      0

  • Discount by rank: 1 / \log_2(r+1)
  • DCG = 2.689
  • NDCG = DCG / optDCG = 0.91

• Discounted Cumulative Gain

  DCG = \sum_{i=1}^{\infty} rel_i \cdot \frac{1}{\log_2(1+i)}

  [Figure: the 1/\log_2(1+i) discount curve over ranks 1-10 for the example ranking R R N N R R N R N N]

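A sketch reproducing the worked example from the previous slide: binary gain with the 1/log2(1+i) discount as given, and NDCG normalized by the ideal (relevance-sorted) reordering of the same list.

```python
import math

def dcg(relevances):
    """DCG = sum_i rel_i / log2(1 + i), with ranks i starting at 1."""
    return sum(rel / math.log2(1 + i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the DCG of the ideal (relevance-sorted) ranking."""
    return dcg(relevances) / dcg(sorted(relevances, reverse=True))

ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]   # R R N N R R N R N N
print(round(dcg(ranking), 3), round(ndcg(ranking), 2))  # 2.689 0.91
```
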
• Expected Reciprocal Rank [Chapelle et al., CIKM 2009]
  [Figure: browsing model for the query "black powder ammunition": Query → View Next Item (repeat) → Stop]

• Expected Reciprocal Rank
  [Figure: extended browsing model: after viewing each item the user judges "Relevant?" (highly / somewhat / no); sufficient relevance leads to Stop, otherwise the user views the next item]

• Models of Browsing Behavior
  • Position-based models: the chance of observing a document depends on the position of the document in the ranked list.
  • Cascade models: the chance of observing a document depends on its position as well as the relevance of the documents ranked above it.

• A More Formal Model
  • My claim: this implementation conflates at least four distinct models of user interaction
  • Formalize it a bit:
    – Change the rank discount to a stopping probability density P(k)
    – Change the gain function to either a utility function or a cost function
  • Then effectiveness = expected utility or cost over stopping points (see the sketch below):

  M = \sum_{k=1}^{\infty} f(k) P(k)

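The formal model as a higher-order function. The instantiation plugged in below (geometric stopping density, utility = relevance at the stopping rank) is the RBP interpretation described a few slides later, used here only to show the shape of the framework; the truncation point n is an assumption needed to approximate the infinite sum.

```python
def expected_value(f, P, n):
    """M = sum_{k=1}^{n} f(k) * P(k), truncating the infinite sum at n."""
    return sum(f(k) * P(k) for k in range(1, n + 1))

# RBP-style instantiation: geometric stopping density, utility = relevance
# of the document at the stopping rank.
theta = 0.8
ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
P = lambda k: theta ** (k - 1) * (1 - theta)              # P(k): stop at rank k
f = lambda k: ranking[k - 1] if k <= len(ranking) else 0  # f(k): rel at rank k
print(expected_value(f, P, n=1000))  # approximately RBP at theta = 0.8
```
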
• Our Framework
  • The components of a measure are:
    – the stopping rank probability P(k)
      • position-based vs. cascade is a feature of this distribution
    – the document utility model (binary relevance)
    – the utility accumulation model or cost model
  • We can test hypotheses about general properties of the stopping distribution and the utility/cost model
    – Instead of trying to evaluate every possible measure on its own, evaluate properties of the measure

• Model Families
  • Depending on these choices, we get four distinct families of user models
    – Each family is characterized by its utility/cost model
    – Within a family, there is freedom to choose P(k) and the document utility model
  • Model 1: expected utility at stopping point
  • Model 2: expected total utility
  • Model 3: expected cost
  • Model 4: expected total utility per unit cost

• Model 1: Expected Utility at Stopping Point
  • Exemplar: Rank-Biased Precision (RBP)

  RBP = (1 - \theta) \sum_{k=1}^{\infty} rel_k \, \theta^{k-1}
      = \sum_{k=1}^{\infty} rel_k \, \theta^{k-1} (1 - \theta)

  • Interpretation:
    – P(k) = the geometric density function
    – f(k) = relevance of the document at the stopping rank
    – Effectiveness = expected relevance at the stopping rank

• Model 2: Expected Total Utility
  • Instead of stopping probability, think about viewing probability:

  P(\text{view doc at } k) = \sum_{i=k}^{\infty} P(i) = F(k)

  • This fits in the discounted gain model framework:

  M = \sum_{k=1}^{\infty} rel_k F(k)

  • Does it fit in the expected utility framework?
    – Yes, and Discounted Cumulative Gain (DCG; Järvelin et al.) is the exemplar for this class

• Model 2: Expected Total Utility

  M = \sum_{k=1}^{\infty} rel_k F(k)
    = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} P(i)
    = \sum_{k=1}^{\infty} P(k) \sum_{i=1}^{k} rel_i
    = \sum_{k=1}^{\infty} R_k P(k)

  • f(k) = R_k (total summed relevance)
  • Let F_DCG(k) = 1/\log_2(k+1)
    – Then P_DCG(k) = F_DCG(k) - F_DCG(k+1) = 1/\log_2(k+1) - 1/\log_2(k+2)
  • Work the algebra backwards to show that this recovers binary-relevance DCG (if summing to infinity); a numeric check follows below

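A numeric check of this identity on an arbitrary example ranking. One detail the check has to handle (my addition, not on the slide): truncating the stopping-form sum at rank n leaves a tail of R_n · F_DCG(n+1), which is added back so both forms agree exactly for a finite ranking.

```python
import math

F = lambda k: 1 / math.log2(k + 1)   # F_DCG(k), viewing probability
P = lambda k: F(k) - F(k + 1)        # P_DCG(k), stopping probability

ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
n = len(ranking)
R = [0]
for rel in ranking:
    R.append(R[-1] + rel)            # R_k = rel_1 + ... + rel_k

viewing_form = sum(rel * F(k) for k, rel in enumerate(ranking, start=1))
stopping_form = sum(R[k] * P(k) for k in range(1, n + 1)) + R[n] * F(n + 1)
print(round(viewing_form, 6) == round(stopping_form, 6))  # True
```
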
• Model 3: Expected Cost
  • The user stops with probability based on accumulated utility rather than rank alone
    – P(k) = P(R_k) if the document at rank k is relevant, 0 otherwise
  • Then f(k) models the cost of going to rank k
  • Exemplar measure: Expected Reciprocal Rank (ERR; Chapelle et al.), with binary relevance (sketched below)
    – P(k) = rel_k \cdot \theta^{R_k - 1} (1 - \theta)
    – 1/cost = f(k) = 1/k

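A sketch of binary-relevance ERR exactly as parameterized on this slide, with P(k) = rel_k · θ^(R_k − 1)(1 − θ) and f(k) = 1/k; the θ value is an arbitrary example.

```python
def err_binary(relevances, theta):
    """ERR = sum_k (1/k) * rel_k * theta^(R_k - 1) * (1 - theta),
    where R_k counts the relevant documents at ranks 1..k."""
    score, R = 0.0, 0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            R += 1
            score += (1 / k) * theta ** (R - 1) * (1 - theta)
    return score

print(err_binary([1, 1, 0, 0, 1, 1, 0, 1, 0, 0], theta=0.8))
```
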
• Model 4: Expected Utility per Unit Cost
  • The user considers the expected effort of further browsing after each relevant document:

  M = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i)

  • Similar to the M2 family, manipulate algebraically:

  \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i)
    = \sum_{k=1}^{\infty} f(k) P(k) \sum_{i=1}^{k} rel_i
    = \sum_{k=1}^{\infty} f(k) R_k P(k)

• Model 4: Expected Utility per Unit Cost
  • When f(k) = 1/k, we get:

  M = \sum_{k=1}^{\infty} \text{prec@}k \cdot P(k)

  • Average Precision (AP) is the exemplar for this class (sketched below)
    – P(k) = rel_k / R
    – utility/cost = f(k) = prec@k

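AP in this expected-utility-per-unit-cost form: summing prec@k against the stopping density P(k) = rel_k / R, which reduces to the familiar average-precision formula. The ranking is an arbitrary example; the zero-relevance guard is my addition.

```python
def average_precision(relevances):
    """AP = sum_k prec@k * P(k), with P(k) = rel_k / R."""
    R = sum(relevances)            # total number of relevant documents
    if R == 0:
        return 0.0                 # guard: no relevant documents (assumption)
    score, R_k = 0.0, 0
    for k, rel in enumerate(relevances, start=1):
        R_k += rel
        score += (R_k / k) * (rel / R)   # P(k) nonzero only at relevant docs
    return score

print(round(average_precision([1, 1, 0, 0, 1, 1, 0, 1, 0, 0]), 3))
```
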
• Summary So Far
  • Four ways to turn a sum over gain times discount into an expectation over stopping ranks
    – M1, M2, M3, M4
  • Four exemplar measures from the IR literature
    – RBP, DCG, ERR, AP
  • Four stopping probability distributions
    – P_RBP, P_DCG, P_ERR, P_AP
    – Add two more:
      • P_RR(k) = 1/(k(k+1)), P_RRR(k) = 1/(R_k(R_k+1))

• Stopping Probability Densities
  [Figure: probability and cumulative probability of stopping, by rank (1-25), for the six densities: P_RBP(k) = \theta^{k-1}(1-\theta); P_DCG(k) = 1/\log_2(k+1) - 1/\log_2(k+2); P_RR(k) = 1/(k(k+1)); P_RRR(k) = 1/(R_k(R_k+1)); P_ERR(k) = rel_k \theta^{R_k-1}(1-\theta); P_AP(k) = rel_k/R]

• From Models to Measures
  • Six stopping probability distributions, four model families
  • Mix and match to create up to 24 new measures
    – Many of these are uninteresting: isomorphic to precision/recall, or constant-valued
    – 15 turn out to be interesting

    • Measures  
• Some Brief Asides
  • From geometric to reciprocal rank (worked below)
    – Integrate the geometric density with respect to the parameter θ
    – The result is 1/(k(k+1))
    – The cumulative form is approximately 1/k
  • Normalization
    – Every measure in the M2 family must be normalized by its maximum possible value
    – Other measures may not fall between 0 and 1

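As a quick check of the first aside: integrating the geometric stopping density over a uniform prior on θ is a standard Beta integral, and the tail sum of the result telescopes to the cumulative form.

```latex
\int_0^1 \theta^{k-1} (1 - \theta)\, d\theta
  = \frac{1}{k} - \frac{1}{k+1}
  = \frac{1}{k(k+1)},
\qquad
\sum_{i=k}^{\infty} \frac{1}{i(i+1)}
  = \sum_{i=k}^{\infty} \left( \frac{1}{i} - \frac{1}{i+1} \right)
  = \frac{1}{k}.
```
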
• Some Brief Asides
  • Rank cut-offs
    – The DCG formulation only works for n going to infinity
    – In reality we usually calculate DCG@K for small K
    – This fits our user model if we make a worst-case assumption about the relevance of documents below rank K

• Analyzing Measures
  • Some questions raised:
    – Are models based on utility better than models based on effort? (Hypothesis: no difference)
    – Are measures based on stopping probabilities better than measures based on viewing probabilities? (Hypothesis: the latter are more robust)
    – What properties should the stopping distribution have? (Hypothesis: fatter tail, static more robust)

• How to Analyze Measures
  • There are many possible ways, none widely accepted:
    – How well they correlate with user satisfaction
    – How robust they are to changes in the underlying data
    – How good they are for optimizing systems
    – How informative they are

• Fit to Click Logs
  • How well does a stopping distribution fit empirical click probabilities?
    – A click does not mean the end of a search
    – But we need some model of the stopping point, and a click is a decent proxy
  • A good fit may indicate a good stopping model

• Fit to Logged Clicks
  [Figure: log-log plot of stopping probability P(k) (1e-06 to 1e-02) against rank k (1 to 500), comparing the empirical click distribution with P_RBP(k) = \theta^{k-1}(1-\theta), P_RR(k) = 1/(k(k+1)), and P_DCG(k) = 1/\log_2(k+1) - 1/\log_2(k+2)]

• Robustness and Stability
  • How robust is the measure to changes in the underlying test collection data?
    – If one of the following changes:
      • topic sample
      • relevance judgments
      • pool depth of judgments
    – how different are the decisions about relative system effectiveness?

• Data
  • Three test collections plus evaluation data:
    – TREC-6 ad hoc: 50 topics, 72,270 judgments, 550,000-document corpus; 74 runs submitted to TREC
      • A second set of judgments from Waterloo
    – TREC 2006 Terabyte named page: 180 topics, 2,361 judgments, 25M-document corpus; 43 runs submitted to TREC
    – TREC 2009 Web ad hoc: 50 topics, 18,666 judgments, 500M-document corpus; 37 runs submitted to TREC

• Experimental Methodology
  • Pick some part of the collection to vary (e.g. judgments, topic sample size, pool depth)
  • Evaluate all submitted systems with TREC's gold-standard data
  • Evaluate all submitted systems with the modified data
  • Compare the first evaluation to the second using Kendall's tau rank correlation (see the sketch below)
  • Determine which properties are most robust
    – Model family, tail fatness, static/dynamic distribution

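A sketch of the comparison step, assuming each submitted run's effectiveness score under both judgment sets is already computed; the scores below are made-up examples, and scipy.stats.kendalltau supplies the rank correlation.

```python
from scipy.stats import kendalltau

# Hypothetical per-run scores: the same submitted systems evaluated first
# with TREC's gold-standard data, then with the modified data.
gold_eval     = [0.41, 0.38, 0.35, 0.33, 0.29, 0.27, 0.22]
modified_eval = [0.39, 0.40, 0.31, 0.34, 0.27, 0.24, 0.23]

tau, _ = kendalltau(gold_eval, modified_eval)
print(f"Kendall's tau = {tau:.3f}")  # tau = 1.0 would mean identical orderings
```
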
• Varying Assessments
  • Compare evaluation with TREC's judgments to evaluation with Waterloo's
  • Kendall's tau between the two evaluations, by stopping distribution (row means in parentheses):
    – static P_RBP: RBP = 0.813, RBTR = 0.816, RBAP = 0.801 (0.810)
    – static P_DCG: CDG = 0.831, DCG = 0.920, DAG = 0.819 (0.857)
    – static P_RR: RR = 0.859, RRG = 0.819, RAP = 0.812 (0.830)
    – dynamic P_ERR: ERR = 0.829, EPR = 0.836 (0.833)
    – dynamic P_AP: ARR = 0.847, AP = 0.896 (0.872)
    – dynamic P_RRR: 0.826, RRAP = 0.844 (0.835)
  • Tentative conclusions:
    – M2 most robust, followed by M3 (after removing the AP outlier)
    – Fatter-tailed distributions more robust
    – Dynamic a bit more robust than static

• Varying Topic Sample Size
  • Sample a subset of N topics from the original 50; evaluate systems over that set
  [Figure: mean Kendall's tau (0.5-1.0) against number of topics (10-40) for M1-M4; fat tail: P_DCG, P_AP; medium tail: P_RR, P_RRR; slim tail: P_RBP, P_ERR]

• Varying Pool Depth
  • Take only the judgments on documents appearing at ranks 1 to depth D in submitted systems
    – D = 1, 2, 4, 8, 16, 32, 64
  [Figure: mean Kendall's tau (0.5-1.0) against pool depth (1-64) for M1-M4]

• Conclusions
  • Fatter-tailed distributions are generally more robust
    – Maybe better for mitigating the risk of not satisfying tail users
  • M2 (expected total utility; DCG) is generally more robust
    – But does it model users better?
  • M3 (expected cost; ERR) is more robust than expected
  • M4 (expected utility per cost; AP) is not as robust as expected
    – AP is an outlier with a very fat tail
  • DCG may be based on a more realistic user model than commonly thought

• Conclusions
  • The gain-times-discount formulation conflates four distinct models of user behavior
  • Teasing these apart allows us to test hypotheses about general properties of measures
  • This is a conceptual framework: it organizes and describes measures in order to provide structure for reasoning about general properties
  • Hopefully it will provide directions for future research on evaluation measures