Optimized interleaving for online retrieval evaluation
'Best Paper Award' in WSDM'13

The paper presents a generalized way to interleave two ranked lists in a retrieval task.

  • Introduction?
  • And of course, IPC
  • f(i) is proportional to 1/i
  • Need some explanation: e.g. rank*(d, A) = position of d in A, or |A|+1 if d does not appear in A. So for any pair (p, q) with p shown above q in the interleaved list: rank*(p, A) > rank*(q, A) && rank*(p, B) > rank*(q, B) is not possible, since that would mean both inputs order the pair as (d1, d2) while the interleaved list creates (d2, d1); rank*(p, A) > rank*(q, A) && rank*(p, B) < rank*(q, B) is a misorder with respect to A, but that pair is also a misorder between A and B themselves.
  • The breaking case comes when one of the rankings is preferred more often than the other; this is ruled out by the sum(p) = 0 constraint. The insensitivity comes because the weight of a position is not taken into consideration during evaluation. Property 7 is guaranteed by property 4.
  • To maximize sensitivity, we might be able to solve the problem with fewer constraints? It seems the authors enforce L != A and L != B, so that we get fewer unknown factors?
  • Hmm... A = (d1, d2, d3, d4), B = (d2, d1, d4, d5), while L_2 = (d1, d2, d4, d3). misorder(A, L_2) = {(d4, d3)}, misorder(B, L_2) = {(d1, d2)}, misorder(A, B) = {(d1, d2)}. So... misorder(A, L_2) + misorder(B, L_2) > misorder(A, B)??? Be careful: misorder(B, L_2) = {(d1, d2), (d5, d3)} and misorder(A, B) = {(d1, d2), (d3, d5), (d3, d4)}, so the counts are 1 + 2 <= 3 and there is no contradiction (see the sketch after these notes).
  • The Pearson correlation is a little too small?
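A small script to double-check misordered-pair counts like the ones in the note above. This is a sketch: the pair orientation and the rank* handling of absent documents are my reading of the notes, not code from the paper.

```python
def rank_star(d, R):
    """rank*(d, R): 1-based position of d in R, or len(R)+1 if d is absent."""
    return R.index(d) + 1 if d in R else len(R) + 1

def misordered_pairs(R, L):
    """Pairs ordered one way in L and the other way in R, over the union of documents.
    Each pair is written (earlier-in-L, later-in-L)."""
    docs = list(dict.fromkeys(list(R) + list(L)))
    return {(p, q) for p in docs for q in docs if p != q
            and rank_star(p, L) < rank_star(q, L)
            and rank_star(p, R) > rank_star(q, R)}

A  = ["d1", "d2", "d3", "d4"]
B  = ["d2", "d1", "d4", "d5"]
L2 = ["d1", "d2", "d4", "d3"]
print(len(misordered_pairs(A, L2)), len(misordered_pairs(B, L2)), len(misordered_pairs(A, B)))
# 1, 2, 3 -- so |misorder(A, L2)| + |misorder(B, L2)| <= |misorder(A, B)| holds here
```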

Optimized interleaving for online retrieval evaluation: Presentation Transcript

  • Optimized Interleaving for Online Retrieval Evaluation (Best Paper in WSDM’13) Authors: Filip Radlinski, Nick Craswell Slides by: Han Jiang
  • Agenda Basic concepts Previous algorithms Framework Invert Problem Refine Problem Theoretical benefits Illustration Evaluation Discussion
  • Basic concepts What is interleaving? Merge results from different retrieval algorithms; only a combined list is shown to the user. The quality of the algorithms can be inferred with the help of clickthrough data. (Diagram: Query → Search Engine A / Search Engine B → Source List A / Source List B → Interleaving Algorithm → Interleaved list → Clicks + Assignment → Credit function → Evaluation Result)
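As an annotation to the slide above, here is a minimal sketch of that pipeline in Python. All names (ranker_a, interleave_fn, credit_fn, click_model) are placeholders for illustration, not the paper's API.

```python
def run_interleaving_experiment(queries, ranker_a, ranker_b,
                                interleave_fn, credit_fn, click_model):
    """Aggregate per-query credit from clicks to infer which ranker users prefer."""
    wins_a = wins_b = 0
    for q in queries:
        A, B = ranker_a(q), ranker_b(q)        # source lists from the two engines
        L, assignment = interleave_fn(A, B)    # combined list shown to the user
        clicks = click_model(q, L)             # observed clickthrough on L
        credit = credit_fn(A, B, L, assignment, clicks)
        if credit > 0:
            wins_a += 1
        elif credit < 0:
            wins_b += 1
    return wins_a, wins_b                      # e.g. A preferred on wins_a queries
```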
  • Basic concepts + Ah, that’s easy… how about: interleaving method = pick the best results from each algorithm? Wait… how do we know whether d1 is better than d4? OK, then toss a coin instead, and credit function = if di is clicked and ranked higher in ranker A, prefer A. Urgh… when a user randomly clicks on (d1, d2, d3), A is always preferred…
  • Basic concepts ++ So, what is a good interleaving algorithm? Intuitively*, a good one should: Be blind to user. Be blind to retrieval functions. Be robust to biases in the user’s decision process (that do not relate to retrieval quality). Not substantially alter the search experience. Lead to clicks that reflect the user’s preference. [*] Joachims, Optimizing Search Engines Using Clickthrough Data, KDD’02
  • Agenda Basic concepts √ Previous algorithms Framework Invert Problem Refine Problem Theoretical benefits Illustration Evaluation Discussion
  • Previous Algorithms Balanced Interleaving: toss a coin once, then pick the best remaining items from the two lists in turn. Team Draft Interleaving: toss a coin every round (every two picks); the winner picks its best remaining item first. Probabilistic Interleaving: toss a coin at every position and sample an item from the winner; a weight function ensures that a document at a higher rank has a higher probability of being picked.
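A minimal sketch of Team Draft interleaving and its credit counting, as I understand the description above (not the authors' code):

```python
import random

def team_draft_interleave(A, B):
    """Each round a coin decides which ranker picks first; each ranker then
    contributes its highest-ranked document not already in the list."""
    L, team_a, team_b = [], set(), set()
    while True:
        a_picks = (len(team_a) < len(team_b)) or \
                  (len(team_a) == len(team_b) and random.random() < 0.5)
        source, team = (A, team_a) if a_picks else (B, team_b)
        doc = next((d for d in source if d not in L), None)
        if doc is None:  # that ranker is exhausted; try the other one
            source, team = (B, team_b) if a_picks else (A, team_a)
            doc = next((d for d in source if d not in L), None)
            if doc is None:
                return L, team_a, team_b
        L.append(doc)
        team.add(doc)

def team_draft_credit(team_a, team_b, clicks):
    """+1 per click on A's picks, -1 per click on B's; the sign gives the winner."""
    return sum(1 for d in clicks if d in team_a) - sum(1 for d in clicks if d in team_b)
```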
  • Previous Algorithms + About credit functions: only documents that are clicked by users are considered. Worked example with A = (d1, d2, d3, d4), B = (d4, d1, d2, d3), interleaved list M = (d1, d4, d2, d3), and clicks on d1 and d3: Balanced Interleaving (coin = A): A wins. Team Draft Interleaving (coin = AA): tie. Probabilistic Interleaving (possible coins = AA, AB): A wins with p = 100%.
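To make the Balanced Interleaving column of this example concrete, here is a sketch of one common formulation of its credit rule (Joachims-style scoring; details may differ from the paper's exact presentation):

```python
def balanced_credit(A, B, M, clicks):
    """k = the smaller of the two input ranks of the lowest click in M;
    then compare how many clicks fall in the top-k of A vs. the top-k of B."""
    def rank(d, R):                              # 1-based, len(R)+1 if absent
        return R.index(d) + 1 if d in R else len(R) + 1
    c_max = max(clicks, key=lambda d: rank(d, M))        # click ranked lowest in M
    k = min(rank(c_max, A), rank(c_max, B))
    h_a = sum(1 for d in clicks if rank(d, A) <= k)
    h_b = sum(1 for d in clicks if rank(d, B) <= k)
    return h_a - h_b                                     # > 0: A wins, < 0: B wins

A, B = ["d1", "d2", "d3", "d4"], ["d4", "d1", "d2", "d3"]
M = ["d1", "d4", "d2", "d3"]                             # balanced interleave, coin = A
print(balanced_credit(A, B, M, clicks=["d1", "d3"]))     # 1, i.e. A wins, as on the slide
```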
  • Agenda Basic concepts √ Previous algorithms √ Framework Invert Problem Refine Problem Theoretical benefits Illustration Evaluation Discussion
  • Invert the problem Why the previous algorithms are not good enough: Balanced interleaving & Team Draft interleaving are biased; even a random click on a document produces a winner. Probabilistic interleaving degrades the user experience, e.g. A = (d1, d2), B = (d1, d2), but M = (d2, d1) can still be shown. Therefore, the problem of interleaving should be more constrained, and a good way is to start from the principles…
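The bias claim for Balanced interleaving can be checked by brute force: enumerate both coin outcomes and every possible single random click, and see whether the credits average out. This sketch reuses balanced_credit from the annotation above; the pair of rankings is an illustrative example, not one taken from the paper.

```python
def balanced_interleave(A, B, a_first):
    """Balanced interleaving: one coin toss decides which ranking leads, then the
    next-unseen document is taken from the two lists in turn."""
    L, ka, kb = [], 0, 0
    while ka < len(A) and kb < len(B):
        if ka < kb or (ka == kb and a_first):
            if A[ka] not in L:
                L.append(A[ka])
            ka += 1
        else:
            if B[kb] not in L:
                L.append(B[kb])
            kb += 1
    return L

A, B = ["a", "b", "c"], ["b", "c", "a"]
tally = {"A": 0, "B": 0, "tie": 0}
for a_first in (True, False):                 # both coin outcomes, equally likely
    M = balanced_interleave(A, B, a_first)
    for doc in M:                             # a random user clicks one doc uniformly
        credit = balanced_credit(A, B, M, [doc])
        tally["A" if credit > 0 else "B" if credit < 0 else "tie"] += 1
print(tally)   # an uneven tally means random clicking already "prefers" one ranker
```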
  • Refine the problem Again, what is a good interleaving algorithm? Be blind to user. Be blind to retrieval functions. Be robust to biases in the user’s decision process (that do not relate to retrieval quality) Not substantially alter the search experience (show one of the rankings, or a ranking “in between” the two) Lead to clicks that reflect the user’s preference: If document d is clicked, the input ranker that ranked d higher is given more credit A randomly clicking user doesn’t create a preference for either ranker Be sensitive to input data (fewest user queries show significant preference)
  • Refine the problem + Again, what is a good interleaving algorithm? Be blind to user. Be blind to retrieval functions. Be robust to biases in the user’s decision process (that do not relate to retrieval quality) Not substantially alter the search experience (show one of the rankings, or a ranking “in between” the two) Lead to clicks that reflect the user’s preference: If document d is clicked, the input ranker that ranked d higher is given more credit A randomly clicking user doesn’t create a preference for either ranker Be sensitive to input data (fewest user queries show significant preference)
  • Refine the problem ++ Not substantially alter the search experience (show one of the rankings, or a ranking “in between” the two): e.g. A = (d1, d2), B = (d1, d2) ⇒ M = (d1, d2). Lead to clicks that reflect the user’s preference: if document d is clicked, the input ranker that ranked d higher is given more credit; a randomly clicking user doesn’t create a preference for either ranker. (The formula on this slide labels: a possible interleaved list under the previous constraints, the length of the list, the number of clicks, and the score function that assigns credit to A when > 0 and to B otherwise.)
  • Refine the problem +++ Be sensitive to input data (fewest user queries show significant preference)
  • Refine the problem ++++ So the constraint is: (formula on slide) And the target is: (formula on slide) With variable: the definition of the credit function (next slide: δ)
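The constraint formula did not survive the transcript, so here is a hedged reconstruction as code. My reading of the requirement is: for every prefix length k, the expected credit of a user who clicks uniformly at random in the top k must be zero under the chosen distribution over interleaved lists.

```python
def unbiased_to_random_clicks(lists_with_probs, delta, tol=1e-9):
    """lists_with_probs: [(L_i, p_i), ...] candidate interleaved lists and probabilities.
    delta(d): per-document credit (> 0 counts towards A, < 0 towards B).
    Returns True if a randomly clicking user creates no preference at any depth k."""
    length = len(lists_with_probs[0][0])
    for k in range(1, length + 1):
        expected = sum(p * sum(delta(d) for d in L[:k]) for L, p in lists_with_probs)
        if abs(expected) > tol:
            return False
    return True
```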
  • Define the credit (prediction) function δ. Linear rank difference: (formula on slide). Inverse rank: (formula on slide). Since it is an optimization problem, the existence of a solution should be guaranteed theoretically, while in the paper it is only guaranteed empirically.
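A sketch of the two credit functions named on the slide, using the rank* convention from the speaker notes (a document absent from a ranking gets rank |R|+1). The exact sign convention here is my assumption, with positive values crediting A.

```python
def rank_star(d, R):
    """rank*(d, R): 1-based position of d in R, or len(R)+1 if d is absent."""
    return R.index(d) + 1 if d in R else len(R) + 1

def linear_rank_delta(d, A, B):
    # linear rank difference: positive when d is ranked higher (closer to the top) in A
    return rank_star(d, B) - rank_star(d, A)

def inverse_rank_delta(d, A, B):
    # inverse rank: same sign convention, but differences near the top weigh more
    return 1.0 / rank_star(d, A) - 1.0 / rank_star(d, B)
```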
  • Theoretical Benefits PROPERTY 1: Balanced interleaving ⊆ This framework PROPERTY 2: Team Draft interleaving ⊆ This framework PROPERTY 3: This framework ⊆ Probabilistic interleaving PROPERTY 4: The merged list is something “in between” the two
  • Theoretical Benefits + PROPERTY 5: The breaking case of Balanced interleaving is avoided PROPERTY 6: The insensitivity of Team Draft interleaving is improved PROPERTY 7: Probabilistic interleaving degrades the user experience more
  • Illustration An option to pursue is sensitivity. L1 is unbiased towards a random user: 3*25% + (-1)*(35% + 40%) = 0. Note: the number of constraints is 5, but the number of unknowns is 6? (It is a maximization problem, and the goal is to maximize Σ p_i * sensitivity(L_i).)
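If each candidate list's sensitivity is treated as a precomputed score, the objective on this slide (maximize Σ p_i * sensitivity(L_i) subject to the random-click constraints and Σ p_i = 1) is a linear program. A hedged sketch with SciPy; the inputs are placeholders, not the paper's exact setup.

```python
import numpy as np
from scipy.optimize import linprog

def solve_interleaving_probabilities(lists, sensitivity, delta):
    """lists: candidate interleaved rankings (all of the same length)
    sensitivity: one precomputed sensitivity score per list
    delta(d): per-document credit, positive towards A and negative towards B."""
    n, length = len(lists), len(lists[0])
    # one unbiasedness constraint per prefix depth k, plus "probabilities sum to 1"
    A_eq = [[sum(delta(d) for d in L[:k]) for L in lists] for k in range(1, length + 1)]
    A_eq.append([1.0] * n)
    b_eq = [0.0] * length + [1.0]
    res = linprog(c=-np.asarray(sensitivity, dtype=float),   # linprog minimizes, so negate
                  A_eq=np.asarray(A_eq), b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return res.x if res.success else None
```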
  • Agenda Basic concepts √ Previous algorithms √ Framework √ Invert Problem √ Refine Problem √ Theoretical benefits √ Illustration √ Evaluation Discussion
  • Evaluation: summary Construct a dataset to simulate interleaving and user interaction. Evaluate the Pearson correlation between each pair of algorithms. Analyze cases where the algorithms disagree. Evaluate result quality with expert judgments. Analyze bias and sensitivity across the algorithms.
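For the correlation step, something along these lines would be computed per pair of algorithms (the scores below are made-up placeholders, only illustrating the mechanics):

```python
from scipy.stats import pearsonr

# Hypothetical per-query preference scores (e.g. mean credit towards A) from two
# interleaving algorithms run over the same queries.
scores_alg1 = [0.20, -0.10, 0.40, 0.00, 0.30]
scores_alg2 = [0.25, -0.05, 0.35, 0.10, 0.20]
r, p_value = pearsonr(scores_alg1, scores_alg2)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```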
  • Evaluation +: construction of the dataset Collect all queries, as well as their top-4 results, from a search engine. Since the web and the algorithm keep changing, there are many distinct rankings for the same query. For each query, make sure there are at least 4 distinct rankings, each shown to users at least 10 times, with at least 1 click. The most frequent ranking is taken as A, and the most dissimilar one as B. Further filter the log so that rankings that could have been produced by either Balanced interleaving or Team Draft interleaving are frequent.
  • Evaluation ++
  • Evaluation +++
  • Evaluation ++++ Bias comparison among different algorithms
  • Evaluation +++++ Sensitivity comparison among different algorithms
  • Agenda Basic concepts √ Previous algorithms √ Framework √ Invert Problem √ Refine Problem √ Theoretical benefits √ Illustration √ Evaluation √ Discussion
  • Discussion Contributions of this paper: it inverts the question of obtaining interleaving algorithms into a constrained optimization problem; the solution is very intuitive and general; and there are many interesting examples illustrating the breaking cases of previous approaches. Note: the evaluation is simulated on logs from only one search engine, while for interleaving we would expect an evaluation based on different search engines; that may also be why the human evaluation results are not good for any of the algorithms.
  • Discussion + “A and B are not shown to users as they have low sensitivity” This is intuitive; however, it seems to conflict with the result shown in Table 1: (a,b,c,d) has sensitivity 0.83, which is high?
  • Thank You !