
# Optimized interleaving for online retrieval evaluation


'Best Paper Award' in WSDM'13

The paper presents a generalized way to interleave two ranked lists in a retrieval task.

### Slide notes
• Introduction?
• And of course, IPC
• f(i) is proportional to 1/i
• Needs some explanation, e.g. rank*(d, A) = the position of d in A, or |A|+1 if d does not appear in A.
For any pair <i, j> with i < j, pick the pair p = L_i, q = L_j from L. The possible cases are:
rank*(p, A) <= rank*(q, A) && rank*(p, B) <= rank*(q, B): no misorder with respect to A or B.
rank*(p, A) > rank*(q, A) && rank*(p, B) > rank*(q, B): not possible; it would mean both A and B order the pair as (q, p) while L orders it as (p, q).
rank*(p, A) > rank*(q, A) && rank*(p, B) <= rank*(q, B): a misorder with respect to A, and A and B also disagree on the pair.
rank*(p, A) <= rank*(q, A) && rank*(p, B) > rank*(q, B): a misorder with respect to B, and A and B also disagree on the pair.
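The rank* bookkeeping above can be sketched in Python (a minimal sketch; the names `rank_star` and `classify_pair` are mine, not from the paper):

```python
def rank_star(d, ranking):
    # Position of d in ranking (1-based), or len(ranking) + 1 if d is absent.
    return ranking.index(d) + 1 if d in ranking else len(ranking) + 1

def classify_pair(p, q, A, B):
    # Classify a pair (p shown above q in the interleaved list L)
    # against the two input rankings A and B.
    p_above_in_a = rank_star(p, A) <= rank_star(q, A)
    p_above_in_b = rank_star(p, B) <= rank_star(q, B)
    if p_above_in_a and p_above_in_b:
        return "no misorder"
    if not p_above_in_a and not p_above_in_b:
        return "impossible for a list 'in between' A and B"
    return "misorder w.r.t. one input; A and B also disagree"
```

For example, with A = (d1, d2, d3, d4) and B = (d2, d1, d4, d5), the pair (d1, d2) falls into the last case: A and B disagree on it, so L must misorder it with respect to exactly one input.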
• The breaking case arises when one of the rankings is preferred more often than the other; this is ruled out by the sum(p) = 0 constraint.
The insensitivity arises because the weight of each position is not taken into consideration during evaluation.
Property 7 is guaranteed by Property 4.
• To maximize sensitivity, might we be able to solve the problem with fewer constraints?
It seems that the authors enforce L != A and L != B, so that we get fewer unknown factors?
• Hmm... A = (d1, d2, d3, d4), B = (d2, d1, d4, d5), while L_2 = (d1, d2, d4, d3).
At first glance, misorder(A, L_2) = {(d4, d3)}, misorder(B, L_2) = {(d1, d2)}, misorder(A, B) = {(d1, d2)}.
So... misorder(A, L_2) + misorder(B, L_2) > misorder(A, B)???
Be careful: counting absent documents via rank*, misorder(B, L_2) = {(d1, d2), (d5, d3)} and misorder(A, B) = {(d1, d2), (d3, d5), (d3, d4)}.
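This counterexample's bookkeeping is easy to get wrong by hand; a small script (my own sketch, counting unordered pairs over the union of documents via rank*) confirms the corrected counts:

```python
from itertools import combinations

def rank_star(d, ranking):
    # Position of d (1-based), or len(ranking) + 1 if d is absent.
    return ranking.index(d) + 1 if d in ranking else len(ranking) + 1

def misorder(X, Y):
    # Unordered document pairs that X and Y rank in opposite relative order,
    # taken over the union of the two lists.
    docs = sorted(set(X) | set(Y))
    return {frozenset((p, q)) for p, q in combinations(docs, 2)
            if (rank_star(p, X) - rank_star(q, X)) *
               (rank_star(p, Y) - rank_star(q, Y)) < 0}

A  = ["d1", "d2", "d3", "d4"]
B  = ["d2", "d1", "d4", "d5"]
L2 = ["d1", "d2", "d4", "d3"]

counts = (len(misorder(A, L2)), len(misorder(B, L2)), len(misorder(A, B)))
print(counts)  # (1, 2, 3): |misorder(A, L2)| + |misorder(B, L2)| equals |misorder(A, B)|
```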
• The Pearson correlation is a little too small?
### Transcript

1. Optimized Interleaving for Online Retrieval Evaluation (best paper at WSDM'13). Authors: Filip Radlinski, Nick Craswell. Slides by: Han Jiang
2. Agenda: Basic concepts, Previous algorithms, Framework (invert the problem, refine the problem), Theoretical benefits, Illustration, Evaluation, Discussion
3. Basic concepts. What is interleaving? Merge the results from different retrieval algorithms so that only a combined list is shown to the user; the relative quality of the algorithms can then be inferred with the help of clickthrough data. Pipeline on the slide: a query goes to Search Engine A and Search Engine B, producing Source List A and Source List B; the interleaving algorithm merges them into the interleaved list (with an assignment of each document to a source); user clicks feed a credit function, which yields the evaluation result.
4. Basic concepts +. Ah, that's easy... how about: interleaving = pick the best results from each algorithm? Wait... how do we know whether d1 is better than d4? OK, then toss a coin instead, and let the credit function be: if di is clicked and ranked higher by ranker A, prefer A. Urgh... when a user randomly clicks on (d1, d2, d3), A is always preferred...
5. Basic concepts ++. So, what is a good interleaving algorithm? Intuitively*, a good one should: be blind to the user; be blind to the retrieval functions; be robust to biases in the user's decision process (that do not relate to retrieval quality); not substantially alter the search experience; and lead to clicks that reflect the user's preference. [*] Joachims, Optimizing Search Engines Using Clickthrough Data, KDD'02
6. Agenda: Basic concepts √, Previous algorithms, Framework (invert the problem, refine the problem), Theoretical benefits, Illustration, Evaluation, Discussion
7. Previous Algorithms. Balanced interleaving: toss a coin once, then pick the best remaining items in turn. Team Draft interleaving: toss a coin every two picks; the winner picks its best remaining item first. Probabilistic interleaving: toss a coin every time and sample an item from the winner; a weight function ensures that a document at a higher rank has a higher probability of being picked.
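The first of these can be sketched in a few lines (a minimal sketch of Balanced interleaving following Joachims' pointer scheme; the function name is mine):

```python
def balanced_interleave(A, B, a_first):
    # Balanced interleaving: one coin flip (a_first) decides which ranker
    # leads on ties; otherwise the ranker whose pointer lags contributes
    # its next not-yet-shown document.
    L, ka, kb = [], 0, 0
    while ka < len(A) and kb < len(B):
        if ka < kb or (ka == kb and a_first):
            if A[ka] not in L:
                L.append(A[ka])
            ka += 1
        else:
            if B[kb] not in L:
                L.append(B[kb])
            kb += 1
    return L

# Reproduces the merged list from the worked example on the next slide (coin = A):
print(balanced_interleave(["d1", "d2", "d3", "d4"],
                          ["d4", "d1", "d2", "d3"], a_first=True))
# ['d1', 'd4', 'd2', 'd3']
```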
8. Previous Algorithms +. About credit functions: only documents that are clicked by users are considered. Worked example with A = (d1, d2, d3, d4), B = (d4, d1, d2, d3), merged list M = (d1, d4, d2, d3), and clicks on d1 and d3: Balanced interleaving (coin = A): A wins. Team Draft interleaving (coin = AA): tie. Probabilistic interleaving (possible coins AA, AB): A wins with p = 100%.
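Team Draft interleaving can be sketched similarly (my own minimal version; it records which team contributed each document so that clicks can later be credited):

```python
import random

def team_draft(A, B, rng):
    # Team Draft interleaving: each round, a coin decides which ranker picks
    # first; each ranker then contributes its highest-ranked document that
    # is not already in the merged list.
    merged, teams = [], []
    while any(d not in merged for d in A + B):
        order = [("A", A), ("B", B)] if rng.random() < 0.5 else [("B", B), ("A", A)]
        for name, src in order:
            pick = next((d for d in src if d not in merged), None)
            if pick is not None:
                merged.append(pick)
                teams.append(name)
    return merged, teams

merged, teams = team_draft(["d1", "d2", "d3", "d4"],
                           ["d4", "d1", "d2", "d3"], random.Random(0))
# Credit: a click on merged[i] counts for teams[i].
```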
9. 9. Agenda Basic concepts √ Previous algorithms √ Framework Invert Problem Refine Problem Theoretical benefits Illustration Evaluation Discussion
10. Invert the problem. Why previous algorithms are not good enough: Balanced interleaving and Team Draft interleaving are biased; even a random click on the documents raises up a winner. Probabilistic interleaving degrades the user experience: e.g. with A = (d1, d2) and B = (d1, d2), M = (d2, d1) is still possible. Therefore the problem of interleaving should be more constrained, and a good way is to start from the principles...
11. Refine the problem. Again, what is a good interleaving algorithm? Be blind to the user. Be blind to the retrieval functions. Be robust to biases in the user's decision process (that do not relate to retrieval quality). Not substantially alter the search experience (show one of the rankings, or a ranking "in between" the two). Lead to clicks that reflect the user's preference: if document d is clicked, the input ranker that ranked d higher is given more credit, and a randomly clicking user doesn't create a preference for either ranker. Be sensitive to the input data (the fewest user queries needed to show a significant preference).
12. Refine the problem +. Again, what is a good interleaving algorithm? Be blind to the user. Be blind to the retrieval functions. Be robust to biases in the user's decision process (that do not relate to retrieval quality). Not substantially alter the search experience (show one of the rankings, or a ranking "in between" the two). Lead to clicks that reflect the user's preference: if document d is clicked, the input ranker that ranked d higher is given more credit, and a randomly clicking user doesn't create a preference for either ranker. Be sensitive to the input data (the fewest user queries needed to show a significant preference).
13. Refine the problem ++. Not substantially alter the search experience (show one of the rankings, or a ranking "in between" the two): e.g. A = (d1, d2), B = (d1, d2) forces M = (d1, d2). Lead to clicks that reflect the user's preference: if document d is clicked, the input ranker that ranked d higher is given more credit, and a randomly clicking user doesn't create a preference for either ranker. The slide's formula ranges over the possible interleaved lists allowed by the previous constraints, the length of the list, and the number of clicks, with a score function that assigns credit to A when positive and to B otherwise.
14. Refine the problem +++. Be sensitive to the input data (the fewest user queries needed to show a significant preference).
15. Refine the problem ++++. So the constraint is: [equation on slide]. And the target is: [equation on slide]. With variables: [equation on slide], given the definition of the credit function.
16. Define the credit function δ: linear rank difference, or inverse rank. Since this is an optimization problem, the existence of a solution should be guaranteed theoretically, while in the paper it is only guaranteed empirically.
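The two candidate δ definitions can be written out as follows (a sketch, assuming "linear rank difference" means δ(d) = rank*(d, B) - rank*(d, A) and "inverse rank" means δ(d) = 1/rank*(d, A) - 1/rank*(d, B), with rank* as defined in the notes above):

```python
def rank_star(d, ranking):
    # Position of d (1-based), or len(ranking) + 1 if d is absent.
    return ranking.index(d) + 1 if d in ranking else len(ranking) + 1

def delta_linear(d, A, B):
    # Linear rank difference: positive when A ranks d higher, so the
    # credit for a click on d goes to A; negative means it goes to B.
    return rank_star(d, B) - rank_star(d, A)

def delta_inverse(d, A, B):
    # Inverse rank: same sign convention, but rank differences near the
    # top of the lists carry more weight.
    return 1.0 / rank_star(d, A) - 1.0 / rank_star(d, B)
```

For example, with A = (d1, d2) and B = (d2, d1), a click on d1 credits A with 1 under the linear variant and 0.5 under the inverse variant.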
17. Theoretical Benefits. PROPERTY 1: Balanced interleaving ⊆ this framework. PROPERTY 2: Team Draft interleaving ⊆ this framework. PROPERTY 3: This framework ⊆ Probabilistic interleaving. PROPERTY 4: The merged list is something "in between" the two.
18. Theoretical Benefits +. PROPERTY 5: The breaking case of Balanced interleaving is eliminated. PROPERTY 6: The insensitivity of Team Draft interleaving is improved. PROPERTY 7: Probabilistic interleaving degrades the user experience more.
19. Illustration. One option to pursue is sensitivity. L1 is unbiased towards a random user: 3*25% + (-1)*(35% + 40%) = 0. Note: the number of constraints is 5, but there are 6 unknowns? (It is a maximization problem, and the goal is to maximize sum_i p_i * sensitivity(L_i).)
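The slide's unbiasedness arithmetic can be checked directly. This is one reading of the numbers, which I am assuming: the three candidate lists are shown with probabilities 25%, 35%, and 40%, and a single random click yields net credit 3, -1, and -1 respectively, so the expected credit is zero:

```python
p = [0.25, 0.35, 0.40]   # probabilities of showing each candidate list (from the slide)
credit = [3, -1, -1]     # net credit of one random click on each list (from the slide)

expected_credit = sum(pi * ci for pi, ci in zip(p, credit))
assert abs(sum(p) - 1) < 1e-12          # p is a probability distribution
assert abs(expected_credit) < 1e-12     # 3*0.25 - 1*(0.35 + 0.40) = 0: unbiased
```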
20. Agenda: Basic concepts √, Previous algorithms √, Framework √ (invert the problem √, refine the problem √), Theoretical benefits √, Illustration √, Evaluation, Discussion
21. Evaluation: summary. Construct a dataset to simulate interleaving and user interaction. Evaluate the Pearson correlation between each pair of algorithms. Analyze the cases where the algorithms disagree. Evaluate result quality with expert judgments. Analyze bias and sensitivity across the algorithms.
22. Evaluation +: construction of the dataset. Collect all queries as well as their top-4 results from a search engine. Since the web and the algorithm keep changing, there are many distinct rankings for the same query. For each query, make sure there are at least 4 distinct rankings, each shown to users at least 10 times, with at least 1 click. The most frequent ranking is regarded as A, and a most dissimilar one is regarded as B. Further filter the log so that the results produced by either Balanced interleaving or Team Draft interleaving are frequent.
23. Evaluation ++
24. Evaluation +++
25. Evaluation ++++. Bias comparison among the different algorithms
26. Evaluation +++++. Sensitivity comparison among the different algorithms
27. Agenda: Basic concepts √, Previous algorithms √, Framework √ (invert the problem √, refine the problem √), Theoretical benefits √, Illustration √, Evaluation √, Discussion
28. Discussion. Contributions of this paper: it inverts the question of obtaining an interleaving algorithm into a constrained optimization problem; the solution is very intuitive and general; and many interesting examples illustrate the breaking cases of previous approaches. Note: the evaluation is simulated on logs from only one search engine. For interleaving, we would expect an evaluation based on different search engines; that may be why the human evaluation results are not good across all algorithms.
29. Discussion +. "A and B are not shown to users as they have low sensitivity." This is intuitive; however, it seems to contradict the result shown in Table 1: (a, b, c, d) has sensitivity 0.83, which is high?
30. Thank you!