Need some explanation: e.g. rank*(d, A) = position of d in A, or |A|+1 if d does not appear in A. So for any pair <i,j> with i<j, pick a pair from L as p = L_i, q = L_j. One of four cases must hold:
rank*(p, A) <= rank*(q, A) and rank*(p, B) <= rank*(q, B): no misorder in A or B
rank*(p, A) > rank*(q, A) and rank*(p, B) > rank*(q, B): not possible — it would mean both inputs order (d1,d2) yet L shows (d2,d1)
rank*(p, A) > rank*(q, A) and rank*(p, B) <= rank*(q, B): misorder in L, and A and B also disagree with each other
rank*(p, A) <= rank*(q, A) and rank*(p, B) > rank*(q, B): misorder in L, and A and B also disagree with each other
The breaking case arises when one of the rankings is preferred more often than the other even under random clicking; this is ruled out by the sum(p) = 0 constraint. The insensitivity arises because the weight of a position is not taken into consideration during evaluation. Property 7 is guaranteed by property 4.
To maximize sensitivity, might we be able to solve the problem with fewer constraints? It seems that the authors enforce L != A and L != B, so that we get fewer unknown factors?
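The rank* definition and the four-case analysis in the notes above can be sketched as follows (the function names are mine, not the paper's):

```python
def rank_star(d, ranking):
    """rank*(d, A): 1-based position of d in A, or |A|+1 if d is absent."""
    return ranking.index(d) + 1 if d in ranking else len(ranking) + 1

def misorder_case(p, q, A, B):
    """Classify a pair (p, q), with p shown before q in the interleaved list."""
    in_a = rank_star(p, A) <= rank_star(q, A)
    in_b = rank_star(p, B) <= rank_star(q, B)
    if in_a and in_b:
        return "no misorder in A or B"
    if not in_a and not in_b:
        # both inputs order q before p, yet L shows p first
        return "impossible"
    return "misorder; A and B also disagree"
```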
1. Optimized Interleaving for Online Retrieval Evaluation
(Best paper in WSDM’13)
Author: Filip Radlinski,
Nick Craswell
Slides By: Han Jiang
2. Agenda
Basic concepts
Previous algorithms
Framework
Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
3. Basic concepts
What is interleaving?
Merge results from different retrieval algorithms.
Only a combined list is shown to user.
The quality of the algorithms can be inferred with the help of
clickthrough data.
Interleaved list
Search Engine A
Search Engine B
Source List A
Query
Source List B
Interleaving Algorithm
Assignment
Clicks
Credit function
Evaluation
Result
4. Basic concepts +
Ah, that’s easy… how about:
Interleaving method = pick the best results from each algorithm?
Wait… how do we know whether d1 is better than d4?
OK, then toss a coin instead, and
Credit function = if di is clicked and ranked higher by ranker A, prefer A.
Urgh… when a user randomly clicks on (d1, d2, d3), A is always preferred…
5. Basic concepts ++
So, what is a good interleaving algorithm?
Intuitively*, a good one should:
Be blind to user. Be blind to retrieval functions.
Be robust to biases in the user’s decision process (that do not relate to retrieval quality)
Not substantially alter the search experience
Lead to clicks that reflect the user’s preference
[*] Joachims, Optimizing Search Engines Using Clickthrough Data, KDD’02
6. Agenda
Basic concepts √
Previous algorithms
Framework
Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
7. Previous Algorithms
Balanced Interleaving
toss a coin once, then pick the best remaining item from each ranker in turn
Team Draft Interleaving
toss a coin every two picks; the winner picks its best remaining item first
Probabilistic Interleaving
toss a coin for every pick, and sample an item from the winner
A weight function ensures that a doc at a higher rank
has a higher probability of being picked
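The Team Draft procedure above can be sketched as follows (a minimal sketch, assuming A and B rank the same document set; the structure is mine, not the paper's):

```python
import random

def team_draft(A, B, rng=None):
    """Team Draft interleaving sketch: before each pair of picks, a coin
    decides which ranker picks first; each ranker then appends its
    highest-ranked document not already in the interleaved list."""
    rng = rng or random.Random()
    L, team = [], {}
    while len(L) < len(A):
        order = [('A', A), ('B', B)] if rng.random() < 0.5 else [('B', B), ('A', A)]
        for name, ranking in order:
            if len(L) >= len(A):
                break
            pick = next(d for d in ranking if d not in L)
            L.append(pick)
            team[pick] = name  # remember which team contributed the doc
    return L, team
```

With A = (d1, d2, d3, d4) and B = (d4, d1, d2, d3), every coin sequence produces a list starting with d1 or d4, and each document appears exactly once.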
8. Previous Algorithms +
For the credit functions, only documents that are clicked by users
are considered
Balanced Interleaving (coin=A)
A: d1 d2 d3 d4
B: d4 d1 d2 d3
M: d1 d4 d2 d3
clicks on: d1 d3 → A wins
Team Draft Interleaving (coin=AA)
A: d1 d2 d3 d4
B: d4 d1 d2 d3
M: d1 d4 d2 d3
clicks on: d1 d3 → tie
Probabilistic Interleaving (possible coin=AA, AB)
A: d1 d2 d3 d4
B: d4 d1 d2 d3
M: d1 d4 d2 d3
clicks on: d1 d3 → A wins with p=100%
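The balanced-interleaving scoring used in the first example can be sketched as follows (my reconstruction of Joachims' rule, not code from the paper): let k be the smaller of the lowest-ranked clicked document's ranks in A and B, then compare how many clicked documents fall in the top k of each input.

```python
def balanced_credit(clicks, A, B):
    """Compare click counts in the top-k of A and B, where k is the
    smaller of the lowest clicked document's ranks in the two inputs."""
    k = min(max(A.index(c) for c in clicks),
            max(B.index(c) for c in clicks)) + 1
    a = sum(1 for c in clicks if A.index(c) < k)
    b = sum(1 for c in clicks if B.index(c) < k)
    return 'A' if a > b else ('B' if b > a else 'tie')
```

With A = (d1, d2, d3, d4), B = (d4, d1, d2, d3) and clicks on d1 and d3, this yields a win for A, matching the slide; note that a single random click on d2 would also credit A, which is exactly the bias discussed on the next slide.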
9. Agenda
Basic concepts √
Previous algorithms √
Framework
Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
10. Invert the problem
Why previous algorithms are not good enough:
Balanced interleaving & Team Draft interleaving: biased
Even a random click on a document can produce a winner.
Probabilistic interleaving: degrades the user experience
e.g. A=(d1, d2), B=(d1, d2), but M = (d2, d1) can still be shown
Therefore, the problem of interleaving should be more constrained
A good way is to start from the principles…
11. Refine the problem
Again, what is a good interleaving algorithm?
Be blind to user. Be blind to retrieval functions.
Be robust to biases in the user’s decision process (that do not relate
to retrieval quality)
Not substantially alter the search experience (show one of the rankings,
or a ranking “in between” the two)
Lead to clicks that reflect the user’s preference:
If document d is clicked, the input ranker that ranked d higher is given more credit
A randomly clicking user doesn’t create a preference for either ranker
Be sensitive to input data (fewest user queries show significant preference)
12. Refine the problem +
Again, what is a good interleaving algorithm?
Be blind to user. Be blind to retrieval functions.
Be robust to biases in the user’s decision process (that do not relate
to retrieval quality)
Not substantially alter the search experience (show one of the rankings,
or a ranking “in between” the two)
Lead to clicks that reflect the user’s preference:
If document d is clicked, the input ranker that ranked d higher is given more credit
A randomly clicking user doesn’t create a preference for either ranker
Be sensitive to input data (fewest user queries show significant preference)
13. Refine the problem ++
Not substantially alter the search experience (show one of the
rankings, or a ranking “in between” the two)
A=(d1, d2), B=(d1,d2), M = (d1, d2)
Lead to clicks that reflect the user’s preference:
If document d is clicked, the input ranker that ranked d higher is given more credit
A randomly clicking user doesn’t create a preference for either ranker
a possible interleaved list under the previous constraints
[Formula on the original slide, not recoverable here: a score function over the clicked results, parameterized by the length of the list and the number of clicks; when the score is > 0, credit is assigned to A, otherwise to B]
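The slide's score rule (when the score is positive, A gets the credit, otherwise B) can be sketched generically; δ below is a placeholder for whichever per-document credit function is plugged in, and the slide's normalization by list length and click count is omitted:

```python
def preference(clicks, A, B, delta):
    """Sum a per-document credit over the clicked results; a positive
    total assigns the win to A, a negative one to B."""
    score = sum(delta(d, A, B) for d in clicks)
    return 'A' if score > 0 else ('B' if score < 0 else 'tie')

# A simple illustrative delta: rank difference (ranked higher in A -> credit A).
def rank_diff(d, A, B):
    return B.index(d) - A.index(d)
```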
14. Refine the problem +++
Be sensitive to input data (fewest user queries show significant preference)
15. Refine the problem ++++
So the constraint is: a randomly clicking user must not create a preference for either ranker
And the target is: maximize sensitivity
With variable: the definition of the credit function δ
16. Define the credit function: δ
Linear Rank difference:
Inverse Rank:
Since it is an optimization problem, the existence of a solution should be
guaranteed theoretically; in the paper, however, it is only guaranteed
empirically.
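As I read the paper, the two credit functions can be written as follows (positive δ credits A); treat this as a sketch of the definitions rather than the paper's exact notation:

```python
def rank_star(d, ranking):
    """rank*(d, A): 1-based rank of d in A, or |A|+1 if d is absent."""
    return ranking.index(d) + 1 if d in ranking else len(ranking) + 1

def delta_linear(d, A, B):
    """Linear rank difference: positive when A ranks d higher than B."""
    return rank_star(d, B) - rank_star(d, A)

def delta_inverse(d, A, B):
    """Inverse rank: emphasizes differences near the top of the lists."""
    return 1.0 / rank_star(d, A) - 1.0 / rank_star(d, B)
```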
17. Theoretical Benefits
PROPERTY 1:
Balanced interleaving ⊆ This framework
PROPERTY 2:
Team Draft interleaving ⊆ This framework
PROPERTY 3:
This framework ⊆ Probabilistic interleaving
PROPERTY 4:
The merged list is something “in between” the two
18. Theoretical Benefits +
PROPERTY 5:
The breaking case of Balanced interleaving is avoided
PROPERTY 6:
The insensitivity of Team Draft interleaving is improved
PROPERTY 7:
Probabilistic interleaving degrades the user experience more
19. Illustration
One option to pursue is sensitivity
L1 is unbiased towards a random user: 3·25% + (−1)·(35% + 40%) = 0
Note: the number of constraints is 5, but there are 6 unknown factors?
(it is a maximization problem, and the goal is to maximize Σ p_i · sensitivity(L_i))
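The arithmetic on this slide can be checked directly; the credits (3 and −1) and the list probabilities (25%, 35%, 40%) are the slide's numbers:

```python
# Expected credit for a randomly clicking user over the three allowed
# lists on the slide: credit 3 with probability 25%, credit -1 with
# probabilities 35% and 40%.
credits_and_probs = [(3, 0.25), (-1, 0.35), (-1, 0.40)]
bias = sum(c * p for c, p in credits_and_probs)
# 3*0.25 - 0.35 - 0.40 = 0: the distribution is unbiased for L1
```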
21. Evaluation: summary
Construct a dataset to simulate interleaving and user interaction
Evaluate the Pearson correlation between each pair of algorithms
Analyze cases where the algorithms disagree
Evaluate result quality by experts
Analyze bias and sensitivity among algorithms
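The pairwise-agreement step can be sketched as a plain Pearson correlation over per-query preference scores (the data layout here is my assumption, not the paper's):

```python
def pearson(x, y):
    """Pearson correlation of two equal-length score sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```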
22. Evaluation +: construction of dataset
Collect all queries as well as their top-4 results from a search engine
Since the web and the algorithm keep changing, there are many distinct
rankings for the same query.
For each query, make sure that there are at least 4 distinct
rankings, each shown to users at least 10 times, with at least 1
click.
The most frequent ranking is regarded as A, and a most
dissimilar one is regarded as B.
Further filter the log, so that the results produced by both Balanced
interleaving and Team Draft interleaving are frequent.
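The per-query filtering step can be sketched under an assumed log layout (the structure below is hypothetical; the paper works with a production click log):

```python
def eligible_queries(log):
    """log: {query: {ranking_tuple: {'shown': int, 'clicks': int}}}.
    Keep queries with >= 4 distinct rankings, each shown >= 10 times
    and clicked at least once."""
    kept = {}
    for query, rankings in log.items():
        good = {r: s for r, s in rankings.items()
                if s['shown'] >= 10 and s['clicks'] >= 1}
        if len(good) >= 4:
            kept[query] = good
    return kept
```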
23. Evaluation ++
24. Evaluation +++
25. Evaluation ++++
Bias comparison among different algorithms
26. Evaluation +++++
Sensitivity comparison among different algorithms
28. Discussion
Contributions of this paper:
Recasts the question of obtaining interleaving
algorithms as a constrained optimization problem
The solution is very intuitive and general
Many interesting examples illustrate the breaking cases of
previous approaches
Note:
The evaluation is simulated on logs from only one search engine.
For interleaving, shouldn’t we expect an evaluation based on different search engines?
And that is why the human evaluation results are not good across all the algorithms.
29. Discussion +
“A and B are not shown to users as they have low sensitivity”
This is intuitive; however, it contradicts the result shown in Table 1: (a,b,c,d) has sensitivity 0.83,
which is high?