Optimized interleaving for online retrieval evaluation

(Best paper in WSDM’13)

Authors: Filip Radlinski, Nick Craswell
Slides by: Han Jiang

Agenda
Basic concepts
Previous algorithms
Framework
Invert Problem
Refine Problem
Theoretical benefits
Illustration

Evaluation
Discussion

Basic concepts
What is interleaving?
Merge results from different retrieval algorithms.
Only a combined list is shown to user.
The quality of the algorithms can be inferred from clickthrough data.

[Figure: Query → Search Engine A and Search Engine B → Source List A and Source List B → Interleaving Algorithm → Interleaved list and Assignment → Clicks → Credit function → Evaluation Result]

Basic concepts +

Ah, that’s easy… how about:
Interleaving method = pick the best results from each algorithm?
Wait… how do we know whether d1 is better than d4?

OK, then toss a coin instead, and
Credit function = if di is clicked and ranked higher by ranker A, prefer A.
Urgh… When a user clicks at random among (d1, d2, d3), A is always preferred…
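
The weakness of this naive credit function can be checked in a few lines. The sketch below assumes the rankings A = (d1, d2, d3, d4) and B = (d4, d1, d2, d3) that appear in the later example slides; `naive_credit` is a hypothetical helper name, not something from the paper.

```python
# Rankings assumed for illustration (taken from the later example slides).
A = ["d1", "d2", "d3", "d4"]
B = ["d4", "d1", "d2", "d3"]

def naive_credit(doc, a=A, b=B):
    """Return the ranker that places `doc` higher, or 'tie'."""
    rank_a, rank_b = a.index(doc), b.index(doc)
    if rank_a < rank_b:
        return "A"
    if rank_b < rank_a:
        return "B"
    return "tie"

# One uniformly random click lands on each document with equal probability.
wins = [naive_credit(d) for d in A]
print(wins)                          # ['A', 'A', 'A', 'B']
print(wins.count("A") / len(wins))   # 0.75
```

Three of the four documents sit higher in A, so a user who clicks without looking at relevance still favors A most of the time.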

Basic concepts ++
So, what is a good interleaving algorithm?
Intuitively*, a good one should:
Be blind to user. Be blind to retrieval functions.
Be robust to biases in the user’s decision process (that do not relate to retrieval quality)
Not substantially alter the search experience
Lead to clicks that reflect the user’s preference

[*] Joachims, Optimizing Search Engines Using Clickthrough Data, KDD’02

Agenda
Basic concepts √
Previous algorithms
Framework
Invert Problem
Refine Problem
Illustration

Evaluation
Discussion

Previous Algorithms
Balanced Interleaving
Toss a coin once to decide who goes first, then take the best remaining item from each ranker in turn.

Team Draft Interleaving
Toss a coin before every pair of picks; the winner picks its best remaining item first.

Probabilistic Interleaving
Toss a coin before every pick and sample an item from the winner’s ranking.

A weight function ensures that a higher-ranked document has a higher
probability of being picked.
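
The first two procedures can be sketched as follows. This is a minimal illustration of the coin-toss descriptions above, not the authors' code; the function names and the `toss` parameter are assumptions introduced here so the coin can be fixed for a reproducible run.

```python
import random

def balanced_interleave(a, b, a_first):
    """Balanced interleaving: one coin toss (`a_first`), then the rankers
    alternate, each adding its best document not yet in the merged list."""
    merged, ia, ib = [], 0, 0
    while ia < len(a) and ib < len(b):
        if ia < ib or (ia == ib and a_first):
            if a[ia] not in merged:
                merged.append(a[ia])
            ia += 1
        else:
            if b[ib] not in merged:
                merged.append(b[ib])
            ib += 1
    return merged

def team_draft_interleave(a, b, toss=lambda: random.random() < 0.5):
    """Team Draft interleaving: one coin toss per pair of picks; the winner
    of the toss picks first. Returns the merged list and each document's team."""
    merged, team = [], {}
    total = len(set(a) | set(b))
    while len(merged) < total:
        order = ("A", "B") if toss() else ("B", "A")
        for name in order:
            source = a if name == "A" else b
            # Best document of this ranker that nobody has picked yet.
            pick = next((d for d in source if d not in team), None)
            if pick is not None:
                merged.append(pick)
                team[pick] = name
    return merged, team

A = ["d1", "d2", "d3", "d4"]
B = ["d4", "d1", "d2", "d3"]
print(balanced_interleave(A, B, a_first=True))    # ['d1', 'd4', 'd2', 'd3']
merged, team = team_draft_interleave(A, B, toss=lambda: True)
print(merged)                                     # ['d1', 'd4', 'd2', 'd3']
```

With the coin fixed in favor of A, both procedures yield the merged list M = d1 d4 d2 d3 used in the worked examples.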

Previous Algorithms +
Credit functions consider only the documents that users clicked.
Balanced Interleaving (coin=A)
A: d1 d2 d3 d4
B: d4 d1 d2 d3

M: d1 d4 d2 d3
clicks on: d1 d3

A wins

Team Draft Interleaving (coin=AA)
A: d1 d2 d3 d4
B: d4 d1 d2 d3

M: d1 d4 d2 d3
clicks on: d1 d3

tie

Probabilistic Interleaving (possible coin=AA, AB)
A: d1 d2 d3 d4
B: d4 d1 d2 d3

M: d1 d4 d2 d3
clicks on: d1 d3

A wins with p=100%
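
The Balanced and Team Draft outcomes in these examples can be reproduced with a short sketch. It assumes the usual credit rules: Team Draft counts clicks per team, and Balanced compares clicks within the top-k of each input ranking, where k is the shallowest cutoff that covers the lowest click. The function names and the `team` dictionary are written out by hand for illustration.

```python
A = ["d1", "d2", "d3", "d4"]
B = ["d4", "d1", "d2", "d3"]
M = ["d1", "d4", "d2", "d3"]
clicks = ["d1", "d3"]

# Team assignment for coin=AA: A picks d1, B picks d4, A picks d2, B picks d3.
team = {"d1": "A", "d4": "B", "d2": "A", "d3": "B"}

def rank(doc, ranking):
    """1-based rank of `doc`, or len(ranking) + 1 if the ranking omits it."""
    return ranking.index(doc) + 1 if doc in ranking else len(ranking) + 1

def winner(score_a, score_b):
    return "A" if score_a > score_b else "B" if score_b > score_a else "tie"

def team_draft_credit(team, clicks):
    """Each click credits the team that contributed the clicked document."""
    a = sum(1 for d in clicks if team.get(d) == "A")
    b = sum(1 for d in clicks if team.get(d) == "B")
    return winner(a, b)

def balanced_credit(a, b, merged, clicks):
    """Compare clicks inside the top-k of each ranking, where k is the smallest
    cutoff at which either ranking contains the lowest clicked document."""
    if not clicks:
        return "tie"
    lowest = max(clicks, key=merged.index)
    k = min(rank(lowest, a), rank(lowest, b))
    hits_a = sum(1 for d in clicks if d in a[:k])
    hits_b = sum(1 for d in clicks if d in b[:k])
    return winner(hits_a, hits_b)

print(balanced_credit(A, B, M, clicks))   # A
print(team_draft_credit(team, clicks))    # tie
```

The same merged list and the same clicks lead to different verdicts, which is why the choice of credit function matters as much as the merge itself.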

Agenda
Basic concepts √
Previous algorithms √
Framework
Invert Problem
Refine Problem
Illustration

Evaluation
Discussion

Invert the problem
Why the previous algorithms are not good enough:
Balanced interleaving & Team Draft interleaving: biased
Even random clicks can produce a winner.

Probabilistic interleaving: degrades the user experience
For example, A = (d1, d2) and B = (d1, d2), yet M can be (d2, d1)

Therefore, the problem of interleaving should be more constrained

A good way is to start from the principles…
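
The complaint about probabilistic interleaving can be made concrete by enumerating every merged list it can produce. The sketch below assumes rank-based weights of the form 1/rank^τ with τ = 3, renormalized after each pick; the exact weight function is an assumption here, and both rankings are assumed to contain the same documents.

```python
from fractions import Fraction

TAU = 3  # assumed decay exponent of the rank-based weights

def pick_distribution(ranking, used):
    """P(next document) for one ranker: weight 1/rank**TAU over unused documents."""
    weights = {d: Fraction(1, (r + 1) ** TAU)
               for r, d in enumerate(ranking) if d not in used}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}

def merged_list_distribution(a, b):
    """Exact distribution over merged lists: a fair coin before every pick,
    then one document sampled from the winning ranker's distribution."""
    dist = {(): Fraction(1)}
    for _ in range(len(a)):
        nxt = {}
        for prefix, p in dist.items():
            for ranking in (a, b):
                for d, q in pick_distribution(ranking, prefix).items():
                    key = prefix + (d,)
                    nxt[key] = nxt.get(key, 0) + p * Fraction(1, 2) * q
        dist = nxt
    return dist

dist = merged_list_distribution(["d1", "d2"], ["d1", "d2"])
for merged, p in sorted(dist.items()):
    print(merged, p)
# ('d1', 'd2') 8/9
# ('d2', 'd1') 1/9
```

Even though both rankers agree completely, the reversed list is shown with non-zero probability, which is the degradation the slide refers to.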

Refine the problem
Again, what is a good interleaving algorithm?
Be robust to biases in the user’s decision process (that do not relate
to retrieval quality)

Not substantially alter the search experience (show one of the rankings,
or a ranking “in between” the two)

Lead to clicks that reflect the user’s preference:
If document d is clicked, the input ranker that ranked d higher is given more credit
A randomly clicking user doesn’t create a preference for either ranker

Be sensitive to input data (fewest user queries show significant preference)

Refine the problem +
Again, what is a good interleaving algorithm?
Be robust to biases in the user’s decision process (that do not relate
to retrieval quality)

Not substantially alter the search experience (show one of the rankings,
or a ranking “in between” the two)

Lead to clicks that reflect the user’s preference:


Refine the problem ++
Not substantially alter the search experience (show one of the
rankings, or a ranking “in between” the two)

A=(d1, d2), B=(d1,d2), M = (d1, d2)
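
One way to read the “in between” requirement is that every prefix of the merged list must be the union of a top-i of A and a top-j of B. Under that reading, the allowed lists can be enumerated by always extending with the best unused document of either ranker. The sketch below is an illustration of this reading; `allowed_lists` is a hypothetical name.

```python
def allowed_lists(a, b):
    """All merged lists in which every prefix is the union of some top-i of A
    and some top-j of B."""
    total = len(set(a) | set(b))
    results = []

    def extend(prefix):
        if len(prefix) == total:
            results.append(tuple(prefix))
            return
        # The only legal next documents: each ranker's best unused document.
        candidates = []
        for ranking in (a, b):
            nxt = next((d for d in ranking if d not in prefix), None)
            if nxt is not None and nxt not in candidates:
                candidates.append(nxt)
        for d in candidates:
            extend(prefix + [d])

    extend([])
    return results

print(allowed_lists(["d1", "d2"], ["d1", "d2"]))
# [('d1', 'd2')]
for lst in allowed_lists(["d1", "d2", "d3", "d4"], ["d4", "d1", "d2", "d3"]):
    print(lst)
# ('d1', 'd2', 'd3', 'd4')
# ('d1', 'd2', 'd4', 'd3')
# ('d1', 'd4', 'd2', 'd3')
# ('d4', 'd1', 'd2', 'd3')
```

When the two rankers agree, only their shared ranking survives, so M = (d2, d1) is ruled out.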

Lead to clicks that reflect the user’s preference:


Formula annotations:
a possible interleaved list under the previous constraints
length of the list
number of clicks
score function: when > 0, credit goes to A; otherwise to B

Refine the problem +++

Refine the problem ++++
So the constraint is:

And target is:

With variable: the definition of

Define predict function: δ
Linear Rank difference:

Inverse Rank:

Since this is an optimization problem, the existence of a solution should be
guaranteed theoretically; in the paper it is only shown empirically.
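
The formulas on this slide did not survive in the transcript. The block below is a reconstruction under assumed notation, not a copy of the paper's equations: L_1, …, L_n are the allowed interleaved lists, p_i is the probability of showing L_i, l_{i,j} is the document at position j of L_i, and s(L_i) is the sensitivity of L_i. The sign of δ is chosen so that positive credit goes to A, which agrees with the Illustration arithmetic 3·25% + (−1)·(35% + 40%) = 0.

```latex
% Reconstruction under assumed notation; positive credit favors ranker A.
\begin{align*}
\text{maximize}\quad & \sum_{i=1}^{n} p_i \, s(L_i) \\
\text{subject to}\quad & \sum_{i=1}^{n} p_i = 1, \qquad 0 \le p_i \le 1, \\
& \sum_{i=1}^{n} p_i \sum_{j=1}^{k} \delta_j(L_i) = 0
  \quad \text{for every } k \in \{1, \dots, |L|\}, \\[4pt]
\text{linear rank difference:}\quad
& \delta_j(L_i) = \operatorname{rank}(l_{i,j}, B) - \operatorname{rank}(l_{i,j}, A), \\
\text{inverse rank:}\quad
& \delta_j(L_i) = \frac{1}{\operatorname{rank}(l_{i,j}, A)} - \frac{1}{\operatorname{rank}(l_{i,j}, B)}.
\end{align*}
```

The variables of the program are the probabilities p_i; the per-k equalities say that a user clicking uniformly at random within the top k positions gives zero expected credit.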

Theoretical Benefits
PROPERTY 1: Balanced interleaving ⊆ this framework

PROPERTY 2: Team Draft interleaving ⊆ this framework

PROPERTY 3: This framework ⊆ Probabilistic interleaving

PROPERTY 4: The merged list is something “in between” the two

Theoretical Benefits +

PROPERTY 5: The breaking case of Balanced interleaving is avoided

PROPERTY 6: The insensitivity of Team Draft interleaving is improved

PROPERTY 7: Probabilistic interleaving degrades the user experience more

Illustration

An option to pursue is sensitivity

L1 unbiased towards random user: 3*25% + (-1)*(35% + 40%) = 0

Note: there are 5 constraints but 6 unknowns?
(It is a maximization problem: the goal is to maximize Σ_i p_i · sensitivity(L_i).)
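
The arithmetic above can be checked for every cutoff k, not only the first position. The slide's figure is missing from this transcript, so the rankings A = (a, b, c, d) and B = (b, d, c, a) and the assignment of 25%, 35% and 40% to three specific lists are assumptions chosen to be consistent with the numbers above; the credit function is the linear rank difference with positive values favoring A.

```python
from fractions import Fraction

A = ["a", "b", "c", "d"]   # assumed ranking A
B = ["b", "d", "c", "a"]   # assumed ranking B

def linear_rank_difference(doc):
    """Credit for a click on `doc`: positive favors A, negative favors B."""
    return (B.index(doc) + 1) - (A.index(doc) + 1)

# Assumed distribution over three allowed interleaved lists.
P = {
    ("a", "b", "d", "c"): Fraction(25, 100),
    ("b", "a", "d", "c"): Fraction(35, 100),
    ("b", "d", "a", "c"): Fraction(40, 100),
}

def expected_credit_top_k(k):
    """Sum over lists of p * (total credit of the top-k positions). Zero means
    a user clicking uniformly at random in the top k favors neither ranker."""
    return sum(p * sum(linear_rank_difference(d) for d in lst[:k])
               for lst, p in P.items())

for k in range(1, 5):
    print(k, expected_credit_top_k(k))
```

For k = 1 the sum is exactly 3·25% + (−1)·(35% + 40%); under these assumptions the deeper cutoffs balance out as well.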

Agenda
Basic concepts √
Framework √
Invert Problem √
Refine Problem √
Theoretical benefits √
Illustration √

Evaluation
Discussion

Evaluation: summary
Construct a dataset to simulate interleaving and user interaction
Evaluate the Pearson correlation between each pair of algorithms
Analyze the cases where the algorithms disagree
Evaluate result quality with expert judgments
Analyze bias and sensitivity across the algorithms
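
For reference, the Pearson correlation between the per-query preference scores of two interleaving algorithms can be computed as below. The score lists are toy values made up for illustration, not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Toy per-query preference scores (positive = A preferred); hypothetical values.
scores_method_1 = [0.4, -0.1, 0.3, 0.0, -0.5]
scores_method_2 = [0.5, -0.2, 0.2, 0.1, -0.4]
print(round(pearson(scores_method_1, scores_method_2), 3))
```

A value near 1 means the two algorithms agree on which ranker wins, and by how much, across queries.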

Evaluation +: construction of dataset
Collect all queries together with the top-4 results from a search engine.
Since the web and the ranking algorithm keep changing, there are many distinct
rankings for the same query.
For each query, make sure that there are at least 4 distinct
rankings, each shown to users at least 10 times, with at least 1
click.
The most frequent ranking is taken as A, and the most
dissimilar one as B.
Further filter the log so that the results that either Balanced
interleaving or Team Draft interleaving would produce are frequent.
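
The per-query filter might look roughly like the sketch below. The log schema (query, ranking, clicked) and the function name are hypothetical, and the thresholds are read as applying to each ranking individually, which is one interpretation of the slide. Choosing B is left out because the dissimilarity measure is not specified here.

```python
from collections import defaultdict

def select_queries(log, min_rankings=4, min_impressions=10, min_clicks=1):
    """`log` is an iterable of (query, ranking, clicked) impressions.
    Returns {query: ranking A} for the queries that pass the filter."""
    # query -> ranking -> [impressions, clicks]
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for query, ranking, clicked in log:
        counts = stats[query][tuple(ranking)]
        counts[0] += 1
        counts[1] += int(clicked)

    selected = {}
    for query, rankings in stats.items():
        qualifying = {r: c for r, c in rankings.items()
                      if c[0] >= min_impressions and c[1] >= min_clicks}
        if len(qualifying) >= min_rankings:
            # Ranker A: the most frequently shown qualifying ranking.
            selected[query] = max(qualifying, key=lambda r: qualifying[r][0])
    return selected

# Synthetic log: query "q" has four rankings with enough traffic, "q2" has one.
rankings = [("a", "b", "c", "d"), ("b", "a", "c", "d"),
            ("a", "c", "b", "d"), ("b", "d", "c", "a")]
log = [("q", r, i == 0) for r in rankings for i in range(10)]
log += [("q", rankings[0], False)] * 2
log += [("q2", rankings[0], True)] * 20
print(select_queries(log))   # {'q': ('a', 'b', 'c', 'd')}
```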

Evaluation ++++

Bias comparison among different algorithms

Evaluation +++++

Sensitivity comparison among different algorithms

Agenda
Basic concepts √
Framework √
Invert Problem √
Refine Problem √
Theoretical benefits √
Illustration √

Evaluation √
Discussion

Discussion
Contributions of this paper:
Inverts the question of designing interleaving algorithms into a
constrained optimization problem
The solution is very intuitive and general
Many interesting examples illustrate the breaking cases of
previous approaches
Note:
The evaluation is simulated on logs from only one search engine.
For interleaving, wouldn’t we expect an evaluation based on different search engines?
And that is why the human evaluation results are not good for any of the algorithms.

Discussion +

“A and B are not shown to users as they have low sensitivity”
This is intuitive; however, it contradicts the result shown in Table 1: (a,b,c,d) has sensitivity 0.83,
which is high?
