1. Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
http://living-labs.net
@livinglabsnet
“Give us your ranking, we’ll have it clicked!”
Krisztian Balog
University of Stavanger
Liadh Kelly
Trinity College Dublin
Anne Schuth
Blendle
7th International Conference of the CLEF Association (CLEF 2016) | Évora, Portugal, 2016
3. Motivation
- Overall goal: make information retrieval evaluation more realistic
- #1: How to test a new method with real users in their natural task environment (i.e., on the live site)?
- #2: How to make interaction data available for method development?
4. Key idea
[Diagram: new retrieval methods connect through an API to the live site and its users; the site's data (docs/products, logs, etc.) flows back through the same API]
K. Balog, L. Kelly, and A. Schuth. Head First: Living Labs for Ad-hoc Search Evaluation. CIKM'14
5. Key idea
#1: An API orchestrates all data exchange between the live site and experimental systems
6. Key idea
#2: Focus on frequent (head) queries
- Ranked result lists can be generated offline
- Enough traffic on them (historical & live)
7. Key idea
#3: Medium to large organizations with a fair amount of search volume
- Typically lack their own R&D department
8. Methodology
1. Queries, candidate documents, historical search and click data made available
{
"queries": [
{
"creation_time": "Wed, 22 Apr 2015 09:15:41 -0000",
"qid": "R-q1",
"qstr": "monster high",
"type": "train"
},
{
"creation_time": "Wed, 22 Apr 2015 09:15:41 -0000",
"qid": "R-q51",
9. Methodology
1. Queries, candidate documents, historical search and click data made available
{
"doclist": [
{
"docid": "R-d1291",
"site_id": "R",
"title": "LEGO DUPLO Hamupipőke hintója 6153"
},
{
"docid": "R-d1306",
"site_id": "R",
"title": "LEGO Rendőrkapitányság 5681"
10. Methodology
1. Queries, candidate documents, historical search and click data made available
{
"content": {
"age_max": 3,
"age_min": 1,
"arrived": "2014-08-28",
"available": 0,
"brand": "Lego",
"category": "LEGO",
"category_id": "38",
"characters": [],
"description": "Lego Duplo - Építő- és j
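The listings above are plain JSON and can be consumed directly. A minimal sketch of filtering out the train queries a participant has to rank; the entries here are illustrative placeholders in the shape shown above, not the real LL4IR data:

```python
# Query listing in the shape shown on the slides; the second entry's
# query string is a made-up placeholder (the slide truncates it).
listing = {
    "queries": [
        {"qid": "R-q1", "qstr": "monster high", "type": "train"},
        {"qid": "R-q51", "qstr": "example query", "type": "test"},
    ]
}

# Feedback is available for train queries throughout the benchmark
# (see the benchmark organization slide), so rank those first.
train_queries = [q for q in listing["queries"] if q["type"] == "train"]
train_qids = [q["qid"] for q in train_queries]
```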
11. Methodology
2. Rankings are generated for each query and uploaded through an API
{
"qid": "U-q22",
"runid": "82",
"creation_time": "Wed, 04 Jun 2014 15:03:56 -0000",
"doclist": [
{
"docid": "U-d4"
},
{
"docid": "U-d2"
}, ...
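A run like the one above is a small JSON object that a participant assembles client-side and PUTs to the API. A minimal sketch; the upload endpoint and participant key are omitted here, as they depend on the API deployment:

```python
import json
from datetime import datetime, timezone

def build_run(qid, runid, docids):
    """Assemble a run in the shape shown above: an ordered doclist
    for one query, tagged with a run identifier and a timestamp."""
    return {
        "qid": qid,
        "runid": runid,
        "creation_time": datetime.now(timezone.utc).strftime(
            "%a, %d %b %Y %H:%M:%S -0000"),
        "doclist": [{"docid": d} for d in docids],
    }

run = build_run("U-q22", "82", ["U-d4", "U-d2"])
payload = json.dumps(run)  # this body would be PUT to the API's run endpoint
```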
12. Methodology
3. When any of the test queries is fired, the live site requests rankings from the API and interleaves them with those of the production system
13. Interleaving
- Site provides the set of candidate items that can be re-ranked (safety mechanism)
- Experimental ranking is interleaved with the production ranking
- Needs 1-2 orders of magnitude less data than A/B testing (also, it is a within-subject as opposed to a between-subject design)
Example: system A ranks (doc 1, doc 2, doc 3, doc 4, doc 5) and system B ranks (doc 2, doc 4, doc 7, doc 1, doc 3); the interleaved list shown to the user is (doc 1, doc 2, doc 4, doc 3, doc 7). Inference from clicks: A > B
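The interleaving step can be sketched with team-draft interleaving (the "tdi" type seen later in the feedback objects): the two systems alternately draft their highest-ranked not-yet-picked document, with a coin flip breaking ties in draft order. This is a minimal illustration, not the platform's exact implementation:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Team-draft interleaving of two rankings over the same
    candidate set. Returns the interleaved list and a
    doc -> team ("A"/"B") assignment used to credit clicks."""
    interleaved, team = [], {}
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        remaining_a = [d for d in ranking_a if d not in team]
        remaining_b = [d for d in ranking_b if d not in team]
        picks_a = sum(1 for t in team.values() if t == "A")
        picks_b = sum(1 for t in team.values() if t == "B")
        # the team with fewer picks drafts next; coin flip on ties
        a_turn = picks_a < picks_b or (picks_a == picks_b and rng.random() < 0.5)
        if a_turn and remaining_a:
            doc, side = remaining_a[0], "A"
        elif remaining_b:
            doc, side = remaining_b[0], "B"
        else:
            doc, side = remaining_a[0], "A"
        interleaved.append(doc)
        team[doc] = side
    return interleaved, team

il, team = team_draft_interleave(
    ["doc1", "doc2", "doc3", "doc4", "doc5"],
    ["doc2", "doc4", "doc7", "doc1", "doc3"],
    rng=random.Random(0))
```

Clicks on the interleaved list are then credited to the team that contributed the clicked document, which is what makes the comparison within-subject: both systems are judged on the same query by the same user.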
14. Methodology
4. Participants get detailed feedback on user interactions (clicks)
{
"feedback": [
{
"qid": "S-q1",
"runid": "baseline",
"type": "tdi",
"doclist": [
{
"docid": "S-d1",
"clicked": true,
"team": "site",
15. Methodology
5. Ultimate measure is the number of "wins" against the production system (aggregated over a period of time)
Outcome = #Wins / (#Wins + #Losses)
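Given feedback objects like those in step 4, the outcome can be computed per impression. A sketch assuming the two team labels are "participant" and "site" (the feedback snippet above shows "site"; the participant label is an assumption here):

```python
def outcome(feedback_items):
    """Outcome = #Wins / (#Wins + #Losses), aggregated over
    impressions. An impression is a win for the participant if its
    team's documents got strictly more clicks than the site's,
    a loss if strictly fewer, and a tie (ignored) otherwise."""
    wins = losses = 0
    for item in feedback_items:
        clicks = [d for d in item["doclist"] if d.get("clicked")]
        ours = sum(1 for d in clicks if d["team"] == "participant")
        theirs = sum(1 for d in clicks if d["team"] == "site")
        if ours > theirs:
            wins += 1
        elif theirs > ours:
            losses += 1
    return wins / (wins + losses) if wins + losses else None
```

Ties (equal clicks, including zero-click impressions) contribute to neither count, so the outcome is the participant's win rate over decided impressions; 0.5 means parity with the production system.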
16. What is in it for participants?
- Access to privileged commercial data
- (Search and click-through data)
- Opportunity to test IR systems with real,
unsuspecting users in a live setting
- (Not the same as crowdsourcing!)
- (Continuous evaluation is possible, not limited to
yearly evaluation cycle)
23. Benchmark organization
- Train queries (training and test periods): feedback available; individual feedback; update possible
- Test queries, training period: feedback available; no individual feedback; update possible
- Test queries, test period: no feedback available; no individual feedback; update not possible
24. Product search
- Ad-hoc retrieval over a product catalog
- Several thousand products
- Limited amount of text, lots of structure
- Categories, characters, brands, etc.
26. Product data
[Annotated product page showing: product name; price / bonus price; short description; long description; recommended age from/to; gender recommendation; categories; brands; (links to) photos]
27. {
"content": {
"age_max": 10,
"age_min": 6,
"arrived": "2014-08-28",
"available": 1,
"brand": "Mattel",
"category": "Babák, kellékek",
"category_id": "25",
"characters": [],
"description": "A Monster High® iskola szörnycsemetéi […]",
"gender": 2,
"main_category": "Baba, babakocsi",
"main_category_id": "3",
"photos": [
"http://regiojatek.hu/data/regio_images/normal/20777_0.jpg",
"http://regiojatek.hu/data/regio_images/normal/20777_1.jpg",
[…]
],
"price": 8675.0,
"product_name": "Monster High Scaris Paravárosi baba többféle",
"queries": {
"clawdeen": "0.037",
"monster": "0.222",
"monster high": "0.741"
},
"short_description": "A Monster High® iskola szörnycsemetéi
első külföldi útjukra indulnak..."
},
"creation_time": "Mon, 11 May 2015 04:52:59 -0000",
"docid": "R-d43",
"site_id": "R",
"title": "Monster High Scaris Paravárosi baba többféle"
}
(The "queries" field lists frequent queries that led to the product.)
28. Queries
- Typically very short
monster high
magnetiz
duplo
lego friends
geomag
trash+pack
barbie
monopoly
lego duplo
transformers
star wars
nerf
carrera
baba
30. Inventory changes
[Figure: daily product counts, 05-01 through 05-15, of new arrivals, products that became available, and products that became unavailable]
32. Summary
- Successes
- Experimental methodology
- Many interesting opportunities to address current limitations
(come to NewsREEL & LL4IR session tomorrow)
- The living labs platform
- Open source, can be used for a variety of tasks
- Some interesting work for product search
- See best of the labs session
- Lack of success
- Raising sufficient interest in the use cases at CLEF
33. Limitations / Open issues
- Head queries only: Considerable portion of traffic,
but only popular info needs
- Lack of context: No knowledge of the searcher’s
location, previous searches, etc.
- No real-time feedback: API provides detailed
feedback, but it’s not immediate
- Limited control: Experimentation is limited to single
searches, where results are interleaved with those of
the production system; no control over the entire
result list
- Ultimate measure of success: Search is only a means to an end; it is not the ultimate goal
34. TREC Open Search
http://trec-open-search.org/
- Use-case: academic search
- Ad-hoc document search
- Sites
- CiteSeerX
- SSOAR — German Social Sciences
- Microsoft Academic Search
- Round #3 runs from Oct 1 to Nov 15