8
Dynamic Collective Entity Representations
Use “collective intelligence” to mine entity descriptions and enrich entity representations.
It is like document expansion (terms are added through explicit links).
It is not query expansion (terms found through predicted links).
9
Advantages
Cheap: only the document in the index changes, and we leverage tried-and-tested retrieval algorithms
Free “smoothing”: external sources (e.g., tweets) may capture newly evolving word associations (e.g., around the Ferguson shooting) and incorporate out-of-document terms
“Move relevant documents closer to queries”, i.e., close the gap between searcher vocabulary and the documents in the index
10
Haven’t we seen this before?
Anchors and queries in particular have been shown to improve retrieval [1]
Tweets have been shown to be similar to anchors [2]
Social tags likewise [3]
But:
only in batch (i.e., add the data, then see how it affects retrieval)
only from a single source
[1] T. Westerveld, W. Kraaij, and D. Hiemstra. Retrieving web pages using content, links, URLs and anchors. TREC 2001.
[2] G. Mishne and J. Lin. Twanchor text: A preliminary study of the value of tweets as anchor text. SIGIR 2012.
[3] C.-J. Lee and W. B. Croft. Incorporating social anchors for ad hoc retrieval. OAIR 2013.
13
Method: Adaptive ranking
Supervised single-field weighting model
Features:
field similarity: retrieval score per field.
field “importance”: length, novel terms, etc.
entity “importance”: time since last update.
(Re-)learn optimal weights from clicks
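As a rough sketch of how such a model could look: the Python below combines per-field similarity scores with the field- and entity-importance features and updates its weights online from clicks. The field names, the term-overlap stand-in for the per-field retrieval scores, and the online logistic-regression learner are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of a supervised field-weighting ranker that (re-)learns its weights
# from clicks. Field names, the term-overlap similarity, and the SGD learner
# are illustrative assumptions, not the paper's exact implementation.
from dataclasses import dataclass
from typing import Dict, List
import numpy as np
from sklearn.linear_model import SGDClassifier  # requires scikit-learn >= 1.1

FIELDS = ["kb", "anchors", "tweets", "tags", "queries"]

@dataclass
class Entity:
    fields: Dict[str, List[str]]  # field name -> terms accumulated so far
    last_update: float = 0.0      # time of the last expansion

def field_sim(query_terms, field_terms):
    """Stand-in for a per-field retrieval score (e.g., a BM25 score per field)."""
    return float(len(set(query_terms) & set(field_terms)))

def features(query_terms, entity, now):
    """Field similarity + field importance + entity importance features."""
    f = []
    kb_terms = set(entity.fields.get("kb", []))
    for name in FIELDS:
        terms = entity.fields.get(name, [])
        f.append(field_sim(query_terms, terms))      # field similarity
        f.append(float(len(terms)))                  # field importance: length
        f.append(float(len(set(terms) - kb_terms)))  # field importance: novel terms
    f.append(now - entity.last_update)               # entity importance: recency
    return np.asarray(f)

# Online learner: the clicked entity is a positive example, the rest negatives.
ranker = SGDClassifier(loss="log_loss")

def update_from_click(query_terms, retrieved, clicked, now):
    X = np.stack([features(query_terms, e, now) for e in retrieved])
    y = np.array([1 if e is clicked else 0 for e in retrieved])
    ranker.partial_fit(X, y, classes=[0, 1])

def rank(query_terms, candidates, now):
    # Note: call update_from_click at least once before ranking.
    X = np.stack([features(query_terms, e, now) for e in candidates])
    scores = ranker.decision_function(X)
    return [candidates[i] for i in np.argsort(-scores)]
```

A pointwise online learner is used here only to keep the sketch short; any learning-to-rank method trained on the observed clicks would fit the same slot.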
14
Experimental setup
1. Data:
MSN query log (62,841 queries, with clicks on entities)
Each query is treated as a time unit
For each query:
Produce ranking
Observe click
Evaluate ranking (MAP/P@1)
Expand entities (w/ dynamic descriptions)
[re-train ranker]
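A minimal sketch of this per-query loop, with the individual steps passed in as callables; `rank_fn`, `expand_fn`, `retrain_fn`, and `new_descriptions` are placeholders, and retraining after every query is just one possible schedule.

```python
# Sketch of the streaming evaluation protocol: each query in the log is one
# time unit. The callables are placeholders for the steps on this slide,
# not actual library functions.
def run_stream(query_log, rank_fn, expand_fn, retrain_fn, new_descriptions):
    """Replay the query log as a stream and report mean P@1."""
    hits, n = 0, 0
    for t, (query, clicked_entity) in enumerate(query_log):
        ranking = rank_fn(query)                            # 1. produce ranking
        # 2. observe the click and 3. evaluate (P@1 here; MAP is analogous)
        hits += int(bool(ranking) and ranking[0] == clicked_entity)
        # 4. expand entities with the dynamic descriptions (tweets, tags,
        #    anchors, queries) that arrived up to time t
        for entity, new_terms in new_descriptions(t):
            expand_fn(entity, new_terms, t)
        # 5. (re-)train the ranker on the click observed at this time step
        retrain_fn(query, ranking, clicked_entity, t)
        n += 1
    return hits / n if n else 0.0
```

The callables could, for instance, be bound to the `rank` and `update_from_click` functions from the ranker sketch above.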
15
Main results
Comparing the effectiveness of different description sources
Comparing adaptive vs. non-adaptive ranker performance
19
In summary
Expanding entity representations with different sources enables better matching of queries to entities
As new content comes in, it is beneficial to retrain the ranker
Informing the ranker of the “expansion state” further improves performance
First entities & structure: I get to show the mandatory entity search example.
You are not interested in documents but in things: the person/artist Kendrick Lamar, referred to here by his former stage name.
So it is like web search, but the units of retrieval are real-life entities, which means we can collect data for them. This is what we try to leverage in this work.
July 31st vs. after August 7th -> added content, new word associations.
This looks a bit extreme because there’s swearing, but there’s a serious intuition here: the vocabulary gap (formal KB vs. informal chatter).
Our method aims to leverage this: enrich the representation and close the gap using the collective intelligence / description sources.
We look at a scenario where the expansions arrive in a streaming manner.
Fielded document representation
You could do vanilla retrieval, but two challenges arise:
description sources differ along several dimensions (e.g., volume, quality, novelty), and head entities are likely to receive a larger number of external descriptions than tail entities;
content changes over time, so expansions may accumulate and “swamp” the representation.
Our solution is to dynamically learn how to combine the fields into a single representation.
Features (more detail in the paper; see the sketch after this list):
field similarity features (per field): query–field similarity scores
field importance features (per field), informing the ranker of the status of the field at that time (i.e., how much content it has and how novel it is)
entity importance features (to favor “recently” updated entities)
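Written compactly, and assuming a linear combination purely for illustration (the actual learner may differ), the dynamically learned scoring function looks roughly like:

```latex
\mathrm{score}(q, e; t) =
  \sum_{f \in F} \Big( w^{\mathrm{sim}}_{f}(t)\,\mathrm{sim}(q, e_f)
                     + w^{\mathrm{len}}_{f}(t)\,|e_f|
                     + w^{\mathrm{nov}}_{f}(t)\,\mathrm{nov}_t(e_f) \Big)
  + w^{\mathrm{rec}}(t)\,\Delta t_{\mathrm{update}}(e)
```

Here sim(q, e_f) is the query–field similarity, |e_f| and nov_t(e_f) capture the length and novel terms of field f at time t, Δt_update(e) is the time since the entity's last update, and the weights w(t) are re-learned from the clicks observed up to time t.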
(What about the experimental setup?)
We took all queries that yield Wikipedia clicks.
Top-k retrieval, then feature extraction.
This allows us to track performance over time.
In this talk I focus on the contribution of the sources and on the adaptive vs. static ranker.
1. Each source contributes to better ranking; tags and web anchors do best, and tweets are significantly better than the KB-only baseline.
2. Dynamic sources have higher “learning rates” (which suggests that newly incoming data is successfully incorporated).
3. Tags start below web anchors but approach them; new tags improve performance.
[NEXT] To see the effect of incoming data, we look at the feature weights:
- Static weights go down, dynamic weights go up (which suggests retraining is important with dynamic expansions).
- The tweet weight goes up only marginally, but since we know KB+tweets > KB, the tweets do help.
- Not shown: static expansions stay roughly the same.
[NEXT] Increasing field weights combined with increasing performance suggest retraining is needed. Next:
1. [LEFT] Lower performance overall (more data without more training queries).
2. [LEFT] The dynamic sources show higher slopes, so newly incoming data does help even in the static setting.
3. [RIGHT] Same patterns, but tags + web do comparatively better (because of swamping?).
[END] Higher performance: retraining increases the ranker’s ability to optimally combine descriptions into a single representation.
More data helps, but to benefit optimally you need to inform your ranker.