Dynamic Collective Entity Representations for Entity Ranking

  1. Dynamic Collective Entity Representations for Entity Ranking David Graus, Manos Tsagkias, Wouter Weerkamp, Edgar Meij, Maarten de Rijke
  2. 2
  3. 3
  4. Entity search?
      - Index = Knowledge Base (= Wikipedia)
      - Documents = Entities
      - “Real world entities” have a single representation (in KB)
  5. Representation is not static
      - People talk about entities all the time
      - Associations between words and entities change over time
  6. Example 1: News events
  7. Example 2: Social media chatter
  8. Dynamic Collective Entity Representations
      - Use “collective intelligence” to mine entity descriptions to enrich representation.
      - Is like document expansion (add terms found through explicit links)
      - Is not query expansion (terms found through predicted links)
  9. Advantages
      - Cheap: change the document in the index, leverage tried & tested retrieval algorithms
      - Free “smoothing”: descriptions (e.g., tweets) may capture ‘newly evolving’ word associations (Ferguson shooting) and incorporate out-of-document terms
      - “Move relevant documents closer to queries” (= close the gap between searcher vocabulary & docs in index)
  10. Haven’t we seen this before?
      - Anchors & queries in particular have been shown to improve retrieval [1]
      - Tweets have been shown to be similar to anchors [2]
      - Social tags, same [3]
      - But:
        - in batch (i.e., add data, see how it affects retrieval)
        - single source

      [1] T. Westerveld, W. Kraaij, and D. Hiemstra. Retrieving web pages using content, links, urls and anchors. TREC 2001
      [2] G. Mishne and J. Lin. Twanchor text: A preliminary study of the value of tweets as anchor text. SIGIR ’12
      [3] C.-J. Lee and W. B. Croft. Incorporating social anchors for ad hoc retrieval. OAIR ’13
  11. Description sources (figure: example entity Anthropornis nordenskjoeldi)
      - Sources shown: KB Anchors, KB Categories, KB Redirects, KB Links, Web Anchors, Tags, Tweets, Queries
      - Sample terms in the figure: Anthropornis; Nordenskjoeld's Giant Penguin; emperor penguin; Eocene birds; Oligocene birds; Extinct penguins; Oligocene extinctions; Bird genera; megafauna; biggest penguin; extinct penguin; prehistoric birds
  12. Challenge
      - Heterogeneity
        1. Description sources
        2. Entities
      - Dynamic nature
        - Content changes over time
  13. Method: Adaptive ranking
      - Supervised single-field weighting model
      - Features:
        - field similarity: retrieval score per field.
        - field “importance”: length, novel terms, etc.
        - entity “importance”: time since last update.
      - (Re-)learn optimal weights from clicks
  14. Experimental setup
      1. Data:
         - MSN Query log (62,841 queries + clicks (on entities))
         - Each query is treated as a time unit
         - For each query:
           - Produce ranking
           - Observe click
           - Evaluate ranking (MAP/P@1)
           - Expand entities (w/ dynamic descriptions)
           - [re-train ranker]
  15. Main results
      - Comparing effectiveness of diff. description sources
      - Comparing adaptive vs. non-adaptive ranker performance
  16. Description sources (plot: MAP vs. no. of queries)
  17. Feature weights over time (plot: relative feature importance vs. no. of queries)
  18. Non-adaptive vs. adaptive ranking
  19. In summary
      - Expanding entity representations with different sources enables better matching of queries to entities
      - As new content comes in, it is beneficial to retrain the ranker
      - Informing ranker of “expansion state” further improves performance
  20. Thank you
      - (Also, thank you WSDM & SIGIR travel grants)
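
The fielded representation and streaming expansion described in the slides (Description sources; Method: Adaptive ranking) could be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the `Entity` class, `field_score`, `score`, `rank`, the term-overlap similarity, the hand-set weights, and the toy descriptions are all assumptions made for the example. In the talk, the per-field retrieval scores are features and the weights are learned from clicks.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity as a fielded document: one bag of terms per description source."""
    name: str
    fields: dict = field(default_factory=dict)  # source name -> Counter of terms

    def expand(self, source: str, text: str) -> None:
        """Add a newly arrived description (e.g. a tweet or a query) to one field."""
        self.fields.setdefault(source, Counter()).update(text.lower().split())


def field_score(query: str, terms: Counter) -> float:
    """Toy query-field similarity (assumption): share of query terms found in the field."""
    query_terms = query.lower().split()
    if not query_terms:
        return 0.0
    return sum(1 for t in query_terms if terms[t] > 0) / len(query_terms)


def score(query: str, entity: Entity, weights: dict) -> float:
    """Weighted sum of per-field similarities; the weights are hand-set here."""
    return sum(
        w * field_score(query, entity.fields.get(source, Counter()))
        for source, w in weights.items()
    )


def rank(query: str, entities: list, weights: dict) -> list:
    """Return entities ordered by descending score."""
    return sorted(entities, key=lambda e: score(query, e, weights), reverse=True)


# Toy data (hypothetical descriptions, loosely based on the slide's example entity).
emperor = Entity("Emperor penguin")
emperor.expand("kb", "emperor penguin")
anthropornis = Entity("Anthropornis nordenskjoeldi")
anthropornis.expand("kb", "anthropornis nordenskjoeldi")

weights = {"kb": 1.0, "queries": 1.0}
entities = [emperor, anthropornis]

before = [e.name for e in rank("biggest penguin", entities, weights)]
# A searcher's query arrives as a new description for the extinct penguin.
anthropornis.expand("queries", "biggest penguin")
after = [e.name for e in rank("biggest penguin", entities, weights)]

print(before)
print(after)
```

In this toy run the formal KB text of the extinct penguin shares no terms with the query, so it ranks last at first; once the query text is added to its `queries` field it moves to the top. That is the vocabulary-gap effect the slides describe, shown here on made-up data only.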

Editor's Notes

  1. First entities & structure; I get to show the mandatory entity search example
  2. You are not interested in documents but in things: person/artist Kendrick Lamar, referring to him w/ his former stage name
  3. So it is like web search, but the units of retrieval are real-life entities, so we can collect data for them
  4. This is what we try to leverage in this work
  5. July 31st, after August 7th -> added content, new word associations
  6. This looks a bit extreme because there’s swearing, but there’s a serious intuition here: vocabulary gap (formal KB, informal chatter)
  7. Our method aims to leverage this: enrich the representation + close the gap
  8. of collective intelligence/description sources
  9. We look at a scenario where the expansions come in, in a streaming manner
  10. Fielded document representation
  11. You could do vanilla retrieval, but two challenges arise. First, heterogeneity: description sources differ along several dimensions (e.g., volume, quality, novelty), and head entities are likely to receive a larger number of external descriptions than tail entities. Second, content changes over time, so expansions may accumulate and “swamp” the representation
  12. Our solution is to dynamically learn how to combine fields into a single representation. Features (more detail in paper): field similarity features (per field) = query–field similarity scores; field importance features (per field) to inform the ranker of the status of the field at that time (i.e., more and novel content); entity importance (to favor “recently” updated entities). (What about experimental setup?)
  13. Took all queries that yield Wiki clicks. Top-k retrieval, extract features. This allows us to track performance over time
  14. In this talk I focus on the contribution of sources and adaptive vs. static ranker
  15. (1) Each source contributes to better ranking; tags/web anchors do best, tweets are significantly > KB. (2) Dynamic sources have higher “learning rates” (suggests that newly incoming data is successfully incorporated). (3) Tags start under web anchors but approach them; new tags improve. [NEXT] To see the effect of incoming data: feature weights
  16. Static go down, dynamic go up (suggests retraining is important w/ dynamic expansions). Tweets only marginally, but as we know KB+Tweets > KB, the tweets do help. Not shown: static expansions stay roughly the same. [NEXT] Increasing field weight + increased performance suggests retraining is needed; next:
  17. (1) [LEFT] Lower performance overall (more data w/o more training queries). (2) [LEFT] Dynamic ones have higher slopes, so newly incoming data does help even in the static setting. (3) [RIGHT] Same patterns, but tags+web do comparatively better (because of swamping?). [END] Higher performance: retraining increases the ranker’s ability to optimally combine descriptions into a single representation
  18. More data helps, but to optimally benefit you need to inform your ranker
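
The per-query protocol from the experimental setup (produce ranking, observe click, evaluate with MAP/P@1, expand entities, optionally retrain) might look like the sketch below. It is an illustration under stated assumptions, not the setup used in the talk: `run_stream`, `rank_fn`, `expand_fn`, `retrain_fn`, the two-entry toy log, and the overlap-count ranker are hypothetical, and the clicked query is the only dynamic description used here.

```python
def precision_at_1(ranking, clicked):
    """1.0 if the clicked entity is ranked first, else 0.0."""
    return 1.0 if ranking and ranking[0] == clicked else 0.0


def average_precision(ranking, relevant):
    """Average precision of one ranking given a set of relevant entity ids."""
    hits, total = 0, 0.0
    for position, entity_id in enumerate(ranking, start=1):
        if entity_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def run_stream(log, rank_fn, expand_fn, retrain_fn=None):
    """Treat each (query, clicked entity) pair as one time unit.

    Returns (mean P@1, MAP) over the whole log.
    """
    p1_scores, ap_scores = [], []
    for query, clicked in log:
        ranking = rank_fn(query)                          # produce ranking
        p1_scores.append(precision_at_1(ranking, clicked))  # observe click, evaluate
        ap_scores.append(average_precision(ranking, {clicked}))
        expand_fn(clicked, query)                         # expand the clicked entity
        if retrain_fn is not None:
            retrain_fn(query, ranking, clicked)           # optional re-training hook
    n = len(p1_scores)
    if n == 0:
        return 0.0, 0.0
    return sum(p1_scores) / n, sum(ap_scores) / n


# Toy index (hypothetical): entity id -> set of description terms.
descriptions = {
    "emperor_penguin": {"emperor", "penguin"},
    "anthropornis": {"anthropornis", "nordenskjoeldi"},
}


def rank_fn(query):
    """Assumed ranker: order entities by number of shared query terms."""
    terms = set(query.lower().split())
    return sorted(descriptions, key=lambda e: len(terms & descriptions[e]), reverse=True)


def expand_fn(entity_id, query):
    """Add the clicked query's terms to the entity's description."""
    descriptions[entity_id].update(query.lower().split())


log = [("biggest penguin", "anthropornis"), ("biggest penguin", "anthropornis")]
mean_p1, mean_ap = run_stream(log, rank_fn, expand_fn)
print(mean_p1, mean_ap)  # 0.5 0.75
```

Evaluating before expanding keeps each query unseen at the time it is scored, which is what lets performance be tracked over time as the notes describe. The `retrain_fn` hook is left empty on purpose, since the learner itself is not specified in these slides.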