
Entity Linking


Lecture given at the 10th Russian Summer School in Information Retrieval (RuSSIR 2016)


  1. Entity Linking. 10th Russian Summer School in Information Retrieval (RuSSIR 2016) | Saratov, Russia, 2016. Krisztian Balog, University of Stavanger | @krisztianbalog
  2. From Named Entity Recognition to Entity Linking
  3. Named entity recognition (NER) - Also known as entity identification, entity extraction, and entity chunking - Task: identifying named entities in text and labeling them with one of the possible entity types - Person (PER), organization (ORG), location (LOC), miscellaneous (MISC). Sometimes also temporal expressions (TIMEX) and certain types of numerical expressions (NUMEX). Example: <LOC>Silicon Valley</LOC> venture capitalist <PER>Michael Moritz</PER> said that today's billion-dollar "unicorn" startups can learn from <ORG>Apple</ORG> founder <PER>Steve Jobs</PER>
  4. Named entity disambiguation - Also called named entity normalization and named entity resolution - Task: assign ambiguous entity names to canonical entities from some catalog - It is usually assumed that entities have already been recognized in the input text (i.e., it has been processed by a NER system)
  5. Wikification - Named entity disambiguation using Wikipedia as the catalog of entities - Also annotates concepts, not only entities
  6. Entity linking - Task: recognizing entity mentions in text and linking them to the corresponding entries in a knowledge base (KB) - Limited to recognizing entities for which a target entry exists in the reference KB; each KB entry is a candidate - It is assumed that the document provides sufficient context for disambiguating entities - Knowledge base (working definition): - A catalog of entities, each with one or more names (surface forms), links to other entities, and, optionally, a textual description - Wikipedia, DBpedia, Freebase, YAGO, etc.
  7. Overview of entity annotation tasks:

Task | Recognition | Assignment
Named entity recognition | entities | entity type
Named entity disambiguation | entities | entity ID / NIL
Wikification | entities and concepts | entity ID / NIL
Entity linking | entities | entity ID
  8. Entity linking in action
  9. Entity linking in action
  10. Anatomy of an entity linking system: document → Mention detection → Candidate selection → Disambiguation → entity annotations. Mention detection: identification of text snippets that can potentially be linked to entities. Candidate selection: generating a set of candidate entities for each mention. Disambiguation: selecting a single entity (or none) for each mention, based on the context.
  11. Mention detection (first step of the pipeline: document → Mention detection → Candidate selection → Disambiguation → entity annotations)
  12. Mention detection - Goal: detect all "linkable" phrases - Challenges: - Recall oriented: do not miss any entity that should be linked - Find entity name variants - E.g., "jlo" is a name variant of [Jennifer Lopez] - Filter out inappropriate ones - E.g., "new york" matches >2k different entities
  13. Common approach (see the sketch below): 1. Build a dictionary of entity surface forms - Entities with all name variants 2. Check all document n-grams against the dictionary - The value of n is typically set between 6 and 8 3. Filter out undesired entities - Can be done here or later in the pipeline
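A minimal sketch of steps 1-2 in Python (the function name, toy dictionary, and token-based matching are illustrative assumptions, not from the lecture). Overlapping matches are kept here; slide 22 discusses how to deal with them.

```python
from typing import Dict, List, Set, Tuple

def detect_mentions(tokens: List[str],
                    surface_forms: Dict[str, Set[str]],
                    max_n: int = 8) -> List[Tuple[int, int, str]]:
    """Check all document n-grams (n <= max_n) against the surface form
    dictionary; return (start, end, surface form) spans. Overlapping
    mentions are kept and can be resolved later in the pipeline."""
    mentions = []
    for n in range(max_n, 0, -1):  # longer spans first
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n]).lower()
            if ngram in surface_forms:
                mentions.append((i, i + n, ngram))
    return mentions

# Toy dictionary; a real one is built from Wikipedia (slides 15-20).
surface_forms = {
    "empire state building": {"Empire_State_Building"},
    "empire state": {"Empire_State_(band)", "Empire_State_Building"},
    "times square": {"Times_Square", "Times_Square_(Hong_Kong)"},
}
tokens = "Home to the Empire State Building and Times Square".split()
print(detect_mentions(tokens, surface_forms))
# [(3, 6, 'empire state building'), (3, 5, 'empire state'), (7, 9, 'times square')]
```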
  14. Example: "Home to the Empire State Building, Times Square, Statue of Liberty and other iconic sites, New York City is a fast-paced, globally influential center of art, culture, fashion and finance."

Surface form (s) | Entities (E_s)
Empire State Building | Empire_State_Building
Empire State | Empire_State_(band), Empire_State_Building, Empire_State_Film_Festival
Empire | British_Empire, Empire_(magazine), First_French_Empire, Galactic_Empire_(Star_Wars), Holy_Roman_Empire, Roman_Empire, ...
Times Square | Times_Square, Times_Square_(Hong_Kong), Times_Square_(IRT_42nd_Street_Shuttle), ...
  15. Surface form dictionary construction from Wikipedia - Page title - Canonical (most common) name of the entity
  16. Surface form dictionary construction from Wikipedia - Page title - Redirect pages - Alternative names that are frequently used to refer to an entity
  17. Surface form dictionary construction from Wikipedia - Page title - Redirect pages - Disambiguation pages - List of entities that share the same name
  18. Surface form dictionary construction from Wikipedia - Page title - Redirect pages - Disambiguation pages - Anchor texts - of links pointing to the entity's Wikipedia page
  19. Surface form dictionary construction from Wikipedia - Page title - Redirect pages - Disambiguation pages - Anchor texts - Bold texts from first paragraph - these generally denote other name variants of the entity
  20. Surface form dictionary construction from other sources - Anchor texts from external web pages pointing to Wikipedia articles - Problem of synonym discovery - Expanding acronyms - Leveraging search results or query-click logs from a web search engine - ...
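The construction itself is mostly bookkeeping. A hedged sketch, assuming the page titles, redirect pairs, and anchor-text pairs have already been extracted from a Wikipedia dump (the input formats are assumptions for illustration; disambiguation pages and bold first-paragraph texts would be folded in the same way):

```python
from collections import defaultdict
from typing import Iterable, Tuple

def build_surface_form_dictionary(titles: Iterable[str],
                                  redirects: Iterable[Tuple[str, str]],
                                  anchors: Iterable[Tuple[str, str]]):
    """Merge Wikipedia-derived name variants into one dictionary mapping
    a lowercased surface form to the set of candidate entities."""
    dictionary = defaultdict(set)
    for title in titles:  # page title: canonical name of the entity
        dictionary[title.lower()].add(title.replace(" ", "_"))
    for alias, target in redirects:  # redirect pages: alternative names
        dictionary[alias.lower()].add(target.replace(" ", "_"))
    for anchor, target in anchors:  # anchor texts of incoming links
        dictionary[anchor.lower()].add(target.replace(" ", "_"))
    return dictionary

d = build_surface_form_dictionary(
    titles=["Empire State Building"],
    redirects=[("Empire State Bldg", "Empire State Building")],
    anchors=[("the Empire State", "Empire State Building")])
print(d["empire state bldg"])  # {'Empire_State_Building'}
```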
  21. Filtering mentions (see the sketch below) - Filter out mentions that are unlikely to be linked to any entity - Keyphraseness: $P(\mathrm{keyphrase}|m) = \frac{|D_{link}(m)|}{|D(m)|}$, where $|D_{link}(m)|$ is the number of Wikipedia articles where m appears as a link and $|D(m)|$ is the number of Wikipedia articles that contain m - Link probability: $P(\mathrm{link}|m) = \frac{link(m)}{freq(m)}$, where $link(m)$ is the number of times mention m appears as a link and $freq(m)$ is the total number of times mention m occurs in Wikipedia (as a link or not)
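Both statistics reduce to simple ratios once the corpus counts are available. A sketch (function names and the example counts are made up):

```python
def keyphraseness(n_docs_as_link: int, n_docs_containing: int) -> float:
    """P(keyphrase|m) = |D_link(m)| / |D(m)|"""
    return n_docs_as_link / n_docs_containing if n_docs_containing else 0.0

def link_probability(n_times_linked: int, n_occurrences: int) -> float:
    """P(link|m) = link(m) / freq(m)"""
    return n_times_linked / n_occurrences if n_occurrences else 0.0

# Made-up counts: a phrase that is usually linked when it occurs is a
# good mention candidate; a generic phrase that is almost never linked
# gets filtered out.
print(link_probability(11000, 12500))  # high -> keep
print(link_probability(30, 45000))     # low  -> filter out
```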
  22. Overlapping entity mentions - Dealing with them in this phase - E.g., by dropping a mention if it is subsumed by another mention - Keeping them and postponing the decision to a later stage (candidate selection or disambiguation)
  23. Candidate selection (second step of the pipeline: document → Mention detection → Candidate selection → Disambiguation → entity annotations)
  24. Candidate selection - Goal: narrow down the space of disambiguation possibilities - Balances between precision and recall (effectiveness vs. efficiency) - Often approached as a ranking problem - Keeping only candidates above a score/rank threshold for downstream processing
  25. Commonness - Perform the ranking of candidate entities based on their overall popularity, i.e., "most common sense": $P(e|m) = \frac{n(m, e)}{\sum_{e'} n(m, e')}$, where $n(m, e)$ is the number of times entity e is the link destination of mention m and the denominator is the total number of times mention m appears as a link
  26. Example: "Home to the Empire State Building, Times Square, Statue of Liberty and other iconic sites, New York City is a fast-paced, globally influential center of art, culture, fashion and finance."

Entity (e) | Commonness ($P(e|m)$)
Times_Square | 0.940
Times_Square_(film) | 0.017
Times_Square_(Hong_Kong) | 0.011
Times_Square_(IRT_42nd_Street_Shuttle) | 0.006
... | ...

  27. Commonness - Can be pre-computed and stored in the entity surface form dictionary - Follows a power law with a long tail of extremely unlikely senses; entities at the tail end of the distribution can be safely discarded - E.g., 0.001 is a sensible threshold (see the sketch below)
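Since commonness depends only on Wikipedia link statistics, it can be computed once offline. A sketch with made-up counts, applying the 0.001 tail threshold from the slide:

```python
from collections import Counter
from typing import Dict

def commonness(link_counts: Counter, threshold: float = 0.001) -> Dict[str, float]:
    """P(e|m) = n(m, e) / sum over e' of n(m, e'); senses in the long
    tail (below the threshold) are discarded."""
    total = sum(link_counts.values())
    return {e: n / total for e, n in link_counts.items()
            if n / total >= threshold}

# Made-up link-destination counts for the mention "times square".
counts = Counter({"Times_Square": 9400, "Times_Square_(film)": 170,
                  "Times_Square_(Hong_Kong)": 110,
                  "Times_Square_(IRT_42nd_Street_Shuttle)": 60,
                  "Times_Square_(novel)": 2})
print(commonness(counts))  # the novel sense (~0.0002) is dropped
```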
  28. Example: "Bulgaria's best World Cup performance was in the 1994 World Cup where they beat Germany, to reach the semi-finals, losing to Italy, and finishing in fourth ..."

Entity | Commonness
Germany | 0.9417
Germany_national_football_team | 0.0139
Nazi_Germany | 0.0081
German_Empire | 0.0065
... | ...

Entity | Commonness
FIFA_World_Cup | 0.2358
FIS_Alpine_Ski_World_Cup | 0.0682
2009_FINA_Swimming_World_Cup | 0.0633
World_Cup_(men's_golf) | 0.0622
... | ...

Entity | Commonness
1998_FIFA_World_Cup | 0.9556
1998_IAAF_World_Cup | 0.0296
1998_Alpine_Skiing_World_Cup | 0.0059
... | ...
  29. Disambiguation (third step of the pipeline: document → Mention detection → Candidate selection → Disambiguation → entity annotations)
  30. Disambiguation - Baseline approach: most common sense - Consider additional types of evidence: - Prior importance of entities and mentions - Contextual similarity between the text surrounding the mention and the candidate entity - Coherence among all entity linking decisions in the document - Combine these signals - Using supervised learning or graph-based approaches - Optionally perform pruning - Reject low-confidence or semantically meaningless annotations
  31. Prior importance features - Context-independent features - Neither the text nor other mentions in the document are taken into account - Keyphraseness - Link probability - Commonness
  32. Prior importance features - Link prior: popularity of the entity measured in terms of incoming links, $P_{link}(e) = \frac{link(e)}{\sum_{e'} link(e')}$ - Page views: popularity of the entity measured in terms of traffic volume, $P_{pageviews}(e) = \frac{pageviews(e)}{\sum_{e'} pageviews(e')}$
  33. Contextual features - Compare the surrounding context of a mention with the (textual) representation of the given candidate entity - Context of a mention: - Window of text (sentence, paragraph) around the mention - Entire document - Entity's representation: - Wikipedia entity page, first description paragraph, terms with highest TF-IDF score, etc. - Entity's description in the knowledge base
  34. Contextual similarity - Commonly: bag-of-words representation with cosine similarity, $sim_{cos}(m, e) = \frac{\vec{d}_m \cdot \vec{d}_e}{\|\vec{d}_m\| \, \|\vec{d}_e\|}$ (see the sketch below) - Many other options for measuring similarity - Dot product, KL divergence, Jaccard similarity - Representation does not have to be limited to bag-of-words - Concept vectors (named entities, Wikipedia categories, anchor text, keyphrases, etc.)
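A minimal sketch of the bag-of-words variant, using raw term counts as weights (TF-IDF weighting would slot in the same way; the example texts are illustrative):

```python
import math
from collections import Counter

def cosine_similarity(d_m: Counter, d_e: Counter) -> float:
    """sim_cos(m, e): cosine between the term vector of the mention's
    context and the term vector of the entity's representation."""
    dot = sum(w * d_e.get(t, 0) for t, w in d_m.items())
    norm_m = math.sqrt(sum(w * w for w in d_m.values()))
    norm_e = math.sqrt(sum(w * w for w in d_e.values()))
    return dot / (norm_m * norm_e) if norm_m and norm_e else 0.0

context = Counter("home to the empire state building and times square".split())
entity_desc = Counter("times square is a major commercial intersection".split())
print(cosine_similarity(context, entity_desc))
```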
  35. Entity-relatedness features - It can reasonably be assumed that a document focuses on one or at most a few topics - Therefore, entities mentioned in a document should be topically related to each other - Capturing topical coherence by developing some measure of relatedness between (linked) entities - Defined for pairs of entities
  36. Wikipedia Link-based Measure (WLM) - Often referred to simply as relatedness - A close relationship is assumed between two entities if there is a large overlap between the sets of entities linking to them: $WLM(e, e') = 1 - \frac{\log(\max(|L_e|, |L_{e'}|)) - \log(|L_e \cap L_{e'}|)}{\log(|E|) - \log(\min(|L_e|, |L_{e'}|))}$, where $L_e$ is the set of entities that link to e and $|E|$ is the total number of entities
  37. Wikipedia Link-based Measure (WLM) - Illustration: obtaining a semantic relatedness measure between Automobile and Global Warming from their incoming and outgoing Wikipedia links (shared neighbors include Fossil Fuel, Carbon Dioxide, Air Pollution, Greenhouse Gas, and Alternative Fuel). Image taken from Milne and Witten (2008a), "An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links", AAAI WikiAI Workshop.
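A direct transcription of the formula as a sketch; the argument names are assumptions (`links_e1`/`links_e2` are the sets of entities linking to each entity, `n_entities` is |E|), and the measure is clipped at 0 for disjoint link sets:

```python
import math

def wlm(links_e1: set, links_e2: set, n_entities: int) -> float:
    """Wikipedia Link-based Measure (Milne & Witten, 2008a), computed
    from the overlap of the entities' incoming link sets."""
    overlap = len(links_e1 & links_e2)
    if overlap == 0:
        return 0.0  # no shared in-links: treat as unrelated
    larger = max(len(links_e1), len(links_e2))
    smaller = min(len(links_e1), len(links_e2))
    distance = ((math.log(larger) - math.log(overlap)) /
                (math.log(n_entities) - math.log(smaller)))
    return max(0.0, 1.0 - distance)

# Toy example: two entities sharing 3 of their in-links.
print(wlm({1, 2, 3, 4, 5}, {3, 4, 5, 6}, n_entities=5_000_000))
```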
  38. Asymmetric relatedness features - A relatedness function does not have to be symmetric - E.g., the relatedness of the UNITED STATES given NEIL ARMSTRONG is intuitively larger than the relatedness of NEIL ARMSTRONG given the UNITED STATES - Conditional probability: $P(e'|e) = \frac{|L_{e'} \cap L_e|}{|L_e|}$
  39. Entity-relatedness features - Numerous ways to define relatedness - Consider not only incoming, but also outgoing links, or the union of incoming and outgoing links - Jaccard similarity, Pointwise Mutual Information (PMI), or the Chi-square statistic, etc. - Having a single relatedness function is preferred, to keep the disambiguation process simple - Various relatedness measures can effectively be combined into a single score using a machine learning approach [Ceccarelli et al., 2013]
  40. Overview of features
  41. Disambiguation approaches - Consider local compatibility (including prior evidence) and coherence with the other entity linking decisions - Task: find the assignment $\Gamma : M_d \to E \cup \{\emptyset\}$ of mentions to entities (or NIL) - Objective function: $\Gamma^* = \arg\max_{\Gamma} \big( \sum_{(m,e) \in \Gamma} \phi(m, e) + \psi(\Gamma) \big)$, where $\phi(m, e)$ is the local compatibility between the mention and the assigned entity and $\psi(\Gamma)$ is a coherence function over all entity annotations in the document - This optimization problem is NP-hard! Need to resort to approximation algorithms and heuristics
  42. Disambiguation strategies - Individually, one-mention-at-a-time: rank the candidates for each mention and take the top-ranked one (or NIL), $\Gamma(m) = \arg\max_{e \in E_m} score(m, e)$ - Interdependence between entity linking decisions may be incorporated in a pairwise fashion - Collectively, all mentions in the document jointly
  43. Overview of disambiguation approaches:

Approach | Context | Entity interdependence
Most common sense | none | none
Individual local disambiguation | text | none
Individual global disambiguation | text & entities | pairwise
Collective disambiguation | text & entities | collective
  44. Individual local disambiguation - Early entity linking approaches - Local compatibility score can be written as a linear combination of features (both context-independent and context-dependent): $\phi(e, m) = \sum_i \lambda_i f_i(e, m)$ - Learn the "optimal" combination of features from training data using machine learning
  45. Individual global disambiguation - Consider what other entities are mentioned in the document - True global optimization would be NP-hard - A good approximation can be computed efficiently by considering pairwise interdependencies for each mention independently - Pairwise entity relatedness scores need to be aggregated into a single number (how coherent the given candidate entity is with the rest of the entities in the document)
  46. Approach #1: GLOW [Ratinov et al., 2011] - First perform local disambiguation to obtain a solution $E^*$ - The coherence of entity e with all other linked entities in the document: $g_j(e, E^*) = \bigoplus_{e' \in E^* \setminus \{e\}} f_j(e, e')$, where $\bigoplus$ is an aggregator function (max or avg) and $f_j$ is a particular relatedness function - Use the predictions in a second, global disambiguation round: $score(m, e) = \sum_i \lambda_i f_i(e, m) + \sum_j \mu_j g_j(e, E^*)$
  47. Approach #2: TAGME [Ferragina & Scaiella, 2010] - Combine the two most important features (commonness and relatedness) using a voting scheme - The score of a candidate entity for a particular mention: $score(m, e) = \sum_{m' \in M_d \setminus \{m\}} vote(m', e)$ - The vote function estimates the agreement between e and all candidate entities of all other mentions in the document
  48. TAGME (voting mechanism) - Average relatedness between each possible disambiguation of m', weighted by its commonness score: $vote(m', e) = \frac{\sum_{e' \in E_{m'}} WLM(e, e') \, P(e'|m')}{|E_{m'}|}$
  49. TAGME (final score) - Final decision uses a simple but robust heuristic: the top-$\epsilon$ entities with the highest score are considered for a given mention, and among them the one with the highest commonness score is selected: $\Gamma(m) = \arg\max_{e \in E_m} \{P(e|m) : e \in top_\epsilon[score(m, e)]\}$, with $score(m, e) = \sum_{m' \in M_d \setminus \{m\}} vote(m', e)$ (see the sketch below)
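A compact sketch of the voting scheme and the top-ε heuristic, under assumed data structures: `mentions` maps each mention to its candidate entities with commonness scores, `relatedness` is a WLM-like function, and ε = 0.3 is an illustrative value, not the one from the paper:

```python
from typing import Callable, Dict

def vote(cands_m2: Dict[str, float], e: str,
         relatedness: Callable[[str, str], float]) -> float:
    """vote(m', e): average relatedness between e and each candidate
    sense e' of mention m', weighted by commonness P(e'|m')."""
    if not cands_m2:
        return 0.0
    return sum(relatedness(e, e2) * p for e2, p in cands_m2.items()) / len(cands_m2)

def tagme(mentions: Dict[str, Dict[str, float]],
          relatedness: Callable[[str, str], float],
          epsilon: float = 0.3) -> Dict[str, str]:
    """score(m, e) sums votes from all other mentions; among the
    top-epsilon scored candidates, pick the one with highest commonness."""
    linked = {}
    for m, cands in mentions.items():
        scores = {e: sum(vote(mentions[m2], e, relatedness)
                         for m2 in mentions if m2 != m)
                  for e in cands}
        ranked = sorted(scores, key=scores.get, reverse=True)
        top = ranked[:max(1, int(len(ranked) * epsilon))]
        linked[m] = max(top, key=lambda e: cands[e])
    return linked
```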
  50. Collective disambiguation - Graph-based representation - Mention-entity edges capture the local compatibility between the mention and the entity - Measured using a combination of context-independent and context-dependent features - Entity-entity edges represent the semantic relatedness between a pair of entities - Common choice is relatedness (WLM) - Use these relations jointly to identify a single referent entity (or none) for each of the mentions
  51. Example: a Referent Graph with mention nodes ("Space Jam", "Bulls", "Jordan") connected to candidate entity nodes (Space_Jam; Chicago_Bulls, Bull; Michael_Jordan, Michael_I._Jordan, Michael_B._Jordan) by weighted compatibility edges (0.66, 0.82, 0.13, 0.01, 0.20, 0.12, 0.03, 0.08), with semantic-relatedness edges between entities. Figure taken from Han et al. (2011), "The Referent Graph of Example 1".
  52. AIDA [Hoffart et al., 2011] - Problem formulation: find a dense subgraph that contains all mention nodes and exactly one mention-entity edge for each mention - Greedy algorithm iteratively removes edges
  53. Algorithm (see the sketch below) - Start with the full graph - Iteratively remove the entity node with the lowest weighted degree (along with all its incident edges), provided that each mention node remains connected to at least one entity - Weighted degree of an entity node is the sum of the weights of its incident edges - The graph with the highest density is kept as the solution - The density of the graph is measured as the minimum weighted degree among its entity nodes
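A sketch of the greedy loop under a simple graph encoding (all names are illustrative): `edges[u]` maps each neighbor of node u to the edge weight, covering both mention-entity and entity-entity edges, and every node is assumed to have an entry in `edges`:

```python
from typing import Dict, Set

def aida_greedy(edges: Dict[str, Dict[str, float]],
                mentions: Set[str], entities: Set[str]) -> Set[str]:
    """Iteratively drop the entity node with the lowest weighted degree,
    as long as every mention keeps at least one candidate entity; return
    the entity set of the densest graph seen (density = minimum weighted
    degree over entity nodes)."""
    active = set(entities)

    def wdeg(e: str) -> float:  # sum of weights of e's surviving edges
        return sum(w for v, w in edges[e].items()
                   if v in active or v in mentions)

    def density() -> float:
        return min(wdeg(e) for e in active)

    best, best_density = set(active), density()
    while True:
        # entities whose removal leaves every mention with >= 1 candidate
        removable = [e for e in active
                     if all(any(m in edges[e2] for e2 in active - {e})
                            for m in mentions if m in edges[e])]
        if not removable:
            return best
        active.remove(min(removable, key=wdeg))
        if density() > best_density:
            best, best_density = set(active), density()
```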
  54. Example iteration #1 (on the Referent Graph above) - Entity weighted degrees: 0.95, 0.86, 0.03, 1.56, 0.12 - The entity with the lowest weighted degree (0.03) is removed - Density of the graph: 0.03
  55. Example iteration #2 - Of the remaining entities, the one with the lowest weighted degree (0.12) is removed next - Density of the graph: 0.12
  56. Example iteration #3 - Removal continues while every mention keeps at least one candidate entity - Density of the graph: 0.86
  57. Pre- and post-processing - Pre-processing phase: remove entities that are "too distant" from the mention nodes - At the end of the iterations, the solution graph may still contain mentions that are connected to more than one entity; deal with this in post-processing - If the graph is sufficiently small, it is feasible to exhaustively consider all possible mention-entity pairs - Otherwise, a faster local (hill-climbing) search algorithm may be used
  58. Random walk with restarts [Han et al., 2011] - "A random walk with restart is a stochastic process to traverse a graph, resulting in a probability distribution over the vertices corresponding to the likelihood those vertices are visited" [Guo et al., 2014] - Starting vector holding the initial evidence: $v(m) = \frac{tfidf(m)}{\sum_{m' \in M_d} tfidf(m')}$
  59. Random walk process - T is the evidence propagation matrix, with propagation ratios $T(m \to e) = \frac{w(m \to e)}{\sum_{e' \in E_m} w(m \to e')}$ and $T(e \to e') = \frac{w(e \to e')}{\sum_{e'' \in E_d} w(e \to e'')}$ - Evidence propagation: $r^0 = v$, $r^{t+1} = (1 - \alpha) \, r^t \, T + \alpha \, v$, where r is the vector holding the probability that the nodes are visited, $\alpha$ is the restart probability (0.1), and v is the initial evidence
  60. Final decision - Once the random walk process has converged to a stationary distribution, the referent entity for each mention is selected as $\Gamma(m) = \arg\max_{e \in E_m} \phi(m, e) \times r(e)$, where $\phi(e, m) = \sum_i \lambda_i f_i(e, m)$ is the local compatibility (see the sketch below)
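The propagation itself is a few lines of linear algebra. A sketch, assuming T has already been assembled as a row-stochastic matrix over the mention and entity nodes and v holds the normalized initial evidence:

```python
import numpy as np

def random_walk_with_restarts(T: np.ndarray, v: np.ndarray,
                              alpha: float = 0.1,
                              tol: float = 1e-8) -> np.ndarray:
    """Iterate r_{t+1} = (1 - alpha) * r_t @ T + alpha * v until the
    distribution over nodes converges; alpha is the restart probability."""
    r = v.copy()
    while True:
        r_next = (1.0 - alpha) * (r @ T) + alpha * v
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

# Toy 3-node graph (rows sum to 1); all evidence starts at node 0.
T = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
v = np.array([1.0, 0.0, 0.0])
print(random_walk_with_restarts(T, v))
```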
  61. Pruning - Discarding meaningless or low-confidence annotations produced by the disambiguation phase - Simplest solution: use a confidence threshold - More advanced solutions - Machine-learned classifier to retain only entities that are "relevant enough" (a human editor would annotate them) [Milne & Witten, 2008] - Optimization problem: decide, for each mention, whether switching the top-ranked disambiguation to NIL would improve the objective function [Ratinov et al., 2011]
  62. Entity linking systems
  63. Publicly available entity linking systems:

System | Reference KB | Online demo | Web API | Source code
AIDA (http://www.mpi-inf.mpg.de/yago-naga/aida/) | YAGO2 | Yes | Yes | Yes (Java)
DBpedia Spotlight (http://spotlight.dbpedia.org/) | DBpedia | Yes | Yes | Yes (Java)
Illinois Wikifier (http://cogcomp.cs.illinois.edu/page/download_view/Wikifier) | Wikipedia | No | No | Yes (Java)
TAGME (https://tagme.d4science.org/tagme/) | Wikipedia | Yes | Yes | Yes (Java)
Wikipedia Miner (https://github.com/dnmilne/wikipediaminer) | Wikipedia | No | No | Yes (Java)
  64. Publicly available entity linking systems - AIDA [Hoffart et al., 2011] - Collective disambiguation using a graph-based approach (see under the Disambiguation section) - DBpedia Spotlight [Mendes et al., 2011] - Straightforward local disambiguation approach - Vector space representations of entities are compared against the paragraphs of their mentions using cosine similarity - Introduces Inverse Candidate Frequency (ICF) instead of using IDF - Annotates with DBpedia entities, which can be restricted to certain types or even to a custom entity set defined by a SPARQL query
  65. Publicly available entity linking systems - Illinois Wikifier (a.k.a. GLOW) [Ratinov et al., 2011] - Individual global disambiguation (see under the Disambiguation section) - TAGME [Ferragina & Scaiella, 2011] - One of the most popular entity linking systems - Designed specifically for efficient annotation of short texts, but works well on long texts too - Individual global disambiguation (see under the Disambiguation section) - Wikipedia Miner [Milne & Witten, 2008] - Seminal entity linking system that was the first to: - combine commonness and relatedness for (local) disambiguation - have an open-source implementation and be offered as a web service
  66. Evaluation
  67. Evaluation (end-to-end) - Comparing the system-generated annotations against a human-annotated gold standard - Evaluation criteria: - Perfect match: both the linked entity and the mention offsets must match - Relaxed match: the linked entity must match; it is sufficient if the mention overlaps with the gold standard
  68. Evaluation with relaxed match - Example #1 and Example #2 (figures) contrast ground truth and system annotation mention spans that overlap without matching exactly
  69. Evaluation metrics - Set-based metrics: - Precision: fraction of the annotations generated by the system that are correctly linked - Recall: fraction of the gold-standard annotations that are correctly linked by the system - F-measure: harmonic mean of precision and recall - Metrics are computed over a collection of documents - Micro-averaged: aggregated across mentions - Macro-averaged: aggregated across documents
  70. Evaluation metrics, with $\hat{A}$ denoting the ground truth annotations and A the annotations generated by the entity linking system (see the sketch below) - Micro-averaged: $P_{mic} = \frac{|A_D \cap \hat{A}_D|}{|A_D|}$, $R_{mic} = \frac{|A_D \cap \hat{A}_D|}{|\hat{A}_D|}$ - Macro-averaged: $P_{mac} = \sum_{d \in D} \frac{|A_d \cap \hat{A}_d|}{|A_d|} / |D|$, $R_{mac} = \sum_{d \in D} \frac{|A_d \cap \hat{A}_d|}{|\hat{A}_d|} / |D|$ - F1 score: $F_1 = \frac{2 \times P \times R}{P + R}$
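A sketch of the micro- and macro-averaged metrics, where each document's annotations are represented as a set of hashable items, e.g. (mention offsets, entity) pairs (function names are illustrative):

```python
from typing import List, Set, Tuple

def prf(system: Set, gold: Set) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one annotation set."""
    correct = len(system & gold)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def evaluate(docs: List[Tuple[Set, Set]]):
    """docs: one (system annotations, gold annotations) pair per document.
    Micro-averaging pools annotations across documents (tagged with the
    document index so identical annotations from different documents do
    not collide); macro-averaging averages per-document P and R."""
    pooled_sys = {(i, a) for i, (s, _) in enumerate(docs) for a in s}
    pooled_gold = {(i, a) for i, (_, g) in enumerate(docs) for a in g}
    micro = prf(pooled_sys, pooled_gold)
    per_doc = [prf(s, g) for s, g in docs]
    macro_p = sum(p for p, _, _ in per_doc) / len(docs)
    macro_r = sum(r for _, r, _ in per_doc) / len(docs)
    return micro, (macro_p, macro_r)
```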
  71. Test collections by individual researchers:

Collection | Reference KB | Documents | #Docs | Annotations | #Mentions | All? | NIL
MSNBC [Cucerzan, 2007] | Wikipedia | News | 20 | entities | 755 | Yes | Yes
AQUAINT [Milne & Witten, 2008] | Wikipedia | News | 50 | entities & concepts | 727 | No | No
IITB [Kulkarni et al., 2009] | Wikipedia | Web pages | 107 | entities | 17,200 | Yes | Yes
ACE2004 [Ratinov et al., 2011] | Wikipedia | News | 57 | entities | 306 | Yes | Yes
CoNLL-YAGO [Hoffart et al., 2011] | YAGO2 | News | 1,393 | entities | 34,956 | Yes | Yes
  72. Evaluation campaigns - INEX Link-the-Wiki - TAC Entity Linking - Entity Recognition and Disambiguation (ERD) Challenge
  73. Component-based evaluation - The pipeline architecture makes the evaluation of entity linking systems especially challenging - The main focus is on the disambiguation component, but its performance is largely influenced by the preceding steps - A fair comparison between two approaches can only be made if they share all other elements of the pipeline
  74. Meta evaluations - [Hachey et al., 2009] - Reimplements three state-of-the-art systems - DEXTER [Ceccarelli et al., 2013] - Reimplements TAGME, Wikipedia Miner, and collective EL [Han et al., 2011] - BAT-Framework [Cornolti et al., 2013] - Compares publicly available entity annotation systems (AIDA, Illinois Wikifier, TAGME, Wikipedia Miner, DBpedia Spotlight) - A number of test collections (news, web pages, and tweets) - GERBIL [Usbeck et al., 2015] - Extends the BAT-Framework by being able to link to any knowledge base - Benchmarks any annotation service through a REST interface - Provides persistent URLs for experimental settings (for reproducibility)
  75. Resources
  76. A Cross-Lingual Dictionary for English Wikipedia Concepts - Collecting strings (mentions) that link to Wikipedia articles on the Web - Each dictionary entry records the string (mention), the Wikipedia article, and the number of times the mention is linked to that article - https://research.googleblog.com/2012/05/from-words-to-concepts-and-back.html
  77. Freebase Annotations of the ClueWeb Corpora - ClueWeb annotated with Freebase entities (by Google) - http://lemurproject.org/clueweb09/FACC1/ - http://lemurproject.org/clueweb12/FACC1/ - Each annotation record contains: the name of the document that was annotated, the entity mention, the beginning and end byte offsets, the entity ID in Freebase, the confidence given both the mention and the context, and the confidence given just the context (ignoring the mention)
  78. Entity linking in queries
  79. Entity linking in queries - Challenges: - search queries are short - limited context - lack of proper grammar, spelling - multiple interpretations - needs to be fast
  80. Example: query "the governator" (annotated with both a movie and a person entity)
  81. Example: query "the governator" (movie interpretation)
  82. Example: query "new york pizza manhattan"
  83. Example: query "new york pizza manhattan"
  84. ERD’14 challenge - Task: finding query interpretations - Input: keyword query - Output: sets of sets of entities - Reference KB: Freebase - Annotations are to be performed by a web service within a given time limit
  85. Evaluation - For the query "new york pizza manhattan": ground truth $\hat{I}$ = {New York-style pizza, Manhattan}, {New York City, Manhattan}; system annotation I = {New York City, Manhattan}, {New York-style pizza} - Metrics: $P = \frac{|I \cap \hat{I}|}{|I|}$, $R = \frac{|I \cap \hat{I}|}{|\hat{I}|}$, $F = \frac{2 \cdot P \cdot R}{P + R}$ (see the sketch below)
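A sketch of the interpretation-set metrics, comparing interpretations as frozensets of entities so that two interpretations match only if identical (the example reproduces the slide's query):

```python
def erd_prf(system, gold):
    """P, R, F over sets of interpretations; each interpretation is a
    set of entities."""
    I = {frozenset(i) for i in system}
    I_hat = {frozenset(i) for i in gold}
    correct = len(I & I_hat)
    p = correct / len(I) if I else 0.0
    r = correct / len(I_hat) if I_hat else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = [{"New York-style pizza", "Manhattan"}, {"New York City", "Manhattan"}]
system = [{"New York City", "Manhattan"}, {"New York-style pizza"}]
print(erd_prf(system, gold))  # (0.5, 0.5, 0.5): one of two interpretations matches
```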
  86. ERD’14 results (single interpretation is returned):

Rank | Team | F1 | Latency
1 | SMAPH Team | 0.7076 | 0.49
2 | NTUNLP | 0.6797 | 1.04
3 | Seznam Research | 0.6693 | 3.91

http://web-ngram.research.microsoft.com/erd2014/LeaderBoard.aspx
  87. Hands-on part: http://bit.ly/russir2016-elx
  88. Questions? Contact | @krisztianbalog | krisztianbalog.com
