
Entity Linking in Queries: Efficiency vs. Effectiveness


Slides for the ECIR 2017 paper "Entity Linking in Queries: Efficiency vs. Effectiveness"

Identifying and disambiguating entity references in queries is one of the core enabling components for semantic search. While there is a large body of work on entity linking in documents, entity linking in queries poses new challenges due to the limited context the query provides, coupled with the efficiency requirements of an online setting. Our goal is to gain a deeper understanding of how to approach entity linking in queries, with a special focus on how to strike a balance between effectiveness and efficiency. We divide the task of entity linking in queries into two main steps, candidate entity ranking and disambiguation, and explore both unsupervised and supervised alternatives for each step. Our main finding is that the best overall performance (in terms of efficiency and effectiveness) can be achieved by employing supervised learning for the entity ranking step, while tackling disambiguation with a simple unsupervised algorithm. Using the Entity Recognition and Disambiguation Challenge platform, we further demonstrate that our recommended method achieves state-of-the-art performance.


Entity Linking in Queries: Efficiency vs. Effectiveness

  1. 1. Faegheh Hasibi, Krisztian Balog, Svein E. Bratsberg ECIR 2017 ENTITY LINKING IN QUERIES: Efficiency vs. Effectiveness IAI_group
  2. 2. ENTITY LINKING TOTAL RECALL (1990 FILM) ARNOLD SCHWARZENEGGER PHILIP K. DICK ‣ Identifying entities in the text and linking them to the corresponding entry in the knowledge base Total Recall is a 1990 American science fiction action film directed by Paul Verhoeven, starring Rachel Ticotin, Sharon Stone, Arnold Schwarzenegger, Ronny Cox and Michael Ironside. The film is loosely based on the Philip K. Dick story…
  3. 3. ENTITY LINKING IN QUERIES (ELQ) total recall arnold schwarzenegger → TOTAL RECALL (1990 FILM), ARNOLD SCHWARZENEGGER
  4. 4. ENTITY LINKING IN QUERIES (ELQ) total recall arnold schwarzenegger → TOTAL RECALL (1990 FILM), ARNOLD SCHWARZENEGGER; but total recall alone → TOTAL RECALL (2012 FILM)? TOTAL RECALL (1990 FILM)?
  5. 5. ENTITY LINKING IN QUERIES (ELQ) ‣ Identifying sets of entity linking interpretations: Query → Entity Linking → Interpretation [Carmel et al. 2014], [Hasibi et al. 2015], [Cornolti et al. 2016]
  6. 6. CONVENTIONAL APPROACH A pipeline: Mention detection → Candidate Entity Ranking → Disambiguation. [Figure: for the query "new york pizza manhattan", detected mentions such as "new york" and "new york pizza" are paired with candidate entities (NEW YORK, NEW YORK CITY, NEW YORK-STYLE PIZZA, SICILIAN PIZZA, ...), each pair scored s(m, e) (e.g., 0.9, 0.8, 0.7, 0.1); disambiguation then forms interpretation sets: 1) (new york, NEW YORK), (manhattan, MANHATTAN) and 2) (new york pizza, NEW YORK-STYLE PIZZA), (manhattan, MANHATTAN).]
  7. 7. CONVENTIONAL APPROACH Incorporating different signals: ‣ Contextual similarity between a candidate entity and the text (or entity mention) ‣ Interdependence between all entity linking decisions (extracted from the underlying KB) [Figure: entity graph e1 ... e5]
  8. 8. What is special about entity linking in queries? What is really different when it comes to queries?
  9. 9. RESEARCH QUESTIONS ELQ is an online process and should be done under strict time constraints. RQ1: If we are to allocate the available processing time between the two steps, which one would yield the highest gain?
  10. 10. RESEARCH QUESTIONS The context provided by queries is limited. RQ2: Which group of features is needed the most for effective entity linking: contextual similarity, interdependence between entities, or both?

  11. 11. SYSTEMATIC INVESTIGATION Step 1: Candidate Entity Ranking. METHODS: (1) Unsupervised (MLMcg) (2) Supervised (LTR). Step 2: Disambiguation. METHODS: (1) Unsupervised (Greedy) (2) Supervised (LTR)
  12. 12. SYSTEMATIC INVESTIGATION Step 1: Candidate Entity Ranking. METHODS: (1) Unsupervised (MLMcg) (2) Supervised (LTR). Step 2: Disambiguation. METHODS: (1) Unsupervised (Greedy) (2) Supervised (LTR)
  13. 13. CANDIDATE ENTITY RANKING Goal: (I) identify all possible entities that can be linked in the query, and (II) rank them based on how likely they are link targets. Given a mention m, an entity e, and a query q, estimate score(m, e, q). ‣ Candidates are obtained by lexical matching of query n-grams against a rich dictionary of entity name variants
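A minimal sketch of the candidate generation step described above: every query n-gram is matched against a surface-form dictionary. The `surface_form_index` structure (a mapping from name variants to entity sets, e.g. built from DBpedia labels, redirects, and anchor texts) is a hypothetical stand-in for the paper's actual dictionary, and `max_len` is an assumed cap on mention length.

```python
def query_ngrams(query, max_len=5):
    """Enumerate all word n-grams of the query (candidate mentions)."""
    terms = query.lower().split()
    for i in range(len(terms)):
        for j in range(i + 1, min(i + 1 + max_len, len(terms) + 1)):
            yield " ".join(terms[i:j])

def candidate_entities(query, surface_form_index):
    """Match every query n-gram against a dictionary of entity name
    variants; returns candidate (mention, entity) pairs."""
    candidates = set()
    for mention in query_ngrams(query):
        for entity in surface_form_index.get(mention, ()):
            candidates.add((mention, entity))
    return candidates
```

For the query "total recall arnold schwarzenegger", this would return pairs like (total recall, TOTAL RECALL (1990 FILM)) and (total recall, TOTAL RECALL (2012 FILM)), which the next step then ranks.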
  14. 14. UNSUPERVISED [MLMcg] Generative model for ranking entities based on the query and the specific mention [Hasibi et al. 2015]: $score(m, e, q) = P(e|m) P(q|e)$, where $P(e|m)$ is the probability of the mention being linked to the entity (a.k.a. commonness [22]), computed from the FACC collection [12]. The query likelihood $P(q|e)$ is estimated using the query length normalized language model similarity [20]: $$P(q|e) = \frac{\prod_{t \in q} P(t|\theta_e)^{P(t|q)}}{\prod_{t \in q} P(t|C)^{P(t|q)}}$$ where $P(t|q)$ is the term's relative frequency in the query (i.e., $n(t, q)/|q|$). The entity and collection language models, $P(t|\theta_e)$ and $P(t|C)$, are computed using the Mixture of Language Models (MLM) approach [27].
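A runnable sketch of the MLMcg score under the formulas above. The `commonness` and `collection_prob` callables, the `field_weights` mixture, and Jelinek-Mercer smoothing with `lam` are assumptions made for illustration; the paper's exact estimation details may differ.

```python
import math
from collections import Counter

def mlm_term_prob(term, entity_fields, field_weights, collection_prob, lam=0.1):
    """Mixture of Language Models: P(t|theta_e) as a weighted sum of field
    language models, smoothed with the collection model (Jelinek-Mercer)."""
    p = 0.0
    for field, weight in field_weights.items():
        terms = entity_fields.get(field, [])
        p_field = terms.count(term) / len(terms) if terms else 0.0
        p += weight * ((1 - lam) * p_field + lam * collection_prob(term))
    return p

def mlmcg_score(mention, entity, query, commonness, entity_fields,
                field_weights, collection_prob):
    """score(m, e, q) = P(e|m) * P(q|e), with P(q|e) the query length
    normalized LM similarity from the slide above."""
    q_terms = query.split()
    log_pqe = 0.0
    for term, count in Counter(q_terms).items():
        ptq = count / len(q_terms)          # P(t|q): relative frequency
        pte = mlm_term_prob(term, entity_fields, field_weights, collection_prob)
        ptc = collection_prob(term)         # P(t|C)
        if pte == 0.0 or ptc == 0.0:        # smoothing normally prevents this
            return 0.0
        log_pqe += ptq * (math.log(pte) - math.log(ptc))
    return commonness(entity, mention) * math.exp(log_pqe)
```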
  15. 15. SUPERVISED [LTR] 28 features, mainly from the literature [Meij et al. 2012], [Medelyan et al. 2008]. Table 2. Feature set used for ranking entities, categorized into mention (M), entity (E), mention-entity (ME), and query (Q) features:
Feature | Description | Type
Len(m) | Number of terms in the entity mention | M
NTEM(m)‡ | Number of entities whose title equals the mention | M
SMIL(m)‡ | Number of entities whose title equals part of the mention | M
Matches(m) | Number of entities whose surface form matches the mention | M
Redirects(e) | Number of redirect pages linking to the entity | E
Links(e) | Number of entity out-links in DBpedia | E
Commonness(e, m) | Likelihood of entity e being the target link of mention m | ME
MCT(e, m)‡ | True if the mention contains the title of the entity | ME
TCM(e, m)‡ | True if the title of the entity contains the mention | ME
TEM(e, m)‡ | True if the title of the entity equals the mention | ME
Pos1(e, m) | Position of the 1st occurrence of the mention in the entity abstract | ME
SimM_f(e, m)† | Similarity between mention and field f of the entity; Eq. (1) | ME
LenRatio(m, q) | Mention to query length ratio: |m|/|q| | Q
QCT(e, q) | True if the query contains the title of the entity | Q
TCQ(e, q) | True if the title of the entity contains the query | Q
TEQ(e, q) | True if the title of the entity equals the query | Q
Sim(e, q) | Similarity between query and entity; Eq. (1) | Q
SimQ_f(e, q)† | LM similarity between query and field f of the entity; Eq. (1) | Q
‡ Entity title refers to the rdfs:label predicate of the entity in DBpedia
† Computed for all individual DBpedia fields f ∈ F and also for the field "content" (cf. Sec 4.1)
  16. 16. SUPERVISED [LTR] (same feature set) Features are extracted based on: Mention (M), Entity (E), Mention-entity (ME), and Query (Q)
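To make the feature table concrete, here is a sketch of a few Table 2 features and a pointwise learning-to-rank setup. The learner (scikit-learn's GradientBoostingRegressor) and its settings are illustrative choices, not necessarily the model or hyperparameters used in the paper; `commonness_score` is assumed to be precomputed.

```python
from sklearn.ensemble import GradientBoostingRegressor

def ranking_features(mention, entity_title, query, commonness_score):
    """An illustrative subset of the Table 2 features for one (m, e, q) pair."""
    m, t, q = mention.lower(), entity_title.lower(), query.lower()
    return [
        len(m.split()),                  # Len(m)
        commonness_score,                # Commonness(e, m)
        float(t in m),                   # MCT(e, m): mention contains title
        float(m in t),                   # TCM(e, m): title contains mention
        float(t == m),                   # TEM(e, m): title equals mention
        len(m.split()) / len(q.split()), # LenRatio(m, q)
        float(t in q),                   # QCT(e, q): query contains title
        float(t == q),                   # TEQ(e, q): title equals query
    ]

# Pointwise LTR: fit a regressor on labeled (m, e, q) feature vectors,
# then rank each query's candidate pairs by predicted score.
def train_ranker(X_train, y_train):
    return GradientBoostingRegressor(n_estimators=500).fit(X_train, y_train)
```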
  17. 17. TEST COLLECTIONS ‣ Y-ERD • Based on the Yahoo Search Query Log to Entities dataset • 2398 queries ‣ ERD-dev • Released as part of the ERD challenge 2014 • 91 queries [Carmel et al. 2014], [Hasibi et al. 2015]
  18. 18. RESULTS Table 4. Candidate entity ranking results on the Y-ERD and ERD-dev datasets. Significance for line i > 1 is tested against lines 1..i-1 (markers omitted here; see the paper):
Method | Y-ERD MAP | Y-ERD R@5 | Y-ERD P@1 | ERD-dev MAP | ERD-dev R@5 | ERD-dev P@1
MLM | 0.7507 | 0.8556 | 0.6839 | 0.7675 | 0.8622 | 0.7333
CMNS | 0.7831 | 0.8230 | 0.7779 | 0.7037 | 0.7222 | 0.7556
MLMcg | 0.8536 | 0.8997 | 0.8280 | 0.8543 | 0.9015 | 0.8444
LTR | 0.8667 | 0.9022 | 0.8479 | 0.8606 | 0.9289 | 0.8222
Table 5. End-to-end performance of ELQ systems on the Y-ERD and ERD-dev query sets. Significance for line i > 1 is tested against lines 1..i-1 (markers omitted here):
Method | Y-ERD Prec | Recall | F1 | Time | ERD-dev Prec | Recall | F1 | Time
MLMcg-Greedy | 0.709 | 0.709 | 0.709 | 0.058 | 0.724 | 0.712 | 0.713 | 0.085
MLMcg-LTR | 0.725 | 0.724 | 0.724 | 0.893 | 0.725 | 0.731 | 0.728 | 1.185
LTR-LTR | 0.731 | 0.732 | 0.731 | 0.881 | 0.758 | 0.748 | 0.753 | 1.185
LTR-Greedy | 0.786 | 0.787 | 0.787 | 0.382 | 0.852 | 0.828 | 0.840 | 0.423
  19. 19. RESULTS (Tables 4 and 5 as above.) Both methods (MLMcg and LTR) are able to find the vast majority of the relevant entities and return them at the top ranks.
  20. 20. SYSTEMATIC INVESTIGATION Step 1: Candidate Entity Ranking. METHODS: (1) Unsupervised (MLMcg) (2) Supervised (LTR). Step 2: Disambiguation. METHODS: (1) Unsupervised (Greedy) (2) Supervised (LTR)
  21. 21. DISAMBIGUATION Goal: identify interpretation set(s) Ei = {(m1, e1), ..., (mk, ek)}, where the mentions of each set are non-overlapping. Example interpretation sets: 1) (new york, NEW YORK), (manhattan, MANHATTAN) 2) (new york pizza, NEW YORK-STYLE PIZZA), (manhattan, MANHATTAN)
  22. 22. UNSUPERVISED [Greedy] (I) Pruning of mention-entity pairs: discard the ones with score < threshold τs; for containment mentions, keep only the highest scoring one. (II) Set generation: add mention-entity pairs, in decreasing order of score, to interpretation sets.
Algorithm 1: Greedy Interpretation Finding (GIF)
Input: ranked list of mention-entity pairs M; score threshold τs
Output: interpretations I = {E1, ..., Em}
1: M′ ← Prune(M, τs)
2: M′ ← PruneContainmentMentions(M′)
3: I ← CreateInterpretations(M′)
4: return I
function CreateInterpretations(M):
  I ← {∅}
  for (m, e) in M do
    h ← 0
    for E in I do
      if ¬hasOverlap(E, (m, e)) then
        E.add((m, e)); h ← 1
    if h == 0 then
      I.add({(m, e)})
  return I
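The pseudocode above ports directly to Python. In this sketch, mentions are represented as (start, end) term offsets so that overlap and containment can be tested; `ranked_pairs` is assumed to be sorted by score in decreasing order, as the algorithm requires.

```python
def overlaps(a, b):
    """True if mention spans a and b share at least one term."""
    return a[0] < b[1] and b[0] < a[1]

def contains(a, b):
    """True if span a strictly contains span b."""
    return a[0] <= b[0] and b[1] <= a[1] and a != b

def gif(ranked_pairs, threshold):
    """Greedy Interpretation Finding: ranked_pairs is a list of
    (mention_span, entity, score) tuples, sorted by decreasing score."""
    # (I) Pruning: drop pairs scoring below the threshold tau_s ...
    pairs = [(m, e, s) for (m, e, s) in ranked_pairs if s >= threshold]
    # ... and for containment mentions keep only the highest scoring one
    kept = []
    for m, e, s in pairs:
        if not any(contains(km, m) or contains(m, km) for km, _, _ in kept):
            kept.append((m, e, s))
    # (II) Set generation: add each pair, in decreasing order of score, to
    # every interpretation it does not overlap; open a new set otherwise.
    interpretations = [set()]
    for m, e, _ in kept:
        placed = False
        for interp in interpretations:
            if not any(overlaps(m, m2) for m2, _ in interp):
                interp.add((m, e))
                placed = True
        if not placed:
            interpretations.append({(m, e)})
    return interpretations
```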
  23. 23. SUPERVISED Collective disambiguation: (I) generate all interpretations (from the top-K mention-entity pairs), (II) collectively select the interpretations with a binary classifier. [Figure: for "new york pizza manhattan", candidate interpretation sets such as {(new york, NEW YORK), (manhattan, MANHATTAN)}, {(new york, NEW YORK CITY), (manhattan, MANHATTAN)}, {(new york pizza, NEW YORK-STYLE PIZZA), (manhattan, MANHATTAN)}, {(new york pizza, NEW YORK-STYLE PIZZA)}, ... are each accepted (✓) or rejected (✗) by the classifier.]
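A sketch of step (I), enumerating candidate interpretations from the top-K ranked pairs. The `max_size` cap is an assumption to keep the enumeration small; each resulting set would then be turned into the Table 3 feature vector (next slide) and accepted or rejected by the binary classifier.

```python
from itertools import combinations

def overlaps(a, b):
    """Span overlap test (same as in the GIF sketch)."""
    return a[0] < b[1] and b[0] < a[1]

def all_interpretations(top_k_pairs, max_size=3):
    """Every non-empty subset of the top-K (mention_span, entity) pairs
    whose mentions are mutually non-overlapping."""
    valid = []
    for size in range(1, max_size + 1):
        for subset in combinations(top_k_pairs, size):
            spans = [m for m, _ in subset]
            if all(not overlaps(a, b) for a, b in combinations(spans, 2)):
                valid.append(set(subset))
    return valid
```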
  24. 24. SUPERVISED Table 3. Feature set used in the supervised disambiguation approach. Type is either query dependent (QD) or query independent (QI).
Set-based features (computed for the entire interpretation set):
‣ CommonLinks(E): number of common links in DBpedia, $\bigcap_{e \in E} out(e)$ (QI)
‣ TotalLinks(E): number of distinct links in DBpedia, $\bigcup_{e \in E} out(e)$ (QI)
‣ J_KB(E): Jaccard similarity based on DBpedia, CommonLinks(E)/TotalLinks(E) (QI)
‣ J_corpora(E)‡: Jaccard similarity based on FACC, $|\bigcap_{e \in E} doc(e)| / |\bigcup_{e \in E} doc(e)|$ (QI)
‣ Rel_MW(E)‡: relatedness similarity [25] according to FACC (QI)
‣ P(E): co-occurrence probability based on FACC, $|\bigcap_{e \in E} doc(e)| / TotalDocs$ (QI)
‣ H(E): entropy of E, $-P(E)\log(P(E)) - (1 - P(E))\log(1 - P(E))$ (QI)
‣ Completeness(E)†: completeness of set E as a graph, $|edges(G_E)| / |edges(K_{|E|})|$ (QI)
‣ LenRatioSet(E, q)§: ratio of mention lengths to the query length, $\sum_{e \in E} |m_e| / |q|$ (QD)
‣ SetSim(E, q): similarity between the query and the entities in the set; Eq. (2) (QD)
Entity-based features (individual entity features, aggregated for all entities in the set):
‣ Links(e): number of entity out-links in DBpedia (QI)
‣ Commonness(e, m): likelihood of entity e being the target link of mention m (QD)
‣ Score(e, q): entity ranking score, obtained from the CER step (QD)
‣ iRank(e, q): inverse of rank, obtained from the CER step: 1/rank(e, q) (QD)
‣ Sim(e, q): similarity between the query and the entity; Eq. (1) (QD)
‣ ContextSim(e, q): contextual similarity between the query and the entity; Eq. (3) (QD)
‡ doc(e) denotes all documents that have a link to entity e
† G_E is a DBpedia subgraph containing only entities from E; K_{|E|} is a complete graph of |E| vertices
§ m_e denotes the mention that corresponds to entity e
  25. 25. SUPERVISED (same feature set) Highlighted: set-based vs. entity-based features, and query independent (QI) features
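As an illustration of the set-based features, a small sketch computing CommonLinks, TotalLinks, and the KB Jaccard from DBpedia out-links, plus the entropy H(E); `out_links` (a mapping from each entity to its set of out-links) is an assumed precomputed structure.

```python
import math

def kb_set_features(entities, out_links):
    """CommonLinks(E), TotalLinks(E), and J_KB(E) from Table 3."""
    link_sets = [out_links[e] for e in entities]
    common = set.intersection(*link_sets)   # CommonLinks(E)
    total = set.union(*link_sets)           # TotalLinks(E)
    j_kb = len(common) / len(total) if total else 0.0
    return {"CommonLinks": len(common), "TotalLinks": len(total), "J_KB": j_kb}

def entropy(p_e):
    """H(E) = -P(E) log P(E) - (1 - P(E)) log(1 - P(E)),
    where P(E) is the co-occurrence probability of the entity set."""
    if p_e <= 0.0 or p_e >= 1.0:
        return 0.0
    return -p_e * math.log(p_e) - (1 - p_e) * math.log(1 - p_e)
```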
  26. 26. RESULTS (Tables 4 and 5 above.) Candidate entity ranking: commonness is a strong performer; combining commonness with MLM in a generative model (MLMcg) delivers excellent performance, with MAP above 0.85 and R@5 around 0.9; LTR brings further slight (for Y-ERD significant) improvements. Disambiguation: LTR-Greedy significantly outperforms the other approaches on both test sets, and is the second most efficient one.
  27. 27. RESULTS Comparison with the top-3 performers of the ERD challenge (on the official ERD test platform). For this comparison, we added spell checking, as this has also been used by the top performing system (SMAPH-2).
Table 6. ELQ results on the official ERD test platform:
Method | F1
LTR-Greedy | 0.699
SMAPH-2 [6] | 0.708
NTUNLP [5] | 0.680
Seznam [10] | 0.669
The LTR-Greedy approach performs on a par with the state-of-the-art systems: a remarkable finding, considering the simplicity of our greedy disambiguation algorithm vs. the considerably more complex solutions employed by others.
  28. 28. FEATURE IMPORTANCE [Figure: most important features used in the supervised approaches, sorted by Gini score. (Left) Candidate entity ranking, with features including Matches, Commonness, MCT, SimQ-content, QCT, Sim, SimQ-wikiLink, TEM, LenRatio, Redirects, Links, Len, SimQ-abstract, SimQ-subject, SimQ-label. (Right) Disambiguation, for LTR-LTR and MLMcg-LTR, with features including Score (avg/min/max), iRank (avg/min/max), Commonness (avg/min), ContextSim (avg/min/max), H, P, LenRatioSet, SetSim.]
  29. 29. WE ASKED… RQ1: If we are to allocate the available processing time between the two steps, which one would yield the highest gain? Answer: Candidate entity ranking is of higher importance than disambiguation. It is more beneficial to perform the (expensive) supervised learning early on, and tackle disambiguation with an unsupervised algorithm.
  30. 30. WE ASKED… RQ2: Which group of features is needed the most for effective entity disambiguation: contextual similarity, interdependence between entities, or both? Answer: Contextual similarity features are the most effective for entity disambiguation. Entity interdependencies are helpful when sufficiently many entities are mentioned in the text, which is not the case for queries.
  31. 31. Efficiency vs. Effectiveness
  32. 32. Code and resources at: http://bit.ly/ecir2017-elq Thank you! Questions?
