Context-Enhanced Adaptive Entity Linking
@giusepperizzo
F. Ilievski, G. Rizzo, M. van Erp, J. Plu, R. Troncy
http://babelfy.org, as of 2016-05-24
Current approaches performing the NEL task
Linguistic approach: a text is parsed by a NER classifier. Entity labels are used to look up resources in a referent KB. A ranking function selects the best match (relatedness, semantic similarity).
End-to-End approach: a dictionary of mentions and links is built from a referent KB. A text is split into n-grams that are used to look up candidate links from the dictionary. A selection function picks the best match (relatedness, semantic similarity, relevance).
Hybrid approach: a combination of both.
Ranking and selection of the candidate links are led by the relatedness of the entities in the knowledge base. When the context is poor, head entities are favoured.
Henry Leland → db:Henry_M._Leland
Lincoln Motor Company → db:Lincoln_Motor_Company
Joe → ?
Lincoln → db:Abraham_Lincoln
Cadillac → db:Cadillac
“Henry Leland … formed the Lincoln Motor Company... Joe drove a Lincoln for the first time in his life”
…
Joe PER NIL
Lincoln PRO db:Lincoln_Motor_Company
Pipeline (diagram): Text as input → General-purpose Hybrid Annotator (Mention Extraction, Resolution and Classification, Candidate Selection) → Reranking with Context
...
Joe PER NIL
Lincoln PRO db:Lincoln_Motor_Company
General-purpose Hybrid Annotator (diagram)
Input: text
Mention Extraction
- Named Entity Recognition
- Proper Noun Extractor
Resolution and Classification
- Longest match
- Entity type propagation to the longest match
Candidate Selection
- Dictionary fuzzy match
- Entity popularity
Output: a candidate list per entity
e1: c1,1, …, c1,10
...
en: cn,1, …, cn,10
General-purpose Hybrid Annotator (I)
Mention extraction: proper nouns as classified by the Stanford POS Tagger (with the english-bidirectional-distsim model) and named entities as classified by the Stanford NERClassifierCombiner (trained on the CoNLL 2003, MUC 6 and MUC 7 corpora).
Resolution and typing
o When one mention is a substring of another, we take the longest:
  POS: (United States, NNPS)
  NER: (United States of America, PLACE)
  Result: United States of America, PLACE
o When part of one mention is a substring of another, we merge them to create a new one:
  POS: (United States, NNPS)
  NER: (States of America, PLACE)
  Result: United States of America, PLACE
Plu et al., Revealing Entities from Textual Documents Using a Hybrid Approach, (ISWC'15) NLP & DBpedia 2015
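The two resolution rules above can be sketched as a single span merge. This is only an illustration, not the annotator's actual code: the `Mention` type, the token-offset representation, and the choice to keep the NER type whenever one of the overlapping mentions comes from the NER extractor are all assumptions.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    start: int   # index of the first token
    end: int     # index one past the last token
    label: str   # POS tag or NER type
    source: str  # "POS" or "NER"


def resolve(mentions: List[Mention]) -> List[Mention]:
    """Merge overlapping mentions into their union span.

    Containment (rule 1) is the special case where the union equals
    the longer mention. The NER label is propagated to the merged
    span when available (assumption).
    """
    merged: List[Mention] = []
    # Sort by start, longest first, so a container precedes its substrings.
    for m in sorted(mentions, key=lambda x: (x.start, -x.end)):
        if merged and m.start < merged[-1].end:  # overlaps the previous span
            prev = merged[-1]
            typed = prev if prev.source == "NER" else m
            merged[-1] = Mention(prev.start, max(prev.end, m.end),
                                 typed.label, typed.source)
        else:
            merged.append(m)
    return merged


def surface(tokens: List[str], m: Mention) -> str:
    """Return the text covered by a mention."""
    return " ".join(tokens[m.start:m.end])


tokens = ["the", "United", "States", "of", "America"]
# Rule 2: partial overlap between the POS and NER mentions.
partial = resolve([Mention(1, 3, "NNPS", "POS"), Mention(2, 5, "PLACE", "NER")])
print(surface(tokens, partial[0]), partial[0].label)
```

Both slide examples reduce to the same operation here, which is why one function covers them.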
General-purpose Hybrid Annotator (II)
Candidate selection: fuzzy string match over an index based on DBpedia 2015-04
o NIL Clustering when no candidates are found: exact match of labels within the boundaries of a sentence.
o Candidate Ranking if multiple candidates are found.
Ranking score (formula not reproduced in this text); symbols:
r(l): the score of the label l
L: the Levenshtein distance
m: the extracted mention
title: the title of the label l
R: the set of redirect pages associated to the label l
D: the set of disambiguation pages associated to the label l
PR: PageRank associated to the label l
a, b and c are weights with the following properties:
● a > b > c
● a + b + c = 1
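The ranking formula itself is not in this text, so the following is a sketch of one combination that is consistent with the listed symbols, not the authors' definition. The assumptions are: string closeness is a Levenshtein distance normalised into a similarity in [0, 1], the best match over R and over D is used, the weighted sum is multiplied by PR, and the weight values 0.5, 0.3 and 0.2 are illustrative only.

```python
from typing import Iterable


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Normalised similarity in [0, 1] (assumption: 1 - distance / longest)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s.lower(), t.lower()) / longest


def best(m: str, labels: Iterable[str]) -> float:
    """Best similarity between the mention and any label, 0 if there are none."""
    return max((similarity(m, x) for x in labels), default=0.0)


def score(m: str, title: str, redirects: Iterable[str],
          disambiguations: Iterable[str], pagerank: float,
          a: float = 0.5, b: float = 0.3, c: float = 0.2) -> float:
    """Hypothetical r(l): weighted string similarity scaled by PageRank."""
    assert a > b > c and abs(a + b + c - 1.0) < 1e-9
    text = (a * similarity(m, title)
            + b * best(m, redirects)
            + c * best(m, disambiguations))
    return text * pagerank
```

With a score of this shape, a popular entity with a high PR value can outrank a less popular one whose strings match the mention about as well, which is the head-entity bias the reranking step is meant to correct.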
Reranking with Context (diagram)
Input: text and the candidate lists
e1: c1,1, …, c1,10
...
en: cn,1, …, cn,10
Output:
e1: c1,1, c1,2, …, c1,10
...
en: cn,1, …, cn,9, cn,10
en+1: cn+1
Reranking with Context
Aim: adapt the linking task to the textual content that is being analysed.
Approach: leverage the genre and topic domain information about the text.
Apply: 4 heuristics (H1, H2, H3, H4) in cascade. They take the form of binary rules.
H1: Order of processing
Process the running text sequentially, starting from the first sentence. Process the title at the end.
Reasoning: the title is typically ambiguous/catchy. The first sentences of an article are written most explicitly.
H2: Coherence
Detect if an entity is co-referential (an abbreviation or a substring) with an entity that occurs previously in the same news article.
Reasoning: once the writer has clearly introduced an entity, she can use abbreviations or more ambiguous ways to refer to it later in the text.
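A minimal sketch of the H2 check, assuming mentions are plain strings and that earlier mentions are kept with their resolved links in order of appearance. The function names, the initials-based abbreviation test, the whole-word substring test and the most-recent-first search are assumptions made for illustration; db:General_Motors is an illustrative link.

```python
from typing import List, Optional, Tuple


def is_abbreviation(short: str, long: str) -> bool:
    """True if `short` matches the initials of the capitalised words in `long`."""
    initials = "".join(w[0] for w in long.split() if w[0].isupper())
    letters = short.replace(".", "")
    return len(letters) > 1 and letters.upper() == initials.upper()


def is_word_substring(short: str, long: str) -> bool:
    """True if `short` occurs in `long` on whole-word boundaries."""
    return f" {short} " in f" {long} "


def coreferent_link(mention: str,
                    earlier: List[Tuple[str, str]]) -> Optional[str]:
    """Return the link of the most recent earlier mention that this one
    abbreviates or is a substring of, or None if the rule does not fire."""
    for surface, link in reversed(earlier):
        if is_word_substring(mention, surface) or is_abbreviation(mention, surface):
            return link
    return None


earlier = [("Henry Leland", "db:Henry_M._Leland"),
           ("Lincoln Motor Company", "db:Lincoln_Motor_Company")]
print(coreferent_link("Lincoln", earlier))
```

On the running example, the later mention "Lincoln" inherits the link of "Lincoln Motor Company", while "Joe" matches nothing and is left to the other heuristics.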
H3: Domain relevance
Use a contextual knowledge base to examine whether a mention has been frequently and dominantly associated with a certain entity within a domain.
Reasoning: it is customary that the entities mentioned in domain-specific text stem from the same domain. Also, within a domain, a mention is typically associated with one dominant entity.
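One way to read "frequently and dominantly" as a binary rule is a pair of thresholds over per-domain counts. The sketch below assumes the contextual knowledge base is a mapping from mention to entity counts. The thresholds `min_count` and `min_share` and the counts in the example are hypothetical, not values from the experiments.

```python
from collections import Counter
from typing import Dict, Optional


def dominant_entity(mention: str,
                    domain_counts: Dict[str, Counter],
                    min_count: int = 5,
                    min_share: float = 0.8) -> Optional[str]:
    """Return the entity the mention is dominantly linked to in the domain,
    or None if the rule does not fire."""
    counts = domain_counts.get(mention)
    if not counts:
        return None
    entity, n = counts.most_common(1)[0]
    total = sum(counts.values())
    # "frequently": enough observations; "dominantly": a large share of them.
    if n >= min_count and n / total >= min_share:
        return entity
    return None


# Hypothetical counts for a car-industry domain.
car_domain = {
    "Lincoln": Counter({"db:Lincoln_Motor_Company": 18, "db:Abraham_Lincoln": 2}),
    "Cadillac": Counter({"db:Cadillac": 3}),
}
print(dominant_entity("Lincoln", car_domain))
```

In this toy domain "Lincoln" resolves to the car maker, while "Cadillac" has too few observations for the rule to fire.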
H4: Semantic Typing
Check whether the semantic type of the entity resolved by H2 or H3 fits the textual context.
Reasoning: the entity should fit the textual context and fulfil a certain role in the text.
Benchmark corpora
MEANTIME*: benchmark the approach with a corpus composed of 4 topic-specific gold standards.
AIDA-YAGO2**: test the generalizability of the approach.
*Minard et al., MEANTIME, the newsreader multilingual event and time corpus. LREC 2016
**Hoffart et al., Robust Disambiguation of Named Entities in Text. EMNLP 2011
Corpora statistics

Corpus              Articles  Tokens  Entities  Links   NILs  Entity types
MEANTIME* airbus          30   3,620       614    414    200             5
MEANTIME* apple           30   3,452       812    525    287             5
MEANTIME* gm              30   3,641       760    526    234             5
MEANTIME* stock           30   3,362       449    331    118             4
AIDA-YAGO2**             231  46,435     5,616  4,485  1,131             4
Experimental results (each cell: P / R / F1)

System              airbus                 apple                  gm                     stock                  AIDA-YAGO2
Hybrid              58.74 / 40.58 / 48     19.78 / 10.09 / 13.37  50.36 / 26.81 / 34.99  59.12 / 32.33 / 41.8   49.14 / 43.41 / 46.1
Hybrid+H1+H2        59.09 / 40.82 / 48.29  19.78 / 10.09 / 13.37  55 / 29.28 / 38.21     59.12 / 32.33 / 41.8   48.67 / 43.01 / 45.67
Hybrid+H1+H3        62.5 / 43.48 / 51.28   20.07 / 10.29 / 13.6   63.54 / 34.79 / 44.96  66.31 / 37.46 / 47.88  57.89 / 52.04 / 54.81
Hybrid+H1+H2+H3     62.15 / 43.24 / 51     20.07 / 10.29 / 13.6   67.36 / 36.88 / 47.67  68.98 / 38.97 / 49.81  57.65 / 51.84 / 54.59
Hybrid+H1+H2+H3+H4  61.46 / 42.75 / 50.43  20.07 / 10.29 / 13.6   62.1 / 33.65 / 43.65   63.1 / 35.65 / 45.56   55.21 / 49.61 / 52.26
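The F1 column can be recomputed from P and R, assuming the usual harmonic mean F1 = 2PR / (P + R). The short check below uses two rows from the results: the Hybrid baseline on airbus (P = 58.74, R = 40.58, reported F1 = 48) and the full cascade on AIDA-YAGO2 (P = 55.21, R = 49.61, reported F1 = 52.26).

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# Hybrid baseline on airbus and full cascade on AIDA-YAGO2.
print(round(f1(58.74, 40.58), 2))  # 48.0
print(round(f1(55.21, 49.61), 2))  # 52.26
```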
Discussion
Reranking with context is effective and improves over the baseline on all corpora.
The improvement also holds on AIDA-YAGO2, even though it stems from a neutral topic domain. This is because MEANTIME and AIDA-YAGO2 share the genre domain, and many of the entities in MEANTIME stem from the neutral domain as well.
With these settings, H1 (Order of Processing) and H3 (Domain Relevance) are the most effective heuristics.
H4 (Semantic Typing) requires further investigation: adding it lowers F1 relative to Hybrid+H1+H2+H3 on airbus, gm, stock and AIDA-YAGO2.
Future Work
Model the genre and topic domains to further contextualize the entity linking, i.e. add more features to improve our adaptive contextual model.
Investigate the dynamic adaptability in different contexts using knowledge bases as inputs.
