Language and Domain Independent Entity Linking with Quantified Collective Validation
1. Language and Domain Independent Entity Linking with Quantified Collective Validation
Han Wang, Jin Guang Zheng, Xiaogang Ma, Peter Fox, and Heng Ji
EMNLP 2015
Presented by: Shuangshuang Zhou
Inui & Okazaki Lab., Tohoku University
# The figures in the slides are borrowed from the authors' paper and poster.
2. An example to explain the task
"One day after being released by the Patriots, Florida-born Caldwell visited the Jets. ..."
"The New York Jets have six receivers on the roster: Cotchery, Coles, ..."
Candidate entities (from the slide's figure): New England Patriots, Reche Caldwell, Jerricho Cotchery, Laveranues Coles, New York Jets
9/12/16 2
3. Motivation and Contribution
u “Most of the previous research extensively exploited the linguistic features of the source documents in a supervised or semi-supervised way.”
u Quantified Collective Validation can be applied to a new language or domain:
u It can work with limited linguistic resources.
u It can conduct a more deliberate study of the KB.
u A collective way of aligning co-occurring mentions to the KB, with a further step that quantitatively differentiates entity relations in the KB.
4. Approaches - Overview
Candidate Ranking (two ranking steps + Quantified Collective Validation):
Salience Ranking (SR): measures candidates' importance without the context, using information entropy.
Context Similarity Ranking (CS): measures the structural similarity between candidate graphs using Jaccard similarity.
Candidate Graph Collective Validation (CV).
5. Approaches – Salience Ranking
Measures a candidate entity's importance without the context, using information entropy. KB facts are stored as (e_h, r, e_t) tuples.
R(c) is the relation set for c in the KB; H(r) is the entropy of relation r; E_t(r) is the tail entity set with c being the head entity and r being the connecting relation in the KB; L(e_t) denotes the cardinality of the tail entity set with e_t being the head entity in the KB. Sa(c) is recursively computed until convergence.
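The salience equations on this slide are images and did not survive extraction. A hedged reconstruction from the definitions above (the exact form of p(e_t) is an assumption, not the paper's verbatim formula):

```latex
% Sketch consistent with the definitions on this slide; p(e_t) is assumed.
\mathrm{Sa}(c) = \sum_{r \in R(c)} H(r),
\qquad
H(r) = -\sum_{e_t \in E_t(r)} p(e_t)\,\log p(e_t),
\qquad
p(e_t) \propto \frac{\mathrm{Sa}(e_t)}{L(e_t)}
```

Under this sketch, the dependence of p(e_t) on Sa(e_t) is what makes Sa(c) recursive and iterated until convergence, matching the slide's description.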
6. Approaches – Context Similarity Ranking
Measures the structural similarity between candidate graphs and the mention graph using Jaccard similarity: whether two co-occurring mentions have their entity referents connected by some relation in the KB.
The more a G_c^i is structurally similar to its G_m, the better the candidates in this G_c^i represent their mentions in G_m.
7. Approaches – Mention Context Graph
G_m is a light-weight source context representation which simply involves mention co-occurrence.
• There will be an edge between two mention vertices if both of them fall into a context window in the source document.
• Two mention vertices will be connected via a dashed edge if they are coreferential but are not located in the same context window.
(Figure: the mention context graph built from the example sentences "One day after being released by the Patriots, Florida-born Caldwell visited the Jets. ..." and "The New York Jets have six receivers on the roster: Cotchery, Coles, ...")
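A minimal sketch of building the G_m described above. The (name, token_position) input format, window size, and coreference pairs are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch of the mention context graph G_m: solid edges for co-occurrence
# within a token window, dashed edges for coreferential mentions outside it.
from itertools import combinations

def build_mention_graph(mentions, coref_pairs=(), window=20):
    """mentions: list of (name, token_position); returns {edge: style}."""
    edges = {}
    for (m1, p1), (m2, p2) in combinations(mentions, 2):
        if abs(p1 - p2) <= window:           # same context window
            edges[frozenset((m1, m2))] = "solid"
    for m1, m2 in coref_pairs:               # coreferential, possibly distant
        edges.setdefault(frozenset((m1, m2)), "dashed")
    return edges

# Toy positions loosely following the slide's example
mentions = [("Patriots", 6), ("Caldwell", 10), ("Jets", 13),
            ("Cotchery", 60), ("Coles", 62)]
graph = build_mention_graph(mentions, coref_pairs=[("Jets", "Cotchery")])
```

Co-occurring pairs such as Caldwell/Jets get solid edges, while the distant but coreference-linked Jets/Cotchery pair gets a dashed edge.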
8. Approaches – KB Graph
G_K is a weighted graph that consists of a set of vertices representing the entities and a set of directed edges labeled with relations between entities.
A “wiki link” relation is added between two entities if one of them appears in the Wikipedia article of the other.
9. Approaches – Candidate Graphs
G_c is a series of graphs, each of which represents a collective linking solution to the given mentions.
• Two vertices are connected if they are also connected in G_K by some relation r and their mentions are connected in G_m. The edge label r is transferred from G_K.
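A sketch of enumerating the candidate graphs G_c^i as described above, one per combination of candidates. The data structures and toy entities are assumptions for illustration:

```python
# Each assignment of one candidate entity per mention yields one candidate
# graph; an edge is kept only if it exists in both G_K and G_m, and the
# relation label is transferred from G_K.
from itertools import product

def candidate_graphs(candidates, kb_edges, mention_edges):
    """candidates: {mention: [entity, ...]}; kb_edges: {(e1, e2): relation};
    mention_edges: set of (m1, m2) pairs from G_m."""
    mentions = list(candidates)
    for combo in product(*(candidates[m] for m in mentions)):
        assignment = dict(zip(mentions, combo))
        edges = {}
        for m1, m2 in mention_edges:
            e1, e2 = assignment[m1], assignment[m2]
            if (e1, e2) in kb_edges:         # edge also present in G_K
                edges[(e1, e2)] = kb_edges[(e1, e2)]
        yield assignment, edges

candidates = {"Caldwell": ["Reche Caldwell", "Andre Caldwell"],
              "Jets": ["New York Jets"]}
kb_edges = {("Reche Caldwell", "New York Jets"): "team"}
graphs = list(candidate_graphs(candidates, kb_edges, {("Caldwell", "Jets")}))
```

The solution that picks Reche Caldwell keeps the labeled "team" edge; the Andre Caldwell solution produces an edgeless graph, which the similarity ranking can then penalize.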
10. Approaches – Context Similarity Ranking
Measures the structural similarity between candidate graphs and the mention graph using Jaccard similarity: whether two co-occurring mentions have their entity referents connected by some relation in the KB.
The more a G_c^i is structurally similar to its G_m, the better the candidates in this G_c^i represent their mentions in G_m.
11. Approaches – Candidate Graph Collective Validation
• Assumption: a “tighter” relation between two candidates is more likely to be an appropriate representation of the relation between their co-occurring mentions in the source context.
• Quantitatively differentiates different types of relations using the calculated relation weights in G_K.
• The validation score combines the salience ranking and the context similarity ranking with a term adding the effects of “tighter” relations.
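The combining equation on this slide is an image; the following is only a sketch of its general shape (an additive combination with toy values), where the coefficients and additive form are assumptions, not the paper's exact formula:

```python
def collective_validation_score(salience_scores, context_similarity,
                                relation_weights):
    """Illustrative combination: candidate saliences, plus the graph-level
    context similarity, plus the weights of the 'tight' relations realized
    in the candidate graph. The additive form is an assumption."""
    return (sum(salience_scores)
            + context_similarity
            + sum(relation_weights))

score = collective_validation_score(
    salience_scores=[0.4, 0.7],     # Sa(c) per candidate (toy values)
    context_similarity=0.5,         # Jaccard of G_c^i vs. G_m (toy value)
    relation_weights=[0.9],         # weight of each realized relation
)
```

The point of the third term is that two candidate graphs with equal structural similarity can still be separated by how tightly their realized relations are weighted in G_K.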
12. Experiments - Generic English Corpora
Experiments on TAC-KBP 2013 linkable mentions.
Baseline: compared with the top 3 supervised and top 3 unsupervised systems from TAC KBP 2013 (Zheng et al., 2014).
Error analysis:
1) context capturing is deficient;
2) simple coreference rules;
3) certain relations are missing in the KB.
13. Experiments - Generic English Corpora
• SR outperforms the best KBP unsupervised system (0.632).
• Although CS did not produce many more correct linking results than SR did, it promoted a great number of good candidates to the top of the ranking list.
• CS is deficient in recognizing the subtle contextual differences among similar candidates (of the same type).
14. Experiments - Generic Chinese Corpora
• Fahrni et al. (2012) used over 20 fine-tuned features and many linguistic resources.
• Error analysis: low recall on mapping candidates between English and Chinese.
15. Experiments – Specific domain
• There is slight improvement in biomedical science because candidates of the related mentions mostly have similar relations in the KB.
• First study on the earth science domain.
• Error analysis:
• There are biased effects caused by the salience ranking when using a generic KB.
• Some relations are not clearly defined in DBpedia.
16. Conclusion and Future work
u QCV has minimal reliance on linguistic analysis and makes deep use of structured KBs.
u They demonstrated a high-performance EL approach that can be migrated to new languages and domains.
u They plan to better extract mention context and to incorporate the impact of more distant KB entities, not just the immediate neighbors.
17. Thoughts (I)
u For conventional collective ranking in unsupervised EL approaches, time complexity is a significant problem: the upper bound of the computing time to link all mentions in a document is O(n_m × n_c × n_nc × n_nm).
u It is worth learning from the intensive analysis they gave of each experimental result.
u Since their method relies less on the source document, it could also be applied to short texts (tweets, search-engine queries).
18. Thoughts (II)
u Their approach may not be effective when there are few co-occurring mentions.
u We would like to see their system's performance across more generic English corpora.
u Their method worked on linkable mentions, but it cannot handle unlinkable mentions (NILs).
u For a new language and a new domain they used the same KB (DBpedia), and their system's performance was affected by the structured KB, so it needs to be verified with new KBs.