PhD Day – 04/2014
Bianca Pereira
The PhD Route
Outline
Literature Review
Define the PhD topic
DEFINING THE TOPIC
Entity Linking is..
“Grounding entity mentions in documents to
Knowledge Base entries”
- TAC-KBP 2009
Entity Resolution
http://en.wikipedia.org/wiki/The_Guardian
http://en.wikipedia.org/wiki/National_Security_Agency
http://en.wikipedia.org/wiki/British_people
http://en.wikipedia.org/wiki/Edward_snowden
PROBLEM SEEKING
Types of Entity
Domains of Knowledge
Methods
Accuracy
Time
Types of Entity
Named Entities, Unnamed Entities
Topics, Classes
Natural Language Processing
Statistics
Entity Linking
Domains of Knowledge
Methods
EVERYTHING!
Natural Language Processing
Statistics
Entity Linking
PROBLEM DEFINITION
Types of Entity
Named Entities: Given by Class
Given by Knowledge Base
Others
Domains of Knowledge
Cross-domain Knowledge Base
Methods
“(…) Collective Inference over a set of entities can lead
to better performance.”
- Stoyanov et al 2012
Named Entity Recognition → Disambiguation
http://en.wikipedia.org/wiki/Michael_Jackson
http://en.wikipedia.org/wiki/Popular_music
http://en.wikipedia.org/wiki/Beat_It
http://en.wikipedia.org/wiki/Billie_Jean
http://en.wikipedia.org/wiki/Thriller_(song)
Collective Inference algorithms are used
for Disambiguation
[Figure: a mention and its candidate entities URI1-URI10 in the Knowledge Base]
A local context is used to compute the
mention-candidate score
[Figure: local-context scoring of one mention-candidate pair]
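A minimal sketch of what such a local score could look like, assuming a simple token overlap between the mention's surrounding text and a candidate description (the second URI and all texts are hypothetical; the cited systems use richer features):

```python
# A minimal sketch (hypothetical data): score a mention-candidate pair by
# token overlap between the mention's local context and a short candidate
# description taken from the Knowledge Base.

def local_score(context_tokens, candidate_description):
    """Jaccard overlap between the mention context and the candidate text."""
    context = {t.lower() for t in context_tokens}
    candidate = set(candidate_description.lower().split())
    if not context or not candidate:
        return 0.0
    return len(context & candidate) / len(context | candidate)

# Example: the mention "Jackson" in a music-related sentence.
context = ["the", "singer", "released", "a", "new", "pop", "album"]
candidates = {
    "http://en.wikipedia.org/wiki/Michael_Jackson": "american singer pop music artist",
    "http://en.wikipedia.org/wiki/Jackson,_Mississippi": "capital city of the state of mississippi",
}
for uri, description in candidates.items():
    print(uri, round(local_score(context, description), 3))
```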
There is coherence between
entities in the same document.
[Figure: coherence edges among the candidate entities (URI1-URI10) of the mentions in a document]
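As an illustration of such a coherence signal, the sketch below computes a Milne and Witten (2008)-style link relatedness from toy in-link sets; the in-link data and the total entity count are made up for the example:

```python
import math

def relatedness(inlinks_a, inlinks_b, total_entities):
    """Milne & Witten (2008)-style relatedness between two entities,
    computed from the sets of KB pages that link to each of them."""
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    a, b = len(inlinks_a), len(inlinks_b)
    return 1.0 - (
        (math.log(max(a, b)) - math.log(len(common)))
        / (math.log(total_entities) - math.log(min(a, b)))
    )

# Toy in-link sets (hypothetical): related entities share incoming links.
inlinks = {
    "URI1": {"p1", "p2", "p3", "p4"},
    "URI2": {"p2", "p3", "p5"},
    "URI3": {"p9"},
}
print(relatedness(inlinks["URI1"], inlinks["URI2"], total_entities=10_000))
print(relatedness(inlinks["URI1"], inlinks["URI3"], total_entities=10_000))
```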
Disambiguation using collective
inference is an NP-hard problem.
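A rough sketch of why exact (exhaustive) collective inference is intractable: the joint assignment space is the Cartesian product of the mentions' candidate sets, so it grows exponentially with the number of mentions. The scoring functions here are placeholders, not any specific system:

```python
from itertools import product

# Exhaustive collective inference: score every joint assignment of candidates
# to mentions and keep the best one. Exponential in the number of mentions.

def best_joint_assignment(candidate_sets, local_score, coherence):
    best, best_value = None, float("-inf")
    for assignment in product(*candidate_sets):   # |C1| * |C2| * ... states
        value = sum(local_score(c) for c in assignment)
        value += sum(coherence(a, b)
                     for i, a in enumerate(assignment)
                     for b in assignment[i + 1:])
        if value > best_value:
            best, best_value = assignment, value
    return best, best_value

# Tiny example with toy scores: two mentions, two candidates each.
local = {"URI1": 0.7, "URI2": 0.3, "URI6": 0.5, "URI7": 0.4}.get
pair = lambda a, b: 1.0 if {a, b} == {"URI1", "URI6"} else 0.0
print(best_joint_assignment([["URI1", "URI2"], ["URI6", "URI7"]], local, pair))

# The state space explodes quickly: 10 mentions x 23 candidates each.
print(f"{23 ** 10:.2e} joint assignments")
```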
[Figure: the candidate space pruned from 230 candidates to 24 candidates]
“The number of contexts [entities] is
overwhelming and had to be reduced to
a manageable size.”
- Cucerzan 2007
“Much speed is gained by imposing a
threshold below which all senses
[candidates] are discarded”
- Milne and Witten 2008
“Inference is NP Hard”
- Kulkarni et al 2009
“(…) exact algorithms on large
input graphs are infeasible.”
- Hoffart et al 2011
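The thresholding strategy quoted above (Milne and Witten 2008) can be sketched as discarding candidates whose prior score falls below a cut-off; the scores and the 0.05 threshold are purely illustrative:

```python
# A minimal sketch of candidate-space pruning at disambiguation time: keep
# only candidates whose prior probability (commonness of the surface form)
# is above a threshold. Numbers are illustrative, not from any system.

def prune(candidates, threshold=0.05):
    """candidates: dict mapping URI -> prior probability for one mention."""
    return {uri: p for uri, p in candidates.items() if p >= threshold}

candidates = {"URI1": 0.62, "URI2": 0.21, "URI3": 0.09, "URI4": 0.05,
              "URI5": 0.02, "URI6": 0.01}
pruned = prune(candidates)
print(len(candidates), "->", len(pruned), "candidates")   # 6 -> 4
```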
Collective Inference - Accuracy
Collective Inference - Time
Using approximation algorithms, the
time becomes suitable for the task
Methods
Recalling
Given by Knowledge Base
Cross-domain Knowledge Base
~ 5 MILLION entities
~ 10 MILLION entities
~ 43 MILLION entities
Problem Statement
The time spent on disambiguation for Entity Linking
increases with the size of the Knowledge Base. This
makes disambiguation with large Knowledge Bases
infeasible.
RELATED WORK
Two solutions for the problem..
1.  Approximation Algorithms
2.  Dimensionality Reduction
Approximation Algorithms
Kulkarni et al 2009, Hoffart et al 2011
Dimensionality Reduction
[Figure: dimensionality reduction of the candidate space from 230 candidates to 24 candidates]
Cucerzan 2007, Milne and Witten 2008, Hoffart et al 2011
Dimensionality Reduction (candidate space)
[Figure: pipeline from Knowledge Base to Algorithm; in related work, dimensionality reduction is applied to the candidate space inside the Algorithm, not to the Knowledge Base]
RESEARCH QUESTIONS
R1. Is it possible to delimit a feasible maximum
amount of time for disambiguation regardless of
the size of the Knowledge Base?
R2. Is it possible to reduce the dimensionality
directly in the Knowledge Base?
R3. Is it feasible to use exact algorithms for
disambiguation using large Knowledge Bases?
HYPOTHESES
R1. Is it possible to delimit a feasible
maximum amount of time for disambiguation
regardless of the size of the Knowledge
Base?
H1. There is a maximum size of candidate
set that allows disambiguation in a feasible
time.
R1. Is it possible to delimit a feasible
maximum amount of time for disambiguation
regardless of the size of the Knowledge
Base?
H2. If the Knowledge Base can be divided
into subsets of constant ambiguity then the
candidate space is constant.
R1. Is it possible to delimit a feasible
maximum amount of time for disambiguation
regardless of the size of the Knowledge
Base?
Subsets of constant ambiguity → constant candidate space
Candidate space = maximum allowed size → feasible time
R2. Is it possible to reduce the
dimensionality directly in the Knowledge
Base?
H3. The relatedness between entities is a
sufficient condition to reduce the
dimensionality without loss of accuracy.
R3. Is it feasible to use exact algorithms
for disambiguation using large Knowledge
Bases?
H4. Decreasing the ambiguity in the
Knowledge Base is less time consuming
than performing it at disambiguation time.
R3. Is it feasible to use exact algorithms
for disambiguation using large Knowledge
Bases?
H5. Exact algorithms can be used in a
feasible time up to a maximum size of
candidate space.
PROPOSED SOLUTION
Ontology Modularization for
Disambiguation in Entity Linking
Ontology Modularization
How to Generate the Modules?
Semantic-Driven Strategies
Depends on the Application.
Structure-Driven Strategies
Graph Decomposition based on inter-relation.
Machine Learning Strategies
Data Mining and Clustering.
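As an example of the structure-driven strategy, the following sketch decomposes a toy entity-relation graph into modules by taking connected components; the graph data is hypothetical, and a real modularization would exploit the ontology structure:

```python
from collections import defaultdict

# A minimal structure-driven sketch: modules are the connected components of
# the entity-relation graph (toy nodes and edges, for illustration only).

def connected_components(edges, nodes):
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, modules = set(), []
    for start in nodes:
        if start in seen:
            continue
        module, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            module.add(node)
            stack.extend(graph[node] - seen)
        modules.append(module)
    return modules

nodes = [f"URI{i}" for i in range(1, 11)]
edges = [("URI1", "URI2"), ("URI2", "URI3"), ("URI4", "URI5"), ("URI6", "URI7")]
print(connected_components(edges, nodes))
```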
EVALUATION
H1. There is a maximum size of candidate set that
allows disambiguation in a feasible time.
Perform an experiment using different
collective inference approaches to discover
how the time increases with the size of the
candidate set.
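A possible shape for this experiment (illustrative only, with a random stand-in for the coherence function) is to time an exhaustive collective-inference step while the candidate set size grows:

```python
import random
import time
from itertools import product

# Time exact (exhaustive) collective inference for growing candidate sets.
# The coherence function is a random stand-in, not a real relatedness measure.

def exact_inference(candidate_sets, coherence):
    best, best_value = None, float("-inf")
    for assignment in product(*candidate_sets):
        value = sum(coherence(a, b)
                    for i, a in enumerate(assignment)
                    for b in assignment[i + 1:])
        if value > best_value:
            best, best_value = assignment, value
    return best

coherence = lambda a, b: random.random()
for size in (2, 4, 6, 8):
    candidate_sets = [list(range(size)) for _ in range(5)]   # 5 mentions
    start = time.perf_counter()
    exact_inference(candidate_sets, coherence)
    print(size, "candidates per mention:",
          round(time.perf_counter() - start, 3), "s")
```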
H2. If the Knowledge Base can be divided in
subsets of constant ambiguity then the candidate
space is constant.
Perform Ontology Modularization
aiming at a maximum ambiguity in each
module.
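One simple way to prototype this, assuming a surface-form index and a greedy assignment policy (both hypothetical), is to cap the number of candidates a surface form may have inside any single module:

```python
# A minimal sketch: split a surface-form index into modules so that, inside
# each module, no surface form has more than MAX_AMBIGUITY candidates.
# Entities are assigned greedily; a real approach would use the ontology.

MAX_AMBIGUITY = 2

def modularize(surface_index):
    """surface_index: dict mapping surface form -> list of candidate URIs."""
    modules = []                     # each module: dict surface form -> URIs
    for form, uris in surface_index.items():
        for uri in uris:
            placed = False
            for module in modules:
                if len(module.get(form, [])) < MAX_AMBIGUITY:
                    module.setdefault(form, []).append(uri)
                    placed = True
                    break
            if not placed:
                modules.append({form: [uri]})
    return modules

index = {"Jackson": ["URI1", "URI2", "URI3", "URI4", "URI5"],
         "Thriller": ["URI6", "URI7"]}
for i, module in enumerate(modularize(index), 1):
    print("module", i, module)
```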
H3. The relatedness between entities is a sufficient
condition to reduce the dimensionality without loss
of accuracy.
Generate the module based on the
same relatedness measure used by the
original method and verify the accuracy.
H4. Decreasing the ambiguity in the Knowledge
Base is less time consuming than performing it at
disambiguation time.
Measure the time for disambiguation when
reducing the dimensionality at disambiguation
time versus using the Modularization approach.
H5. Exact algorithms can be used in a feasible
time up to a maximum size of candidate space.
Select a set of exact algorithms and
measure the time for different sizes of
candidate space.
Next Steps
Doctoral Consortium
TAC-KBP
First Experiments
Use Cases
Thank you!
Bianca Pereira
bianca.pereira@insight-centre.org

PhD Day: Entity Linking using Ontology Modularization