Bianca Pereira
From Entity Recognition to Entity Linking
07/05/2014
Based on the paper “From Entity Recognition to Entity Linking: A Survey of Advanced Entity Linking Techniques” by Dai et al. (2012)
Outline
• Motivation
• Overview of Entity Linking
• Instance-based Entity Linking Approach
• Experiments
• Conclusion
• Analysis of the Paper
• Relation with my PhD
THE BATTLE OF THE BOOGIE
Named Entity Recognition
Source: http://en.wikipedia.org/wiki/Mick_Jackson_(singer) (visited on 06/05/2014)
Overview of Entity Linking
[Figure: Databases, Biomedical, Natural Language Processing (NLP), AI]
Entity Linking
Source: http://en.wikipedia.org/wiki/Mick_Jackson_(singer) (visited on 20/11/2013)
http://www.discogs.com/artist/87624-Mick-Jackson
http://www.discogs.com/artist/643265-Elmar-Krohn?noanv=1
http://www.discogs.com/artist/492396-Dave-Jackson-2?noanv=1
http://www.discogs.com/artist/148391-Sylvester-Levay
http://www.discogs.com/artist/169154-Jacksons-The
Tasks That Inspired Entity Linking
• Link-The-Wiki Track in INEX
• Web People Search Task in SemEval
[Figure: URL1, URL2, …, URLn; Person 1, Person 2, Person 3, Person 4]
Entity Linking Tasks
• Entity Linking in TAC-KBP
• Gene Normalization in BioCreative
[Figure labels: http://www.discogs.com/artist/87624-Mick-Jackson, NIL, Syntenin-1, ID:100754014, ID:6386, mda-9, …]
Problem Definition
• Article-wide Salient Entity Linking Problem
• Article-wide Entity Linking Problem
• Instance-based Entity Linking Problem
Article-wide Salient EL Problem
Source: http://en.wikipedia.org/wiki/Michael_jordan (visited on 20/11/2013)
Article-wide EL Problem
Source: http://en.wikipedia.org/wiki/Blame_It_on_the_Boogie (visited on 06/05/2014)
Instance-based EL Problem
Source: http://en.wikipedia.org/wiki/Blame_It_on_the_Boogie (visited on 06/05/2014)
Instance-based Entity Linking Approach
Challenges
1. Lack of suitable corpus for developing instance-based EL
systems.
2. Lack of context information for disambiguating each
individual instance.
The synthetic replicate of urocortin was found to bind with high
affinity to type 1 and type 2 CRF receptors and, based upon its
anatomic localization within the brain, was proposed to be a
natural ligand for the type 2 CRF receptors.
Classification
• Local Classification
• Relational Classification
[Figure: a node with candidate labels URL1, URL2, URL3, URL4; in the relational case a neighbor labeled URL99 is added]
Collective Classification
[Figure: a node with candidate labels URL1, URL2, URL3, URL4 and neighbors labeled URL35, URL47, URL99, URL15, URL20, URL5, URL13]
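The three flavours of classification can be illustrated with a small sketch. According to the editor's notes, local classification uses the observed features of a node, relational classification adds the observed features of its neighbors, and collective classification also uses the still unobserved labels of its neighbors. The code below is only a toy, greedy re-labelling loop under assumptions of my own: the node names, scores, `compatible` pairs and `bonus` weight are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Set, Tuple


def collective_classify(
    local_scores: Dict[str, Dict[str, float]],
    neighbors: Dict[str, List[str]],
    compatible: Set[Tuple[str, str]],
    bonus: float = 1.0,
    max_rounds: int = 10,
) -> Dict[str, str]:
    """Greedy collective labelling: start local, then let neighbor labels vote."""
    # Local classification: each node takes its best label from its own scores.
    labels = {n: max(sorted(s), key=s.get) for n, s in local_scores.items()}

    # Collective step: re-score every node using the labels currently
    # assigned to its neighbors, and repeat until nothing changes.
    for _ in range(max_rounds):
        changed = False
        for node in sorted(local_scores):
            def score(label: str) -> float:
                support = sum(
                    1
                    for nb in neighbors.get(node, [])
                    if (label, labels[nb]) in compatible
                    or (labels[nb], label) in compatible
                )
                return local_scores[node][label] + bonus * support

            best = max(sorted(local_scores[node]), key=score)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical example: m1 prefers URL1 locally, but its neighbor m2 is
# labeled URL5, which is declared compatible with URL2.
local_scores = {"m1": {"URL1": 0.5, "URL2": 0.4}, "m2": {"URL5": 0.9}}
neighbors = {"m1": ["m2"], "m2": ["m1"]}
compatible = {("URL2", "URL5")}
result = collective_classify(local_scores, neighbors, compatible)
print(result)
```

In this toy run the local decision for m1 is overturned by the label of its neighbor, which is the effect collective classification is meant to capture.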
Collective Entity Disambiguation
1. Discourse Salience
In a given discourse there is precisely one entity that is the center of
attention.
2. Transitivity
If two mentions refer to the same entity, and one mention has been
linked to a database entry, the other should also be linked to the same entry.
Markov Logic Network Formulation
Observed Features
Saliency: Precede(x,y) ^ LinkTo(x,id) ^ Candidate(y,id) => LinkTo(y,id)
Observed Features of the Neighbors
Transitivity: Coreference(x,y) ^ LinkTo(x,id_i) => LinkTo(y,id_i)
Unobserved Features of the Neighbors
Protein-protein interaction: LinkTo(x,id_i) ^ Candidate(y,id_j) ^ PPIPartner(id_i,id_j) => LinkTo(y,id_j)
…Here, we demonstrate that rat syntenin-1, previously published as syntenin-1 (syntenin), mda-9, or TACIP18 in human, is a neurofascin-binding protein that exhibits a widespread tissue expression pattern with a relative maximum in brain. …
[Figure: mentions syntenin-1 and mda-9 with candidate IDs (ID1, ID2, ID3, ID4, ID9)]
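To make the three formulas concrete, here is a minimal sketch that scores every possible linking of a tiny example by the weighted number of satisfied ground formulas and keeps the best one. This is brute-force search over a toy input, not the inference procedure used in the paper. The weights, the candidate IDs, and the simplification that each mention links to exactly one candidate are all assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Links = Dict[str, str]  # mention -> linked ID


def implies(body: bool, head: bool) -> bool:
    """Truth value of the implication body => head."""
    return (not body) or head


def world_score(
    link: Links,
    mentions: List[str],
    candidates: Dict[str, List[str]],
    coref: Set[Tuple[str, str]],
    ppi: Set[Tuple[str, str]],
    weights: Dict[str, float],
) -> float:
    """Sum of weights of all satisfied groundings of the three formulas."""
    all_ids = sorted({i for ids in candidates.values() for i in ids})
    total = 0.0
    for xi, x in enumerate(mentions):
        for yi, y in enumerate(mentions):
            if xi == yi:
                continue
            for id_i in all_ids:
                # Saliency: Precede(x,y) ^ LinkTo(x,id) ^ Candidate(y,id) => LinkTo(y,id)
                body = xi < yi and link[x] == id_i and id_i in candidates[y]
                total += weights["saliency"] * implies(body, link[y] == id_i)
                # Transitivity: Coreference(x,y) ^ LinkTo(x,id_i) => LinkTo(y,id_i)
                body = (x, y) in coref and link[x] == id_i
                total += weights["transitivity"] * implies(body, link[y] == id_i)
                for id_j in all_ids:
                    # PPI: LinkTo(x,id_i) ^ Candidate(y,id_j) ^ PPIPartner(id_i,id_j) => LinkTo(y,id_j)
                    body = (
                        link[x] == id_i
                        and id_j in candidates[y]
                        and (id_i, id_j) in ppi
                    )
                    total += weights["ppi"] * implies(body, link[y] == id_j)
    return total


def map_links(mentions, candidates, coref, ppi, weights) -> Links:
    """Enumerate every assignment and return the highest-scoring one."""
    worlds = (
        dict(zip(mentions, choice))
        for choice in product(*(candidates[m] for m in mentions))
    )
    return max(
        worlds,
        key=lambda w: world_score(w, mentions, candidates, coref, ppi, weights),
    )


# Hypothetical toy input: two coreferent mentions sharing candidate ID2.
mentions = ["syntenin-1", "mda-9"]
candidates = {"syntenin-1": ["ID1", "ID2"], "mda-9": ["ID2", "ID9"]}
coref = {("syntenin-1", "mda-9"), ("mda-9", "syntenin-1")}
ppi: Set[Tuple[str, str]] = set()  # no PPI evidence in this toy example
weights = {"saliency": 1.0, "transitivity": 2.0, "ppi": 0.5}  # made-up weights

best = map_links(mentions, candidates, coref, ppi, weights)
print(best)
```

Enumeration is exponential in the number of mentions, so it only works for illustrations of this size; a real Markov Logic Network would learn the weights and use approximate inference.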
Collective Inference
[Figure: a node with candidate labels URL1, URL2, URL3, URL4 and neighbors labeled URL35, URL47, URL99, URL15, URL20, URL5, URL13]
Joint Inference
…Here, we demonstrate that rat syntenin-1, previously published as syntenin-1 (syntenin), mda-9, or TACIP18 in human, is a neurofascin-binding protein that exhibits a widespread tissue expression pattern with a relative maximum in brain. …
Transitivity: Coreference(x,y) ^ LinkTo(x,id_i) => LinkTo(y,id_i)
[Figure: mentions syntenin-1, syntenin-1, syntenin, mda-9, TACIP18; labels URL5 and “?”]
New Constraints
Transitivity2: Coreference(x,y) ^ LinkTo(x,id_i) ^ ¬∃id_j. LinkTo(y,id_j) => LinkTo(y,id_i)
Coreference(x,y) => SuitablyLink(x) ^ SuitablyLink(y)
LinkTo(x,id) => SuitablyLink(x)
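The Transitivity2 constraint can be read as a propagation rule: a mention inherits the link of a coreferent mention only when it has no link of its own. The sketch below treats the rule as a hard constraint and coreference as symmetric, which are simplifying assumptions. The mention names and IDs are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def propagate_links(
    links: Dict[str, Optional[str]],
    coreference: Iterable[Tuple[str, str]],
) -> Dict[str, Optional[str]]:
    """Copy a link to a coreferent mention only if that mention is unlinked."""
    result = dict(links)
    pairs = list(coreference)
    changed = True
    while changed:
        changed = False
        for x, y in pairs:
            # Coreference is treated as symmetric, so try both directions.
            for src, dst in ((x, y), (y, x)):
                if result.get(src) is not None and result.get(dst) is None:
                    result[dst] = result[src]  # never overwrites an existing link
                    changed = True
    return result


# Hypothetical example: two mentions already linked from their own context,
# two still unlinked (None).
links = {
    "rat syntenin-1": "ID1",
    "syntenin": None,
    "mda-9": "ID2",
    "TACIP18": None,
}
coreference = [
    ("rat syntenin-1", "syntenin"),
    ("mda-9", "TACIP18"),
    ("rat syntenin-1", "mda-9"),
]
resolved = propagate_links(links, coreference)
print(resolved)
```

Because existing links are never overwritten, "mda-9" keeps its own ID even though it is coreferent with "rat syntenin-1". This is the difference from the plain Transitivity rule, which would force both mentions onto the same entry.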
Experiments
Corpus
IGML Corpus (Instance-based Gene Mention Linking)
                                    Training Set   Test Set
Number of articles                  282            262
Number of gene mentions             2,813          3,143
Number of linked Entrez Gene IDs    2,861          3,187
Number of words per article         215.86         228.91
Number of mentions per article      10.01          12.00
Number of words per mention         1.52           1.35
Number of IDs per mention           1.02           1.01
[Figure: mention “Syntenin-1” linked to URL5]
Example mentions: “Human and rat syntenin-1”, “The mammalian syntenin-1”
Corpus – Gene Mention Recognition
Set        Precision   Recall   F-Measure
Training   55.3        83.4     66.5
Test       66.2        82.7     65.1
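For reference, the F-measure column is presumably the balanced F-measure, i.e. the harmonic mean of precision and recall. That is an assumption, since the slides do not state the formula. A minimal sketch, checked against the training row of the table:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Training row of the gene mention recognition table: P = 55.3, R = 83.4
print(round(f_measure(55.3, 83.4), 1))  # prints 66.5
```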
Corpus – Training Set
[Bar chart: Precision, Recall, and F-measure (axis 0 to 90) for Optimal Linking, Best Linking, and Worst Linking]
Corpus – Test Set
[Bar chart: Precision, Recall, and F-measure (axis 0 to 90) for Optimal Linking, Best Linking, and Worst Linking]
Evaluation (P = Precision, R = Recall, F = F-measure)
                                       Training             Test
Feature                                P     R     F        P     R     F
Discourse Salience                     79.2  50.2  61.5     79.5  59.0  67.7
Protein-protein Interaction            79.4  51.1  62.2     80.1  59.8  68.5
Transitivity                           78.5  49.5  60.7     78.6  58.8  67.2
Evaluation (P = Precision, R = Recall, F = F-measure)
                                       Training             Test
                                       P     R     F        P     R     F
Random Baseline                        68.4  51.6  58.8     68.3  59.8  63.8
Collective                             79.1  52.0  62.8     78.4  61.0  68.6
Collective + Filtering                 79.3  52.0  62.9     78.8  61.0  68.8
Individual                             74.9  54.3  62.9     75.7  61.7  68.0
Collective + Individual                74.5  55.7  63.7     74.9  64.8  69.5
Collective + Individual + Filtering    79.9  54.9  65.1     77.8  65.3  71.0
Conclusion
- Overview of Entity Linking
- Why is Instance-based Entity Linking more challenging?
- Suggestion of a solution to the problem
Analysis
Cons
- The results do not lead to any conclusion.
- Too many abbreviations in the paper.
- Does the approach converge to an optimal solution?
- How long does it take to give a solution?
- Is there any case that could not be disambiguated by
human annotators?
Pros
- Outlier
- “The instance-based EL task requires deeper linguistic
analysis and domain dependent knowledge to infer each
instance’s identity.”
[Figure: Databases, Biomedical, Natural Language Processing, AI, Semantic Web]
How is it Related to my PhD?
I am working on the topic of Entity Linking.
• Generic Approach
• Focus on Linguistic Features
• Linked Data as Knowledge Base
• Scalability
Thank you!

Reading Group 2014 (Insight NUIG)

Editor's Notes

  • #8 To improve Information Retrieval, more information about web pages (such as the entities they mention) is indexed, but there are many cases of synonymy and ambiguity.
  • #10 DB: object identification, data de-duplication. AI: entity resolution, name matching. Bio: term identification, mapping, normalization. NLP: co-reference resolution. Others: record linkage, name identity uncertainty, entity disambiguation.
  • #12 The resolution of URLs comes from the DB domain, and the information extracted from text comes from NLP. Entity Linking is the task that fills this gap.
  • #13 SemEval: International Workshop on Semantic Evaluations.
  • #14 SemEval: International Workshop on Semantic Evaluations. The difference: BioCreative includes the NER step, whereas TAC-KBP recognizes NIL.
  • #21 Local: observed features of node n.
  • #22 Local: observed features of node n. Relational: observed features of the nodes in the neighborhood of n.
  • #23 Local: observed features of node n. Relational: observed features of the nodes in the neighborhood of n.
  • #24 Local: observed features of node n. Relational: observed features of the nodes in the neighborhood of n. Collective: unobserved labels (URIs) of the nodes in the neighborhood of n.
  • #30 Resolving each mention separately could lead to a problem: (rat) syntenin-1 and (human) syntenin-1 could be mapped to the same entity.
  • #31 The formula captures the idea that we ask the neighbors for help only if the context does not provide enough information for disambiguation.
  • #32 A mention that was not recognized as a mention before disambiguation is not “suitable for linking” at the disambiguation step.
  • #38 All PubMed abstracts are curated by humans. They applied a mention parser, found the candidates, and tried to get the correct ID from the human annotation. Optimal: take the annotator's ID even when it is not in the candidate set. Best: take the right candidate. Worst: take a wrong candidate.
  • #44 Approximate vs. exact; NP problem vs. time. Example: “I saw David and Victoria walking at Times Square” (David Bowie and Victoria Beckham).
  • #46 DB: object identification, data de-duplication. AI: entity resolution, name matching. Bio: term identification, mapping, normalization. NLP: co-reference resolution. Others: record linkage, name identity uncertainty, entity disambiguation.
  • #47 Generic: do not focus specifically on the biomedical domain and the use of PPI.