Systematic Study of
Long Tail Phenomena in
Entity Linking
Filip Ilievski, Piek Vossen, Stefan Schlobach
Entity Linking (EL)
“Washington announces Alex Smith trade
It seems like months ago that the Chiefs traded Alex Smith to Washington...
Smith, 33, originally entered ...”
(https://profootballtalk.nbcsports.com/2018/03/14/washington-announces-alex-smith-trade/)
surface form
instance
interpretation
State-of-the-art Entity Linking
SotA: High F1-scores by probabilistic optimization
F1-score
=> system skills ??
=> errors ??
~ data properties ??
Head and tail of Entity Linking
Claim: performance (head) >> performance (tail)
(Ilievski et al., 2016; van Erp et al., 2016; Esquivel et al., 2017)
head = ? ∧ tail = ?
=> performance (head) >> performance (tail) ??
=> how to improve performance (tail) ??
Contributions of this work
1. Description and hypotheses on the long tail properties of EL
2. Analysis of EL datasets WRT the long tail properties
3. Analysis of system performance WRT the long tail properties
4. Recommended actions
Definition of long tail properties
Ambiguity of forms
(number of different instances that a form refers to)
“Washington”
Variance of instances
(number of distinct forms that refer to an instance)
“... U.S. federal government” “Washington” “... government of U.S. ...”
Frequency of forms/instances
(number of occurrences in a corpus)
“Washington announces Alex Smith trade
It seems like months ago that the Chiefs traded Alex Smith to Washington.
Smith, 33, originally entered ...”
Popularity of instances
(PageRank in a knowledge graph)
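The three corpus-based properties defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the mention list and the instance identifiers (such as `Washington_(team)`) are hypothetical, and popularity is left out because it requires PageRank scores from a knowledge graph.

```python
from collections import Counter, defaultdict

# Hypothetical gold annotations as (surface form, instance) pairs.
# The instance identifiers are illustrative, not real knowledge graph IRIs.
mentions = [
    ("Washington", "Washington_(team)"),
    ("Washington", "Washington_(team)"),
    ("Washington", "US_federal_government"),
    ("U.S. federal government", "US_federal_government"),
    ("Alex Smith", "Alex_Smith"),
    ("Smith", "Alex_Smith"),
]

def long_tail_stats(mentions):
    """Return ambiguity per form, variance per instance, and both frequency counts."""
    form_freq = Counter(form for form, _ in mentions)
    inst_freq = Counter(inst for _, inst in mentions)
    insts_of_form = defaultdict(set)
    forms_of_inst = defaultdict(set)
    for form, inst in mentions:
        insts_of_form[form].add(inst)
        forms_of_inst[inst].add(form)
    # Ambiguity: number of different instances a form refers to.
    ambiguity = {form: len(insts) for form, insts in insts_of_form.items()}
    # Variance: number of distinct forms that refer to an instance.
    variance = {inst: len(forms) for inst, forms in forms_of_inst.items()}
    return ambiguity, variance, form_freq, inst_freq

ambiguity, variance, form_freq, inst_freq = long_tail_stats(mentions)
```

In this toy corpus the form “Washington” has ambiguity 2 and frequency 3, while the instance `Alex_Smith` has variance 2.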
Hypotheses and setup
16 hypotheses
2 data collections (CoNLL-AIDA and N3), 5 corpora in total
3 SotA systems: AGDISTIS MAG, DBpedia Spotlight, and WAT
Precision, recall and F1-score
Hypotheses on the data properties
Positive correlation between ambiguity and frequency of forms:
amb(f) ~ freq(f)
Positive correlation between variance, frequency, and popularity of instances:
var(i) ~ freq(i)
var(i) ~ pop(i)
freq(i) ~ pop(i)
Zipfian frequency distribution over all forms that refer to an instance:
freq(f|I) ~ zipfian
Zipfian frequency distribution over all instances that a form refers to:
freq(i|F) ~ zipfian
[Slide figures: amb(f) ~ freq(f); var(i) ~ freq(i); freq(i) ~ pop(i); var(i) ~ pop(i)]
[Slide figures: freq(f) ~ zipfian; freq(i) ~ zipfian]
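One way to test the data hypotheses is sketched below. The slides do not name the correlation measure or the Zipf test, so both choices here are assumptions: Spearman rank correlation for the `~` relations, and the least-squares slope of log(frequency) against log(rank) as a rough Zipf indicator (a slope near -1 suggests a Zipfian shape). All function names are hypothetical.

```python
import math

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    ordered = sorted(freqs, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(f) for f in ordered]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den
```

For example, `spearman(amb_values, freq_values)` over per-form ambiguity and frequency lists addresses amb(f) ~ freq(f), and `zipf_slope` applied to the counts of all forms of one instance addresses freq(f|I) ~ zipfian.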
Hypotheses on system performance
Systems perform worse on ambiguous forms (AMF) than on all forms (ALL):
f1(AMF) << f1(ALL)
Best performance on frequent, non-ambiguous forms; worst performance on infrequent, highly ambiguous forms:
f1(freq, ¬amb) = MAX(f1)
f1(¬freq, amb) = MIN(f1)
Performance decreases as the entropy of a form's instance distribution increases:
f1(AMF) ~ ¬entropy(AMF)
Systems perform better on the frequent/popular instances of ambiguous forms than on their infrequent/unpopular instances:
f1(i|F) ~ freq(i|F)
f1(i|F) ~ pop(i|F)
[Slide figure: f1(AMF) << f1(ALL)]
[Slide figure: f1(freq, ¬amb) = MAX(f1); f1(¬freq, amb) = MIN(f1)]
S4: Systems perform better on ambiguous forms with an imbalanced instance distribution than on those with a balanced one.
f1(AMF) ~ ¬entropy(AMF)
[Slide figure: f1(amb) << f1(all); f1(i|F) ~ freq(i|F); f1(i|F) ~ pop(i|F)]
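The entropy in f1(AMF) ~ ¬entropy(AMF) can be illustrated with a short sketch. It assumes Shannon entropy over the instance counts of a form, normalized by the maximum possible entropy so that values fall between 0 (one dominant instance) and 1 (perfectly balanced); the slides do not state whether the entropy is normalized, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def normalized_entropy(counts):
    """Shannon entropy of a count distribution, divided by log2(number of outcomes)."""
    total = sum(counts)
    if len(counts) < 2:
        return 0.0  # a single outcome carries no uncertainty
    h = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return h / math.log2(len(counts))

def entropy_per_ambiguous_form(mentions):
    """Map each ambiguous form (two or more instances) to its normalized entropy."""
    per_form = defaultdict(Counter)
    for form, inst in mentions:
        per_form[form][inst] += 1
    return {
        form: normalized_entropy(list(counter.values()))
        for form, counter in per_form.items()
        if len(counter) > 1
    }
```

A form whose mentions split 9 to 1 between two instances gets a lower entropy than one that splits 5 to 5, so under S4 systems would be expected to do better on the former.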
Recommendations
[Dataset creation]
● statistics on the head and the tail
● most-frequent-value baseline
[Evaluation]
● evaluate on the head and the tail
● use macro F1-score
[System development]
● which heuristics target which cases
● which resources optimize for the head/tail
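Two of these recommendations, the most-frequent-value baseline and the macro F1-score, can be sketched as follows. The sketch assumes that macro-averaging is done over surface forms, so that each form counts once regardless of its frequency; the slides do not specify the averaging unit, and all names and data conventions are hypothetical.

```python
from collections import Counter, defaultdict

def most_frequent_value_baseline(train_mentions):
    """Map each form to the instance it most often refers to in the training data."""
    counts = defaultdict(Counter)
    for form, inst in train_mentions:
        counts[form][inst] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}

def f1_score(gold, predicted):
    """F1 over aligned mention lists; a prediction of None means 'no link'."""
    tp = sum(1 for g, p in zip(gold, predicted) if p is not None and p == g)
    n_pred = sum(1 for p in predicted if p is not None)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_macro_f1(test_mentions, predict):
    """Micro F1 over all mentions, and macro F1 averaged over surface forms."""
    by_form = defaultdict(lambda: ([], []))
    all_gold, all_pred = [], []
    for form, gold in test_mentions:
        pred = predict(form)
        by_form[form][0].append(gold)
        by_form[form][1].append(pred)
        all_gold.append(gold)
        all_pred.append(pred)
    micro = f1_score(all_gold, all_pred)
    macro = sum(f1_score(g, p) for g, p in by_form.values()) / len(by_form)
    return micro, macro
```

With `predict=baseline.get`, where `baseline` is the dictionary returned by `most_frequent_value_baseline`, unseen forms yield `None`. The macro score then penalizes errors on rare forms that a micro score dominated by frequent forms would hide.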
Conclusions
First work that systematically describes the relation between surface forms in EL corpora and their
instances in DBpedia, through long tail properties.
We measured expected inter-correlations between long tail phenomena in EL datasets.
System performance correlates positively with frequency and popularity of instances, and
negatively with ambiguity of forms.
We listed recommended actions to influence future designs of systems and datasets in EL.
Thanks for your attention!
Questions?
Github: cltl/EL-long-tail-phenomena
Twitter: @earthling91
