All That Glitters is not Gold

Kunal Jha¹, Michael Röder¹, Axel-Cyrille Ngonga Ngomo¹,²
¹ AKSW, Leipzig University, Germany
² Data Science Group, University of Paderborn, Germany
All That Glitters is not Gold
Rule-Based Curation of Reference Datasets for
Named Entity Recognition and Entity Linking

30th May 2017, Jha et al.: All That Glitters is not Gold

Outline
• Motivation
• Rule set
• Error types
• Eaglet
• Evaluation
• Summary

Motivation
Annotation of texts (A2KB)
Bosch and Sharp are both home appliances producing companies.
Example from KORE50 dataset

Motivation
Named Entity Recognition
Bosch and Sharp are both home appliances producing companies.
Example from KORE50 dataset

Motivation
Entity Linking
Bosch and Sharp are both home appliances producing companies.
- Bosch → dbr:Robert_Bosch_GmbH
- Sharp → dbr:Sharp_Corporation

Motivation
Evaluation
System = ?
Bosch and Sharp are both home appliances producing companies.
- Sharp → dbr:Sharp
How can we check our gold standards?

Motivation
Approach
• Set of annotation rules for gold standards
• Error types violating these rules
• Tool for the semi-automatic check of gold standards

Rule Set
Assumptions
• A1: A single sentence has a linear structure
Example: Barack and Michelle Obama

Rule Set
Assumptions
• A2: The annotation should cover as many consecutive words as possible
- Name the entity as precisely as possible
Example: legendary cryptanalyst Alan Turing

Rule Set
Assumptions
• A3: Each annotation should be linked to the most precise resource of the KB
Example: 113th United States Congress
- dbr:113th_United_States_Congress
- dbr:United_States_Congress X (not the most precise resource)

Rule Set
Assumptions
• A4: The annotated string should point to a specific entity
• A5: A set of entity types $T_A$ is given to define which entities can be found and which resources of a KB can be used for linking
Example: $T_A = \{\text{dbo:Person}, \text{dbo:Place}, \text{dbo:Organisation}\}$

Rule Set
• R1 dataset and documents
- Each dataset $D$ is a set of documents
- Each document $d$ is an ordered set of words, $d = \{w_1, \ldots, w_n\}$

Rule Set
• R2 words
- Each word $w_i \in d$ is a sequence of characters or digits starting
  • at the beginning of the document or
  • after a whitespace
- and ending
  • at the end of the document or
  • before a whitespace or punctuation character.
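The word definition in R2 can be sketched as a small tokenizer. This is only an illustration under stated assumptions, not Eaglet's implementation: the function name `word_spans` is hypothetical, and `string.punctuation` is assumed as the set of punctuation characters.

```python
import re
import string

# A word starts at the document start or directly after whitespace, and runs
# until the next whitespace or punctuation character (or the document end).
# Assumption: string.punctuation stands in for "punctuation character".
_WORD = re.compile(rf"(?:^|(?<=\s))[^\s{re.escape(string.punctuation)}]+")


def word_spans(document: str) -> list[tuple[int, int, str]]:
    """Return (start, end, text) for every word, with end exclusive."""
    return [(m.start(), m.end(), m.group()) for m in _WORD.finditer(document)]


if __name__ == "__main__":
    for span in word_spans("Müller scored a hattrick against England."):
        print(span)
```

Under this strict reading, a token that directly follows a punctuation character (for example after an opening bracket) is not treated as the start of a word.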

Rule Set
• R3 entities for annotation
- The annotation process relies on a set of entities $E = \{e \mid \tau(e) \cap T_A \neq \emptyset\}$
- $E$ might contain emerging entities (EEs)
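The set definition in R3 maps directly onto a set comprehension. The sketch below reads $\tau(e)$ as the set of types of an entity $e$, which the slide does not spell out; the function name and the toy type assignments are hypothetical and are not taken from DBpedia.

```python
# T_A as given in assumption A5.
TA = {"dbo:Person", "dbo:Place", "dbo:Organisation"}


def annotatable_entities(types_of: dict[str, set[str]], ta: set[str]) -> set[str]:
    """Return E = {e | tau(e) ∩ T_A ≠ ∅}, with tau given as a dict."""
    return {entity for entity, types in types_of.items() if types & ta}


# Illustrative toy data only (assumed types, not looked up in a real KB).
toy_types = {
    "dbr:Robert_Bosch_GmbH": {"dbo:Organisation"},
    "dbr:December": {"ex:Month"},
}

if __name__ == "__main__":
    print(annotatable_entities(toy_types, TA))
```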

Rule Set
• R4 annotation
- $a = (S_a, u_a)$
(a) $S_a$ is a sequence of consecutive words
(b) $u_a$ is a URI that links the sequence to an entity $e = \delta(u_a)$
i. $e$ is the most precise entity possible
ii. that represents $a$ as described in A3

Rule Set
• R5 annotation function $\rho(d, K, E, T_A) = A$ with $A = \{a_1, \ldots, a_m\}$
(a) $\delta(u_{a_i}) \in E$
(b) $\forall a_i, a_j \in A \wedge (S_{a_i}, S_{a_j} \subset d), (S_{a_i} \cap S_{a_j} = \emptyset)$
(c) $A$ has to be complete
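Conditions (a) and (b) of R5 can be checked mechanically on a set of annotations. The sketch below is an illustration only: the `Annotation` class and both function names are hypothetical, $S_a$ is stored as a half-open range of word indices, and $\delta(u_a) \in E$ is simplified to a lookup of the URI in a set of known entity URIs. The disjointness check is applied to pairs of distinct annotations.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Annotation:
    start: int  # index of the first word of S_a in the document
    end: int    # index one past the last word of S_a
    uri: str    # u_a


def unknown_entities(annotations, entity_uris):
    """R5(a): annotations whose URI is not among the known entity URIs."""
    return [a for a in annotations if a.uri not in entity_uris]


def overlapping_pairs(annotations):
    """R5(b): pairs of distinct annotations that share at least one word."""
    return [
        (a, b)
        for a, b in combinations(annotations, 2)
        if a.start < b.end and b.start < a.end
    ]


if __name__ == "__main__":
    # Word indices in "Bosch and Sharp are ...": Bosch=0, and=1, Sharp=2
    bosch = Annotation(0, 1, "dbr:Robert_Bosch_GmbH")
    sharp = Annotation(2, 3, "dbr:Sharp_Corporation")
    print(overlapping_pairs([bosch, sharp]))
```

Condition (c), completeness, cannot be checked from the annotation set alone, since it depends on what is missing.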

Error types
• Positioning error
violates rules R2 + R4(a)
Müller_scored a hattrick against England.
- dbr:Thomas_Müller_(footballer)
- dbr:England_national_football_team
[...], a performance space that opened in 2006 [...]
- dbr:Man
Examples from DBpedia Spotlight and KORE50 datasets
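A positioning check can be sketched by comparing annotation offsets against the word boundaries of R2. The sketch rests on assumptions: the underscore in the first example is read as a trailing space that was included in the annotation, and dbr:Man is read as being attached to the characters "man" inside "performance". The function name and the use of `string.punctuation` are hypothetical choices.

```python
import re
import string

# Word pattern following R2 (assumption: string.punctuation as punctuation set).
_WORD = re.compile(rf"(?:^|(?<=\s))[^\s{re.escape(string.punctuation)}]+")


def is_positioning_error(document: str, start: int, end: int) -> bool:
    """True if [start, end) does not begin at a word start and stop at a word end."""
    matches = list(_WORD.finditer(document))
    word_starts = {m.start() for m in matches}
    word_ends = {m.end() for m in matches}
    return start not in word_starts or end not in word_ends


if __name__ == "__main__":
    doc = "Müller scored a hattrick against England."
    print(is_positioning_error(doc, 0, 7))  # "Müller " with trailing space
    print(is_positioning_error(doc, 0, 6))  # "Müller"
```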

Error types
• Overlapping error
violates rule R5(b)
The only accident engineers said, was when one Google car was rear-ended while stopped at a traffic light.
- dbr:Google_driverless_car
- dbr:Car
Example from DBpedia Spotlight dataset

Error types
• Combined marking
violates rule R4(b)i
In December 2012, [...]
- dbr:December
- dbr:2012
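Candidates for combined marking can be found by looking for annotations that are separated only by whitespace. This is a purely positional heuristic written for illustration, with a hypothetical function name; it is not a description of how Eaglet detects this error type.

```python
def combined_marking_candidates(document, spans):
    """Return neighbouring (start, end, uri) spans separated only by whitespace."""
    ordered = sorted(spans)
    return [
        (left, right)
        for left, right in zip(ordered, ordered[1:])
        if left[1] < right[0] and document[left[1]:right[0]].isspace()
    ]


if __name__ == "__main__":
    doc = "In December 2012, [...]"
    spans = [(3, 11, "dbr:December"), (12, 16, "dbr:2012")]
    print(combined_marking_candidates(doc, spans))
```

A check like this also flags neighbouring annotations that really are distinct entities, so its output is best treated as a list of candidates for a human to review.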

Error types
• Long description error
violates rule R4(b)ii
The car is a project of Google, which has been working in secret but in plain view on vehicles that can drive themselves, [...]
- dbr:Driverless_car

Error types
• Missing entity
violates rule R5(c)
- Inconsistent marking
• URI errors
- Outdated URI: dbr:People’s_Republic_of_China → dbr:China
- Disambiguation URI: dbr:Teresa
- Invalid URI: *null*
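The three URI error types can be told apart with simple lookups once the relevant KB information is available. The sketch below assumes that an outdated URI is one that the KB redirects to another resource and that disambiguation pages are known in advance; the function name and both lookup structures are hypothetical stand-ins for real KB queries.

```python
from typing import Optional


def classify_uri(
    uri: Optional[str],
    redirects: dict[str, str],
    disambiguation_pages: set[str],
) -> str:
    """Return 'invalid', 'outdated', 'disambiguation' or 'ok' for a gold URI."""
    if uri is None or not uri.strip():
        return "invalid"          # e.g. a null entry in the gold standard
    if uri in redirects:
        return "outdated"         # assumption: redirect means outdated
    if uri in disambiguation_pages:
        return "disambiguation"
    return "ok"


# Lookup tables filled with the examples from the slide.
REDIRECTS = {"dbr:People’s_Republic_of_China": "dbr:China"}
DISAMBIGUATION = {"dbr:Teresa"}

if __name__ == "__main__":
    for candidate in [None, "dbr:People’s_Republic_of_China", "dbr:Teresa", "dbr:China"]:
        print(candidate, classify_uri(candidate, REDIRECTS, DISAMBIGUATION))
```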

Eaglet

Evaluation
Errors found

Evaluation
Quality of identified errors
• Only 4 datasets come with a set of types $T_A$
- 25 documents of ACE
- 25 documents of AIDA/CoNLL
- 30 documents of OKE 2015

Evaluation
• URI errors have 0.94 accuracy
• Minor problems with the CM (combined marking) module, e.g. "Steve Pagani VIENNA" (example from AIDA/CoNLL dataset)
Inter-annotator agreement in brackets

Evaluation
Influence on evaluation

Summary
• NER/EL gold standards can contain severe errors
• For the semi-automatic check of gold standards, we developed
- a set of rules
- a tool (will be presented during the poster session)
• We showed that the quality of gold standards has an impact on the evaluation results

Kunal Jha¹, Michael Röder¹, Axel-Cyrille Ngonga Ngomo²
¹ AKSW, Leipzig University, Germany
² Data Science Group, University of Paderborn, Germany
roeder@informatik.uni-leipzig.de
https://github.com/aksw/eaglet
Thanks for your attention!
This work has been supported by the H2020 project HOBBIT (GA no. 688227) as well as the EuroStars projects DIESEL (project no. 01QE1512C) and QAMEL (project no. 01QE1549C).

Evaluation
• Completion module
- 10 annotation systems
- 5 have to “vote” for an annotation to suggest it to the user
- Missed entities were found
  • 74% for ACE2004
  • 92% for AIDA/CoNLL
  • 57% for OKE2015
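The voting step of the completion module can be sketched with a counter over the outputs of the annotation systems. The sketch reads "5 have to vote" as "at least 5 systems", which is an assumption; the function name and the representation of an annotation as a (start, end) tuple are hypothetical.

```python
from collections import Counter


def suggest_missing(system_outputs, gold, min_votes=5):
    """Return annotations proposed by at least min_votes systems but absent from gold."""
    votes = Counter(
        annotation
        for output in system_outputs
        for annotation in set(output)  # each system votes once per annotation
    )
    return {a for a, count in votes.items() if count >= min_votes and a not in gold}


if __name__ == "__main__":
    # Ten hypothetical system outputs over one document.
    outputs = [[(10, 15)]] * 5 + [[(0, 5)]] * 4 + [[]]
    print(suggest_missing(outputs, gold=set()))
```

The suggestions are only shown to the user, who decides whether to add them to the gold standard.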

