Rule-Based Curation of Reference Datasets for Named Entity Recognition and Entity Linking.
Hobbit presentation at ESWC 2017
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
1. Kunal Jha¹, Michael Röder¹, Axel-Cyrille Ngonga Ngomo¹,²
¹ AKSW, Leipzig University, Germany
² Data Science Group, University of Paderborn, Germany
All That Glitters is not Gold
Rule-Based Curation of Reference Datasets for Named Entity Recognition and Entity Linking
2. 30th
May 2017 Jha et al. — All That Glitters is not Gold 2
Outline
Motivation
Rule set
Error types
Eaglet
Evaluation
Summary
Motivation
Annotation of texts (A2KB)
Bosch and Sharp are both home appliances producing companies.
Example from KORE50 dataset
Named Entity Recognition
Entity Linking
Bosch and Sharp are both home appliances producing companies.
- Bosch: dbr:Robert_Bosch_GmbH
- Sharp: dbr:Sharp_Corporation
Evaluation
[Figure: system output compared against the gold standard (System = ?)]
Bosch and Sharp are both home appliances producing companies.
- Sharp: dbr:Sharp
How can we check our gold standards?
Approach
Set of annotation rules for gold standards
Error types violating these rules
Tool for the semi-automatic check of gold standards
Rule Set
Assumptions
A1: A single sentence has a linear structure
Example: Barack and Michelle Obama
A2: The annotation should cover as many consecutive words as possible
- Name the entity as precisely as possible
Example: legendary cryptanalyst Alan Turing
A3: Each annotation should be linked to the most precise resource of the KB
Example: 113th United States Congress
- dbr:113th_United_States_Congress (correct)
- dbr:United_States_Congress (marked X: not the most precise resource)
A4: The annotated string should point to a specific entity
A5: A set of entity types $T_A$ is given to define which entities can be found and which resources of a KB can be used for linking
$T_A$ = {dbo:Person, dbo:Place, dbo:Organisation}
Rule Set
R1 dataset and documents
- Each dataset $D$ is a set of documents
- Each document $d$ is an ordered set of words, $d = \{w_1, \dots, w_n\}$
R2 words
- Each word $w_i \in d$ is a sequence of characters or digits
- starting at the beginning of the document or after a whitespace
- and ending at the end of the document or before a whitespace or punctuation character.
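Rule R2 can be read as a small tokenisation procedure. The following is a minimal sketch, assuming that "punctuation" means the ASCII characters in Python's `string.punctuation` and that "characters or digits" means letters and digits; the helper name `split_words` is hypothetical and the slides do not say how the authors tokenise.

```python
import re
import string

# Punctuation set is an assumption: ASCII punctuation only.
_PUNCT = re.escape(string.punctuation)

# A word starts at the document start or after whitespace, and ends at the
# document end or right before whitespace or a punctuation character.
WORD = re.compile(r"(?<!\S)[^\W_]+(?=\s|[" + _PUNCT + r"]|\Z)")


def split_words(document):
    """Return (word, start, end) triples; end is exclusive."""
    return [(m.group(), m.start(), m.end()) for m in WORD.finditer(document)]


words = split_words("Müller scored a hattrick against England.")
print([w for w, _, _ in words])
```

Read literally, the rule does not start a new word directly after punctuation, so in `"hello,world"` this sketch only yields `hello`.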
R3 entities for annotation
- The annotation process relies on a set of entities $E = \{e \mid \tau(e) \cap T_A \neq \emptyset\}$
- $E$ might contain emerging entities (EEs)
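The definition of $E$ in R3 amounts to a filter over typed entities. A minimal sketch, assuming that $\tau$ maps an entity to its set of types; the type assignments in `TAU` are invented toy data for illustration, not taken from DBpedia or from the slides.

```python
# T_A as given in assumption A5.
T_A = {"dbo:Person", "dbo:Place", "dbo:Organisation"}

# Hypothetical type function tau: entity URI -> set of types (toy data).
TAU = {
    "dbr:Alan_Turing": {"dbo:Person", "dbo:Scientist"},
    "dbr:Sharp_Corporation": {"dbo:Organisation", "dbo:Company"},
    "dbr:December": {"dbo:TimePeriod"},
}


def entity_set(tau, allowed_types):
    """E = {e | tau(e) ∩ T_A ≠ ∅}: keep entities with at least one allowed type."""
    return {e for e, types in tau.items() if types & allowed_types}


E = entity_set(TAU, T_A)
print(sorted(E))
```

With this toy data, `dbr:December` is excluded because none of its types is in $T_A$.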
R4 annotation
- An annotation is a pair $a = (S_a, u_a)$
(a) $S_a$ is a sequence of consecutive words
(b) $u_a$ is a URI that links the sequence to an entity $e = \delta(u_a)$
i. $e$ is the most precise entity possible
ii. that represents $a$ as described in A3
R5 annotation function
- $\rho(d, K, E, T_A) = A$ with $A = \{a_1, \dots, a_m\}$
(a) $\delta(u_{a_i}) \in E$
(b) $\forall a_i, a_j \in A \wedge (S_{a_i}, S_{a_j} \subset d),\ (S_{a_i} \cap S_{a_j} = \emptyset)$
(c) $A$ has to be complete
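Conditions (a) and (b) of R5 can be checked mechanically on a given annotation set. A minimal sketch, assuming annotations are stored as word-index ranges and that $\delta$ defaults to the identity (so `entities` holds URIs); the class `Annotation`, the function `violations`, and the word indices in the example are hypothetical. Condition (c), completeness, is not checked here because it needs knowledge of what is missing.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Annotation:
    start: int  # index of the first word of S_a in d
    end: int    # index one past the last word of S_a
    uri: str    # u_a


def violations(annotations, entities, delta=lambda uri: uri):
    """Return (rule, annotations) pairs for breaches of R5(a) and R5(b)."""
    found = []
    for a in annotations:
        if delta(a.uri) not in entities:          # R5(a): entity must be in E
            found.append(("R5(a)", (a,)))
    for a, b in combinations(annotations, 2):
        if a.start < b.end and b.start < a.end:   # R5(b): spans must not overlap
            found.append(("R5(b)", (a, b)))
    return found


# Assumed spans for "... one Google car was rear-ended ...":
google_car = Annotation(8, 10, "dbr:Google_driverless_car")
car = Annotation(9, 10, "dbr:Car")
print(violations([google_car, car], {"dbr:Google_driverless_car", "dbr:Car"}))
```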
Error types
Positioning error
violates rules 2 + 4(a)
Müller_scored a hattrick against England.
- dbr:Thomas_Müller_(footballer)
- dbr:England_national_football_team
[...], a performance space that opened in 2006 [...]
- dbr:Man
Examples from DBpedia Spotlight and KORE50 datasets
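A positioning error means that an annotation's boundaries do not coincide with word boundaries as defined in R2. A minimal sketch of such a check on character offsets, assuming two readings of the slide that it does not spell out: that the underscore after "Müller" marks a trailing whitespace inside the annotation, and that dbr:Man is attached to the letters "man" inside "performance". The function name `has_positioning_error` is hypothetical.

```python
def has_positioning_error(text, start, end):
    """True if text[start:end] is not aligned with word boundaries."""
    span = text[start:end]
    if not span or span != span.strip():
        return True  # empty span, or leading/trailing whitespace in the span
    # The span begins or ends in the middle of a run of letters/digits.
    starts_inside = start > 0 and text[start - 1].isalnum() and text[start].isalnum()
    ends_inside = end < len(text) and text[end - 1].isalnum() and text[end].isalnum()
    return starts_inside or ends_inside


text = "Müller scored a hattrick against England."
print(has_positioning_error(text, 0, 6))   # "Müller"
print(has_positioning_error(text, 0, 7))   # "Müller " with trailing space

snippet = "a performance space that opened in 2006"
print(has_positioning_error(snippet, 8, 11))  # "man" inside "performance"
```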
Overlapping error
violates rule 5(b)
The only accident engineers said, was when one Google car was rear-ended while stopped at a traffic light.
- dbr:Car
- dbr:Google_driverless_car
Example from DBpedia Spotlight dataset
Combined marking
violates rule 4(b)i
In December 2012, [...]
- dbr:December
- dbr:2012
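Candidates for the combined marking error can be found by looking for neighbouring annotations that could be merged into one. A minimal sketch, assuming the heuristic "two annotations separated only by whitespace"; this heuristic and the function name `combined_marking_candidates` are assumptions for illustration, and each candidate would still need a human decision in a semi-automatic workflow.

```python
def combined_marking_candidates(text, annotations):
    """annotations: (start, end, uri) character spans, end exclusive.

    Returns pairs of neighbouring annotations with only whitespace between them.
    """
    ordered = sorted(annotations)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = text[left[1]:right[0]]
        if gap and gap.isspace():
            pairs.append((left, right))
    return pairs


text = "In December 2012, [...]"
annotations = [(12, 16, "dbr:2012"), (3, 11, "dbr:December")]
print(combined_marking_candidates(text, annotations))
```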
Long description error
violates rule 4(b)ii
The car is a project of Google, which has been working in secret but in plain view on vehicles that can drive themselves, [...]
- dbr:Driverless_car
Example from DBpedia Spotlight dataset
Missing entity
violates rule 5(c)
- Inconsistent marking
URI errors
- Outdated URI: dbr:People’s_Republic_of_China → dbr:China
- Disambiguation URI: dbr:Teresa
- Invalid URI: *null*
Example from DBpedia Spotlight dataset
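The three URI error types can be told apart with simple lookups against the KB. A minimal sketch, assuming the KB exposes a redirect table, a set of disambiguation pages, and a set of valid resources; the three containers below are toy stand-ins built from the slide's examples, not real DBpedia queries, and `classify_uri` is a hypothetical name.

```python
OUTDATED = "dbr:People’s_Republic_of_China"

# Toy stand-ins for KB lookups (assumptions for illustration).
REDIRECTS = {OUTDATED: "dbr:China"}
DISAMBIGUATIONS = {"dbr:Teresa"}
RESOURCES = {"dbr:China", "dbr:Sharp_Corporation"}


def classify_uri(uri, redirects=REDIRECTS, disambiguations=DISAMBIGUATIONS,
                 resources=RESOURCES):
    """Return (error_type, suggested_replacement_or_None)."""
    if uri in redirects:
        return ("outdated", redirects[uri])   # follow the redirect
    if uri in disambiguations:
        return ("disambiguation", None)       # needs a manual choice
    if not uri or uri not in resources:
        return ("invalid", None)              # e.g. a null URI
    return ("ok", None)


for candidate in (OUTDATED, "dbr:Teresa", None, "dbr:Sharp_Corporation"):
    print(candidate, classify_uri(candidate))
```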
Evaluation
Errors found
Quality of identified errors
Only 4 datasets come with a set of types $T_A$
- 25 documents of ACE
- 25 documents of AIDA/CoNLL
- 30 documents of OKE 2015
URI errors have 0.94 accuracy
Minor problems with the CM module, e.g. Steve Pagani VIENNA (example from AIDA/CoNLL dataset)
Interannotator agreement in brackets
Influence on evaluation
Summary
NER/EL gold standards can contain severe errors
For the semi-automatic check of gold standards, we developed
- a set of rules
- a tool (will be presented during the poster session)
We showed that the quality of gold standards has an impact on the evaluation results
44. Kunal Jha¹, Michael Röder¹, Axel-Cyrille Ngonga Ngomo²
¹ AKSW, Leipzig University, Germany
² Data Science Group, University of Paderborn, Germany
roeder@informatik.uni-leipzig.de
https://github.com/aksw/eaglet
Thanks for your attention!
This work has been supported by the H2020 project HOBBIT (GA no. 688227) as well as the EuroStars projects DIESEL (project no. 01QE1512C) and QAMEL (project no. 01QE1549C).
Evaluation
Quality of identified errors
Completion module
- 10 annotation systems
- 5 have to “vote” for an annotation to suggest it to the user
- Missed entities were found
- 74% for ACE2004
- 92% for AIDA/CoNLL
- 57% for OKE2015
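The voting scheme of the completion module can be sketched as a simple count over system outputs. This is a minimal illustration, assuming that "5 have to vote" means at least 5 of the 10 systems, that two systems agree only when they mark exactly the same character span, and that spans already in the gold standard are not suggested again; the function name `suggest_missing` and the example spans are hypothetical.

```python
from collections import Counter


def suggest_missing(system_outputs, gold, min_votes=5):
    """Suggest spans marked by at least `min_votes` systems but absent from gold.

    system_outputs: one iterable of (start, end) spans per annotation system.
    gold: set of (start, end) spans already in the gold standard.
    """
    votes = Counter()
    for output in system_outputs:
        votes.update(set(output))  # each system votes at most once per span
    return sorted(span for span, n in votes.items()
                  if n >= min_votes and span not in gold)


# Ten hypothetical systems: six mark (0, 5), three mark (10, 15), all mark (20, 25).
outputs = [[(0, 5), (20, 25)]] * 6 + [[(10, 15), (20, 25)]] * 3 + [[(20, 25)]]
print(suggest_missing(outputs, gold={(20, 25)}))
```

The suggestions are only proposals to the user; whether a suggested span is really a missing entity remains a manual decision.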