All That Glitters is not Gold
Rule-Based Curation of Reference Datasets for Named Entity Recognition and Entity Linking
Kunal Jha (1), Michael Röder (1), Axel-Cyrille Ngonga Ngomo (1,2)
(1) AKSW, Leipzig University, Germany
(2) Data Science Group, University of Paderborn, Germany
30th May 2017, Jha et al., All That Glitters is not Gold
Outline
• Motivation
• Rule set
• Error types
• Eaglet
• Evaluation
• Summary
Motivation
Annotation of texts (A2KB)
"Bosch and Sharp are both home appliances producing companies."
Example from KORE50 dataset
Named Entity Recognition
[Bosch] and [Sharp] are both home appliances producing companies.
Entity Linking (against the KB)
Bosch → dbr:Robert_Bosch_GmbH
Sharp → dbr:Sharp_Corporation
Evaluation
Is the system's output equal to the gold standard? (System = ?)
"Bosch and Sharp are both home appliances producing companies."
Sharp → dbr:Sharp
How can we check our gold standards?
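To make the comparison between system output and gold standard concrete, here is a minimal sketch of a strict-match precision/recall/F1 computation. The tuple layout (start, end, URI), the character offsets, the function name `prf1`, and the flawed gold standard are assumptions made for illustration; they are not taken from the slides or from any benchmarking tool.

```python
def prf1(system, gold):
    """Strict matching: an annotation counts only if span and URI are identical."""
    system, gold = set(system), set(gold)
    tp = len(system & gold)
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Character offsets into
# "Bosch and Sharp are both home appliances producing companies."
system = {(0, 5, "dbr:Robert_Bosch_GmbH"), (10, 15, "dbr:Sharp_Corporation")}

# Hypothetical gold standard that carries a wrong link for "Sharp".
flawed_gold = {(0, 5, "dbr:Robert_Bosch_GmbH"), (10, 15, "dbr:Sharp")}

scores = prf1(system, flawed_gold)
```

Under these assumptions the system is penalised for a link that is arguably correct, which is the kind of effect a gold standard check is meant to expose.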
Approach
• Set of annotation rules for gold standards
• Error types violating these rules
• Tool for the semi-automatic check of gold standards
Rule Set
Assumptions
• A1: A single sentence has a linear structure
  Example: Barack and Michelle Obama
• A2: The annotation should cover as many consecutive words as possible
  - Name the entity as precisely as possible
  Example: legendary cryptanalyst Alan Turing
• A3: Each annotation should be linked to the most precise resource of the KB
  Example: "113th United States Congress" → dbr:113th_United_States_Congress (not dbr:United_States_Congress)
• A4: The annotated string should point to a specific entity
• A5: A set of entity types $T_A$ is given to define which entities can be found and which resources of a KB can be used for linking
  Example: $T_A$ = {dbo:Person, dbo:Place, dbo:Organisation}
• R1 dataset and documents
  - Each dataset D is a set of documents
  - Each document d is an ordered set of words, $d = \{w_1, \ldots, w_n\}$
• R2 words
  - Each word $w_i \in d$ is a sequence of characters or digits starting
    • at the beginning of the document or
    • after a whitespace
  - and ending
    • at the end of the document or
    • before a whitespace or punctuation character.
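The word definition in R2 can be sketched as a small tokenizer. This is only an illustration: approximating "characters or digits" with the regex class `\w` (which also accepts the underscore) is an assumption, and the names `WORD` and `words` are made up for this example.

```python
import re

# Assumption: "characters or digits" is approximated by \w.
# (?<!\S) means "not preceded by a non-whitespace character", i.e. the match
# starts at the beginning of the document or directly after a whitespace.
# \w+ then stops at the end of the document or before whitespace/punctuation.
WORD = re.compile(r"(?<!\S)\w+")


def words(document):
    """Return (start, end, text) for every word in the sense of R2."""
    return [(m.start(), m.end(), m.group()) for m in WORD.finditer(document)]


tokens = words("Müller scored a hattrick against England.")
```

The offsets returned here are character positions with an exclusive end, a convention chosen for the sketch rather than one stated in the rule set.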
• R3 entities for annotation
  - The annotation process relies on a set of entities $E = \{e \mid \tau(e) \cap T_A \neq \emptyset\}$
  - E might contain emerging entities (EEs)
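The entity set from R3 can be written as a set comprehension. In this sketch the `TYPES` table is a hypothetical stand-in for τ (a real implementation would look the types up in the KB), and `tau` and `allowed_entities` are illustrative names; only the set T_A is taken from A5.

```python
# T_A as given in assumption A5.
T_A = {"dbo:Person", "dbo:Place", "dbo:Organisation"}

# Hypothetical type assignments standing in for tau(e);
# a real check would query the knowledge base instead.
TYPES = {
    "dbr:Robert_Bosch_GmbH": {"dbo:Organisation", "dbo:Company"},
    "dbr:England_national_football_team": {"dbo:Organisation"},
    "dbr:December": set(),
}


def tau(entity):
    """Return the set of types of an entity (empty if unknown)."""
    return TYPES.get(entity, set())


def allowed_entities(candidates, types=T_A):
    """E = {e | tau(e) ∩ T_A ≠ ∅}"""
    return {e for e in candidates if tau(e) & types}


E = allowed_entities(TYPES)
```

With these made-up type assignments, `dbr:December` is excluded from E because it shares no type with T_A.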
• R4 annotation
  - An annotation is a pair $a = (S_a, u_a)$ where
    (a) $S_a$ is a sequence of consecutive words
    (b) $u_a$ is a URI that links the sequence to an entity $e = \delta(u_a)$
        i. e is the most precise entity possible
        ii. that represents a as described in A3
• R5 annotation function $\rho(d, K, E, T_A) = A$ with $A = \{a_1, \ldots, a_m\}$
    (a) $\delta(u_{a_i}) \in E$
    (b) $\forall a_i, a_j \in A \wedge (S_{a_i}, S_{a_j} \subset d),\ (S_{a_i} \cap S_{a_j} = \emptyset)$
    (c) $A$ has to be complete
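Conditions (a) and (b) of R5 lend themselves to a mechanical check; completeness (c) does not, since it needs a reference to compare against. The following is a sketch only: annotations are assumed to be (start word, end word exclusive, URI) triples, the URI is used directly as the entity identifier in place of δ, and `check_r5` and the example set `E` are hypothetical.

```python
from itertools import combinations


def check_r5(annotations, E):
    """Return (annotations violating 5(a), pairs violating 5(b)).

    annotations: list of (start, end, uri) with word indices, end exclusive.
    E: set of URIs that are allowed as link targets.
    """
    not_in_E = [a for a in annotations if a[2] not in E]
    overlapping = [
        (a, b)
        for a, b in combinations(annotations, 2)
        if a[0] < b[1] and b[0] < a[1]  # the two word ranges intersect
    ]
    return not_in_E, overlapping


# Words: The(0) only(1) accident(2) engineers(3) said(4) was(5) when(6)
#        one(7) Google(8) car(9) ...
google_car = (8, 10, "dbr:Google_driverless_car")
car = (9, 10, "dbr:Car")
E = {"dbr:Google_driverless_car"}  # hypothetical entity set

not_in_E, overlapping = check_r5([google_car, car], E)
```

Each pair of annotations is compared once, and a pair is reported when the two word ranges share at least one word.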
Error types
• Positioning error: violates rules 2 + 4(a)
  Examples from DBpedia Spotlight and KORE50 datasets:
  - "Müller_scored a hattrick against England." (dbr:Thomas_Müller_(footballer), dbr:England_national_football_team)
  - "[...], a performance space that opened in 2006 [...]" (dbr:Man)
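A positioning error can be detected by testing whether an annotation starts and ends on word boundaries. The sketch below assumes character offsets with an exclusive end, approximates R2 words with `\w+`, and reads the underscore in the slide as a trailing space that was included in the annotation; the function name and the example spans are illustrative.

```python
import re

# Assumption: R2 words are approximated by \w+ starting at the beginning of
# the text or directly after a whitespace.
WORD = re.compile(r"(?<!\S)\w+")


def positioning_errors(text, spans):
    """Return the spans that do not start and end on a word boundary."""
    matches = list(WORD.finditer(text))
    starts = {m.start() for m in matches}
    ends = {m.end() for m in matches}
    return [(s, e) for s, e in spans if s not in starts or e not in ends]


text = "Müller scored a hattrick against England."
# (0, 7) covers "Müller " including the trailing space; (33, 40) is "England".
bad = positioning_errors(text, [(0, 7), (33, 40)])

# A span inside a word is flagged as well, e.g. "man" within "performance".
inner = positioning_errors("a performance space", [(8, 11)])
```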
• Overlapping error: violates rule 5(b)
  Example from DBpedia Spotlight dataset:
  - "The only accident engineers said, was when one Google car was rear-ended while stopped at a traffic light." (dbr:Google_driverless_car, dbr:Car)
• Combined marking: violates rule 4(b)i
  - "In December 2012, [...]" (dbr:December, dbr:2012)
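One simple way to surface candidates for combined marking is to flag annotations that sit directly next to each other. This is a heuristic sketch, not a description of how Eaglet does it: the (start word, end word exclusive, URI) layout and the function name are assumptions.

```python
def combined_marking_candidates(annotations):
    """Flag pairs of annotations whose word ranges are directly adjacent.

    annotations: iterable of (start_word, end_word_exclusive, uri).
    """
    ordered = sorted(annotations)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if a[1] == b[0]]


# "In December 2012, [...]" -> words: In(0) December(1) 2012(2)
december = (1, 2, "dbr:December")
year = (2, 3, "dbr:2012")
pairs = combined_marking_candidates([year, december])
```

Adjacency alone will over-flag, so such candidates would still need a human decision, in line with the semi-automatic approach.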
• Long description error: violates rule 4(b)ii
  Example from DBpedia Spotlight dataset:
  - "The car is a project of Google, which has been working in secret but in plain view on vehicles that can drive themselves, [...]" (dbr:Driverless_car)
• Missing entity: violates rule 5(c)
  - Inconsistent marking
• URI errors
  - Outdated URI: dbr:People’s_Republic_of_China → dbr:China
  - Disambiguation URI: dbr:Teresa
  - Invalid URI: *null*
  Example from DBpedia Spotlight dataset
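The three kinds of URI error can be told apart with simple lookups. In this sketch the two tables are hypothetical stand-ins filled with the examples from the slide (a real check would ask the KB for redirects and disambiguation pages), and treating empty or whitespace-containing values as invalid is an assumption.

```python
# Hypothetical lookup tables built from the slide's examples.
REDIRECTS = {"dbr:People’s_Republic_of_China": "dbr:China"}
DISAMBIGUATION_PAGES = {"dbr:Teresa"}


def classify_uri(uri):
    """Classify a URI as 'invalid', 'disambiguation', 'outdated' or 'ok'."""
    # Assumption: missing, empty, or whitespace-containing values are invalid.
    if uri is None or not uri.strip() or " " in uri.strip():
        return "invalid"
    if uri in DISAMBIGUATION_PAGES:
        return "disambiguation"
    if uri in REDIRECTS:
        return "outdated"
    return "ok"


results = {
    uri: classify_uri(uri)
    for uri in ["dbr:People’s_Republic_of_China", "dbr:Teresa", None, "dbr:China"]
}
```

For an outdated URI, `REDIRECTS[uri]` gives the replacement that could be suggested to the user.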
Eaglet
Evaluation
Errors found
Quality of identified errors
• Only 4 datasets come with a set of types $T_A$
  - 25 documents of ACE
  - 25 documents of AIDA/CoNLL
  - 30 documents of OKE 2015
• URI errors have 0.94 accuracy
• Minor problems with the combined marking (CM) module, e.g. "Steve Pagani VIENNA" (example from AIDA/CoNLL dataset)
(Interannotator agreement in brackets)
Influence on evaluation
Summary
• NER/EL gold standards can contain severe errors
• For the semi-automatic check of gold standards, we developed
  - a set of rules
  - a tool (will be presented during the poster session)
• We showed that the quality of gold standards has an impact on the evaluation results
Thanks for your attention!
Kunal Jha (1), Michael Röder (1), Axel-Cyrille Ngonga Ngomo (2)
(1) AKSW, Leipzig University, Germany
(2) Data Science Group, University of Paderborn, Germany
roeder@informatik.uni-leipzig.de
https://github.com/aksw/eaglet
This work has been supported by the H2020 project HOBBIT (GA no. 688227) as well as the EuroStars projects DIESEL (project no. 01QE1512C) and QAMEL (project no. 01QE1549C).
Quality of identified errors
• Completion module
  - 10 annotation systems
  - 5 have to "vote" for an annotation to suggest it to the user
  - Missed entities were found
    • 74% for ACE2004
    • 92% for AIDA/CoNLL
    • 57% for OKE2015
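The voting step of the completion module can be sketched with a counter. The threshold of 5 out of 10 systems comes from the slide; representing annotations as (start, end) spans, reading the threshold as "at least 5", and the function name `suggest_missing` are assumptions for illustration.

```python
from collections import Counter


def suggest_missing(system_outputs, gold, min_votes=5):
    """Suggest annotations found by at least min_votes systems but absent from gold.

    system_outputs: one collection of (start, end) spans per annotation system.
    gold: set of (start, end) spans already in the gold standard.
    """
    # set(output) makes sure each system votes at most once per span.
    votes = Counter(span for output in system_outputs for span in set(output))
    return {span for span, n in votes.items() if n >= min_votes and span not in gold}


gold = {(0, 5)}
# Ten hypothetical systems: six find (10, 15), two find (20, 25), two find nothing.
outputs = [{(0, 5), (10, 15)}] * 6 + [{(0, 5), (20, 25)}] * 2 + [set()] * 2
suggestions = suggest_missing(outputs, gold)
```

The suggestions are only proposals; in a semi-automatic workflow the user would still accept or reject each one.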