NLP Data Cleansing Based on Linguistic Ontology
Constraints
Dimitris Kontokostas (1, 3)
Martin Brümmer (1)
Sebastian Hellmann (1, 3)
Jens Lehmann (1)
Lazaros Ioannidis (2)
(1) AKSW, University of Leipzig
(2) Aristotle University of Thessaloniki
(3) DBpedia Association
2014-05-27
Kontokostas et al. (ESWC2014) NLP Data Cleansing 2014-05-27 1 / 33
LOD Cloud (2011)
Linguistic Communities
Linguistic workshops & conferences
Linguistic LOD Cloud (LLOD Cloud)
Problem definition
Linguistic (related) Data
Purpose-driven definition
Increasing data, ontologies & vocabularies
Newcomers → hard to understand the ontologies / follow updates
Validation is essential
Many different pipelines (parsing, annotation, disambiguation, etc.)
Errors are propagated
Partially provided by maintainers (incomplete)
Focus on Lemon & NIF (proof of concept)
Lemon - Lexicon Model for Ontologies
Models lexicon and machine-readable dictionaries
RDF-native form
Linguistically sound structure (LMF)
Separation of the lexicon and ontology layers
Linking to data categories → arbitrarily complex linguistic description
Principle of least power: the less expressive the language, the more reusable the data
http://lemon-model.net/
Lemon - Example
:lexicon a lemon:Lexicon ;
    lemon:entry :Pizza , :Tortilla .
:Pizza a lemon:LexicalEntry ;
    lemon:sense [ lemon:reference
        <http://dbpedia.org/resource/Pizza> ] .
:Tortilla a lemon:LexicalEntry ;
    lemon:sense [ lemon:reference
        <http://dbpedia.org/resource/Tortilla> ] .
Lemon - Example (Correct)
:lexicon a lemon:Lexicon ;
    lemon:language "en" ;
    lemon:entry :Pizza , :Tortilla .
:Pizza a lemon:LexicalEntry ;
    lemon:canonicalForm [
        lemon:writtenRep "Pizza"@en ] ;
    lemon:sense [ lemon:reference
        <http://dbpedia.org/resource/Pizza> ] .
:Tortilla a lemon:LexicalEntry ;
    lemon:canonicalForm [
        lemon:writtenRep "Tortilla"@en ] ;
    lemon:sense [ lemon:reference
        <http://dbpedia.org/resource/Tortilla> ] .
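The difference between the two Lemon examples (a lexicon language and a canonical form per entry) can be phrased as a small structural check. The following is only a minimal Python sketch under stated assumptions: the dictionary layout and the function name `missing_lemon_fields` are hypothetical and are not part of Lemon or of the maintainers' validation script.

```python
def missing_lemon_fields(lexicon):
    """Return a list of human-readable problems found in a lexicon dict.

    The dict layout is an illustrative stand-in for the RDF data.
    """
    problems = []
    # The corrected example adds a language to the lexicon.
    if not lexicon.get("language"):
        problems.append("lexicon: missing lemon:language")
    for name, entry in lexicon.get("entries", {}).items():
        # The corrected example adds a canonical form with a written
        # representation to every lexical entry.
        form = entry.get("canonicalForm") or {}
        if not form.get("writtenRep"):
            problems.append(f"{name}: missing lemon:canonicalForm / lemon:writtenRep")
    return problems


# First example: no language, no canonical forms.
incomplete = {
    "entries": {
        "Pizza": {"sense": "http://dbpedia.org/resource/Pizza"},
        "Tortilla": {"sense": "http://dbpedia.org/resource/Tortilla"},
    }
}

# Corrected example.
complete = {
    "language": "en",
    "entries": {
        "Pizza": {
            "canonicalForm": {"writtenRep": ("Pizza", "en")},
            "sense": "http://dbpedia.org/resource/Pizza",
        },
        "Tortilla": {
            "canonicalForm": {"writtenRep": ("Tortilla", "en")},
            "sense": "http://dbpedia.org/resource/Tortilla",
        },
    },
}

for problem in missing_lemon_fields(incomplete):
    print(problem)
```

Running the check on `complete` yields an empty list, while `incomplete` reports one problem for the lexicon and one per entry.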
NIF - NLP Interchange Format
RDF/OWL-based format that aims to achieve interoperability between
Natural Language Processing (NLP) tools, language resources and
annotations
In a nutshell:
Logical formalisation of strings and annotations
Builds on existing standards, e.g. RDF, LAF/GrAF, RFC 5147
Reuse of RDF tool stack
Decreases development cost for integration
Integrated in:
DBpedia Spotlight, Stanford CoreNLP, OpenNLP, RDFace, Validator,
CoNLL converter, ...
NIF - Overview
NIF - Example
<http://abc.com/doc#char=0,17>
    a nif:Context ;
    a nif:RFC147String ;
    nif:beginIndex 0 ;
    nif:endIndex 17 ;
    nif:isString "My dog likes pizza" .
<http://abc.com/doc#char=2,7>
    a nif:RFC5147String ;
    nif:anchorOf "dog" ;
    nif:referenceContext <http://abc.com/doc#char=0,17> ;
    itsrdf:taClassRef dbo:Animal .
NIF - Example (Correct)
<http://abc.com/doc#char=0,18>
    a nif:Context ;
    a nif:RFC5147String ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "18"^^xsd:nonNegativeInteger ;
    nif:isString "My dog likes pizza"^^xsd:string .
<http://abc.com/doc#char=2,7>
    a nif:RFC5147String ;
    nif:beginIndex "2"^^xsd:nonNegativeInteger ;
    nif:endIndex "7"^^xsd:nonNegativeInteger ;
    nif:anchorOf "dog"^^xsd:string ;
    nif:referenceContext <http://abc.com/doc#char=0,18> ;
    itsrdf:taClassRef dbo:Animal .
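The end index of the context changes from 17 to 18 because the context string has 18 characters. A short Python sketch of that arithmetic follows; the helper name `context_uri` is hypothetical and only mirrors the `#char=begin,end` fragment shape used in the example.

```python
text = "My dog likes pizza"


def context_uri(base, text):
    """Build a char-offset fragment URI that covers the whole string."""
    # Offsets start at 0, so the end offset equals the string length.
    return f"{base}#char=0,{len(text)}"


print(len(text))                                # 18
print(context_uri("http://abc.com/doc", text))  # http://abc.com/doc#char=0,18
```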
Maintainer validation
Lemon
Python script
24 tests for structural criteria
too slow on big datasets
poor reporting
NIF
SPARQL queries
11 tests for common errors
not complete
Built on previous work
Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick
Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland
Cornelissen, and Amrapali J. Zaveri in WWW 2014.
Horizontal, multi-domain data quality assessment
Massive detection of errors for five large-scale LOD data sets
291 vocabularies, independent of their domain or purpose
New contributions:
Relation to OWL reasoners
Test Driven Data Engineering Ontology
Domain-specific validation
Quickly improving existing validation options provided by maintainers
Test-Driven Data Development Methodology
Test case: a data constraint that involves one or more triples
Test suite: a set of test cases for testing a dataset
Status: Success, Fail, Timeout (complexity) or Error (e.g. network)
Fail: Error, warning or notice
RDF: basis for both data and schema
Unified model facilitates automatic test case generation
SPARQL serves as the test case definition language
Example test case
A nif:RFC5147String should never have a nif:beginIndex greater than
nif:endIndex
Test cases are written in SPARQL
SELECT ?s WHERE {
    ?s nif:beginIndex ?v1 .
    ?s nif:endIndex ?v2 .
    FILTER ( ?v1 > ?v2 ) }
We query for errors
Success: Query returns empty result set
Fail: Query returns results
Every result we get is a violation instance
Timeout / Error: needs further investigation on SPARQL Engine
capabilities, query syntax or query complexity
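The success/fail rule can be illustrated without a SPARQL engine. The sketch below is an assumption-laden Python stand-in, not RDFUnit code: triples are plain tuples, each subject is assumed to carry at most one begin and one end index, and the name `run_index_test` is hypothetical. Timeout and Error statuses are not modelled.

```python
def run_index_test(triples):
    """Return (status, violations) for the 'beginIndex > endIndex' test case."""
    begin, end = {}, {}
    for subject, predicate, value in triples:
        if predicate == "nif:beginIndex":
            begin[subject] = value
        elif predicate == "nif:endIndex":
            end[subject] = value
    # Every subject whose begin index exceeds its end index is one
    # violation instance, like one row of the query result.
    violations = sorted(s for s in begin if s in end and begin[s] > end[s])
    # An empty result set means success, anything else is a failure.
    status = "Fail" if violations else "Success"
    return status, violations


# Illustrative data: the second string has its indices swapped.
data = [
    ("doc#char=0,18", "nif:beginIndex", 0),
    ("doc#char=0,18", "nif:endIndex", 18),
    ("doc#char=7,2", "nif:beginIndex", 7),
    ("doc#char=7,2", "nif:endIndex", 2),
]

status, violations = run_index_test(data)
print(status, violations)
```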
Patterns & Bindings
Data Quality Test Patterns (DQTP)
abstract patterns, which can be further refined into concrete data quality
test cases using test pattern bindings
Existing library of 20 patterns
SELECT ?s WHERE {
    ?s %%P1%% ?v1 .
    ?s %%P2%% ?v2 .
    FILTER ( ?v1 %%OP%% ?v2 ) }
Bindings
mapping of variables to valid pattern replacement
P1 = nif:beginIndex | SELECT ?s WHERE {
P2 = nif:endIndex   |     ?s nif:beginIndex ?v1 .
OP = >              |     ?s nif:endIndex ?v2 .
                    |     FILTER ( ?v1 > ?v2 ) }
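Turning a pattern plus bindings into a concrete test case amounts to placeholder substitution. The following Python sketch shows one way to do it, assuming the `%%NAME%%` placeholder syntax from the pattern above; the function name `instantiate` is hypothetical and this is not RDFUnit's actual implementation.

```python
import re

PATTERN = (
    "SELECT ?s WHERE {\n"
    "    ?s %%P1%% ?v1 .\n"
    "    ?s %%P2%% ?v2 .\n"
    "    FILTER ( ?v1 %%OP%% ?v2 ) }"
)


def instantiate(pattern, bindings):
    """Replace every %%NAME%% placeholder with its bound value."""
    def replace(match):
        # A placeholder without a binding raises KeyError, so an
        # incomplete binding never produces a half-filled query.
        return bindings[match.group(1)]
    return re.sub(r"%%(\w+)%%", replace, pattern)


query = instantiate(
    PATTERN,
    {"P1": "nif:beginIndex", "P2": "nif:endIndex", "OP": ">"},
)
print(query)
```

Keeping the pattern and the bindings separate means one pattern can back many test cases, for example the same comparison over a different pair of properties.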
Test Auto Generators (TAGs)
RDF(s) & OWL (partial) support
Query schema for supported axioms
SELECT DISTINCT ?T1 ?T2 WHERE {
    ?T1 owl:disjointWith ?T2 . }
For every result, a binding to a pattern is generated and a test case is
instantiated
Supported axioms at the moment:
RDFS: domain & range
OWL: minCardinality, maxCardinality, cardinality, functionalProperty,
InverseFunctionalProperty, disjointClass, propertyDisjointWith,
AsymmetricProperty and deprecated
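The generator step can be sketched as a loop over the schema query results. In this Python sketch the pattern text for class disjointness, the class names `ex:Cat`, `ex:Dog` and `ex:Plant`, and the function name `generate_tests` are all assumptions made for illustration; they are not taken from the pattern library.

```python
# Hypothetical pattern: an instance typed with two disjoint classes is an error.
DISJOINT_PATTERN = (
    "SELECT DISTINCT ?s WHERE {\n"
    "    ?s rdf:type %%T1%% .\n"
    "    ?s rdf:type %%T2%% . }"
)


def generate_tests(schema_results, pattern=DISJOINT_PATTERN):
    """Create one concrete test query per (T1, T2) row of the schema query."""
    tests = []
    for t1, t2 in schema_results:
        # Each result row becomes one binding, each binding one test case.
        tests.append(pattern.replace("%%T1%%", t1).replace("%%T2%%", t2))
    return tests


# Illustrative rows, as if returned by the owl:disjointWith schema query.
rows = [("ex:Cat", "ex:Dog"), ("ex:Cat", "ex:Plant")]

tests = generate_tests(rows)
print(len(tests))
print(tests[0])
```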
Test Case Elicitation Workflow
TD(D)D vs Reasoners
SPARQL test cases detect a subset of validation errors detectable by
an OWL reasoner. Limited by
SPARQL endpoint reasoning support
limitations of the OWL-to-SPARQL translation.
SPARQL test cases detect validation errors not expressible in OWL
OWL reasoning is often not feasible on large datasets.
Datasets are already deployed and accessible via SPARQL endpoints
Pattern library: a more user-friendly approach to building validation rules
than modelling OWL axioms.
requires familiarity
non-common validations require manual SPARQL test cases
Data Engineering Ontology
Input / Output entirely in RDF
Model the methodology in OWL
test suites, test cases, patterns, auto generators
Strict to serve as a validation layer
Four different levels of error reporting
simple test case report (success, fail) / enriched with counts
violation instance reporting / enriched with annotations
Reuse dcterms, prov, spin, rlog
Data Engineering Ontology - Definition & Generation
Data Engineering Ontology - Result Representation
Lemon & NIF Test case elicitation
RDFUnit Suite implements our methodology
Run on Lemon & NIF ontologies
TAGs could not yet handle some complex owl:Restrictions
owl:unionOf, owl:allValuesFrom, owl:someValuesFrom,
owl:hasSelf and some rdfs:subPropertyOf cases
Manual test cases for constraints not captured in OWL.
Total Domain Range Datatype Card. Disj. Func. I. Func. Manual
Lemon 182 40 34 1 29 64 3 1 10
NIF 96 42 24 4 6 10 10
Example of manual Lemon test case
lemon:narrower denotes that one sense of a word is narrower than the
other and must never be symmetric or contain cycles.
SELECT DISTINCT ?s WHERE {
    ?s lemon:narrower+ ?narrower .
    ?narrower lemon:narrower+ ?s . }
lemon:language must not have a language tag (RDF 1.1 to the rescue)
SELECT DISTINCT ?s WHERE {
    ?s lemon:language ?v1 .
    FILTER ( lang(?v1) != "" ) }
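The first query reports every sense that can reach itself again through one or more lemon:narrower links, which covers both symmetric pairs and longer cycles. A minimal Python sketch of the same check over an in-memory edge list follows; the sense names and the function name `senses_in_cycles` are hypothetical.

```python
def senses_in_cycles(edges):
    """Return the senses that can reach themselves via narrower links."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)

    def reachable(start):
        # Iterative depth-first search over one or more narrower links.
        seen, stack = set(), list(graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen

    return sorted(node for node in graph if node in reachable(node))


# Illustrative edges: a -> b -> c -> a forms a cycle, d -> e does not.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
print(senses_in_cycles(edges))  # ['a', 'b', 'c']
```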
Example of manual NIF test case
Ensure that nif:beginIndex & nif:endIndex are correct
SELECT DISTINCT ?s WHERE {
    ?s nif:anchorOf ?anchorOf ;
       nif:beginIndex ?beginIndex ;
       nif:endIndex ?endIndex ;
       nif:referenceContext
           [ nif:isString ?referenceString ] .
    BIND ( SUBSTR( ?referenceString,
                   ?beginIndex,
                   ( ?endIndex - ?beginIndex ) ) AS ?test ) .
    FILTER ( str(?test) != str(?anchorOf) ) . }
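The idea behind this test is that the anchor text must equal the slice of the context string delimited by the two indices. The Python sketch below illustrates that comparison under the assumption of zero-based offsets, as in the `#char=0,18` context URI; the function name `anchor_matches` and the offsets used are illustrative. Note that Python slices are zero-based while the start position of SPARQL's SUBSTR is one-based, so the two are not interchangeable without care.

```python
def anchor_matches(context, begin, end, anchor):
    """Check that context[begin:end] is exactly the anchor text."""
    return context[begin:end] == anchor


context = "My dog likes pizza"

# "dog" occupies the zero-based offsets 3 to 6 in this string.
print(anchor_matches(context, 3, 6, "dog"))   # True
# Offsets that point at a different span are reported as a mismatch.
print(anchor_matches(context, 0, 2, "dog"))   # False
```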
Evaluation Datasets
Name | Description | Ontology | Type
lemon datasets
LemonUby Wiktionary EN | Conversion of the English Wiktionary into UBY-LMF model | lemon, UBY-LMF | Dictionary
LemonUby Wiktionary DE | Conversion of the German Wiktionary into UBY-LMF model | lemon, UBY-LMF | Dictionary
LemonUby Wordnet | Conversion of the Princeton WordNet 3.0 into UBY-LMF model | lemon, UBY-LMF | WordNet
DBpedia Wiktionary | Conversion of the English Wiktionary into lemon | lemon | Dictionary
QHL | Multilingual translation graph from more than 50 lexicons | lemon | Dictionary
NIF datasets
Wikilinks | Sample of 60976 randomly selected phrases linked to Wikipedia articles | NIF | NER
DBpedia Spotlight dataset | 58 manually NE annotated natural language sentences | NIF | NER
KORE 50 evaluation dataset | 50 NE annotated natural language sentences from the AIDA corpus | NIF | NER
News-100 | 100 manually annotated German news articles | NIF | NER
RSS-500 | 500 manually annotated sentences from 1,457 RSS feeds | NIF | NER
Reuters-128 | 128 news articles manually curated | NIF | NER
Evaluation results
Dataset | Size | SC | FL | TO | ER | Auto Errors | Man Errors | MWarn | MInfo
WiktDBp | 60M | 177 | 5 | - | - | 3,746,103 | 7,521,791 | - | 3,582,837
WktEN | 8M | 168 | 14 | - | - | 752,018 | 394,766 | - | 633,270
WktDE | 2M | 170 | 12 | - | - | 273,109 | 66,268 | - | 155,598
Wordnet | 4M | 166 | 16 | - | - | 257,228 | 36 | - | 257,204
QHL | 3M | 170 | 11 | - | 1 | 433,118 | 538,933 | - | 538,016
Wikilinks | 0.6M | 91 | 4 | - | 1 | 141,528 | 21,246 | - | -
News-100 | 13K | 91 | 2 | - | 3 | 3,510 | - | - | -
RSS-500 | 10K | 91 | 2 | - | 3 | 3,000 | - | - | -
Reuters-128 | 7K | 91 | 2 | - | 3 | 2,016 | - | - | -
Spotlight | 3K | 92 | 3 | - | 1 | 662 | 68 | - | -
KORE50 | 2K | 89 | 6 | - | 1 | 301 | 55 | - | -
Conclusion
Extended a previously introduced methodology for test-driven quality
assessment
Data engineering ontology
Devised 277 test cases for NLP datasets using the Lemon and NIF
vocabularies
Revealed a substantial number of errors for Lemon & NIF datasets
Future directions
extend the test cases to more NLP ontologies (MARL, NERD, ITSRDF)
automatic dependencies between test cases
wrap RDFUnit for NLP services (integrated in NIF)
Thank you!
Dimitris Kontokostas
With kind support of
John McCrae (Lemon model)
http://rdfunit.aksw.org
http://github.com/AKSW/RDFUnit
#eswc2014kontokostas

 
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
 
Data quality in Real Estate
Data quality in Real EstateData quality in Real Estate
Data quality in Real Estate
 
8th DBpedia meeting / California 2016
8th DBpedia meeting /  California 20168th DBpedia meeting /  California 2016
8th DBpedia meeting / California 2016
 
Semantically enhanced quality assurance in the jurion business use case
Semantically enhanced quality assurance in the jurion  business use caseSemantically enhanced quality assurance in the jurion  business use case
Semantically enhanced quality assurance in the jurion business use case
 
Graph databases & data integration - the case of RDF
Graph databases & data integration - the case of RDFGraph databases & data integration - the case of RDF
Graph databases & data integration - the case of RDF
 
DBpedia past, present & future
DBpedia past, present & futureDBpedia past, present & future
DBpedia past, present & future
 
DBpedia+ / DBpedia meeting in Dublin
DBpedia+ / DBpedia meeting in DublinDBpedia+ / DBpedia meeting in Dublin
DBpedia+ / DBpedia meeting in Dublin
 
DBpedia ♥ Commons
DBpedia ♥ CommonsDBpedia ♥ Commons
DBpedia ♥ Commons
 
DBpedia Viewer - LDOW 2014
DBpedia Viewer - LDOW 2014DBpedia Viewer - LDOW 2014
DBpedia Viewer - LDOW 2014
 
DBpedia i18n - Amsterdam Meeting (30/01/2014)
DBpedia i18n - Amsterdam Meeting (30/01/2014)DBpedia i18n - Amsterdam Meeting (30/01/2014)
DBpedia i18n - Amsterdam Meeting (30/01/2014)
 

Recently uploaded

Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Game Development with Unity3D (Game Development lecture 3)
Game Development  with Unity3D (Game Development lecture 3)Game Development  with Unity3D (Game Development lecture 3)
Game Development with Unity3D (Game Development lecture 3)
abdulrafaychaudhry
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Nidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, TipsNidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, Tips
vrstrong314
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 

Recently uploaded (20)

Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Game Development with Unity3D (Game Development lecture 3)
Game Development  with Unity3D (Game Development lecture 3)Game Development  with Unity3D (Game Development lecture 3)
Game Development with Unity3D (Game Development lecture 3)
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Nidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, TipsNidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, Tips
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 

NLP Data Cleansing Based on Linguistic Ontology Constraints

  • 1. NLP Data Cleansing Based on Linguistic Ontology Constraints. Dimitris Kontokostas (1,3), Martin Brümmer (1), Sebastian Hellmann (1,3), Jens Lehmann (1), Lazaros Ioannidis (2). (1) AKSW, University of Leipzig; (2) Aristotle University of Thessaloniki; (3) DBpedia Association. 2014-05-27. Kontokostas et al. (ESWC2014) NLP Data Cleansing 2014-05-27 1 / 33
  • 2. LOD Cloud (2011)
  • 3. LOD Cloud (2011)
  • 4. Linguistic Communities
  • 5. Linguistic workshops & conferences
  • 6. Linguistic workshops & conferences
  • 7. Linguistic LOD Cloud (LLOD Cloud)
  • 8. Problem definition: Linguistic (related) data; purpose-driven definition; increasing data, ontologies & vocabularies; newcomers → hard to understand the ontologies / follow updates. Validation is essential: many different pipelines (parsing, annotation, disambiguation, etc.); errors are propagated; partially provided by maintainers (incomplete). Focus on Lemon & NIF (proof of concept).
  • 9. Lemon - Lexicon Model for Ontologies: models lexicon and machine-readable dictionaries; RDF-native form; linguistically sound structure (LMF); separation of the lexicon and ontology layers; linking to data categories → arbitrarily complex linguistic description; principle of least power: the less expressive the language, the more reusable the data. http://lemon-model.net/
  • 10. Lemon - Example
        :lexicon a lemon:Lexicon ;
            lemon:entry :Pizza , :Tortilla .
        :Pizza a lemon:LexicalEntry ;
            lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .
        :Tortilla a lemon:LexicalEntry ;
            lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .
  • 11. Lemon - Example (Correct)
        :lexicon a lemon:Lexicon ;
            lemon:language "en" ;
            lemon:entry :Pizza , :Tortilla .
        :Pizza a lemon:LexicalEntry ;
            lemon:canonicalForm [ lemon:writtenRep "Pizza"@en ] ;
            lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .
        :Tortilla a lemon:LexicalEntry ;
            lemon:canonicalForm [ lemon:writtenRep "Tortilla"@en ] ;
            lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .
  • 12. NIF - NLP Interchange Format: RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. In a nutshell: logical formalisation of strings and annotations; builds on existing standards, e.g. RDF, LAF/GrAF, RFC 5147; reuse of the RDF tool stack; decreases development cost for integration. Integrated in: DBpedia Spotlight, Stanford CoreNLP, OpenNLP, RDFace, Validator, CoNLL converter, ...
  • 13. NIF - Overview
  • 14. NIF - Example
        <http://abc.com/doc#char=0,17>
            a nif:Context ;
            a nif:RFC147String ;
            nif:beginIndex 0 ;
            nif:endIndex 17 ;
            nif:isString "My dog likes pizza" .
        <http://abc.com/doc#char=2,7>
            a nif:RFC5147String ;
            nif:anchorOf "dog" ;
            nif:referenceContext <http://abc.com/doc#char=0,17> .
            itsrdf:taClassRef dbo:Animal ;
  • 15. NIF - Example (Correct)
        <http://abc.com/doc#char=0,18>
            a nif:Context ;
            a nif:RFC5147String ;
            nif:beginIndex "0"^^xsd:nonNegativeInteger ;
            nif:endIndex "18"^^xsd:nonNegativeInteger ;
            nif:isString "My dog likes pizza"^^xsd:string .
        <http://abc.com/doc#char=2,7>
            a nif:RFC5147String ;
            nif:beginIndex "2"^^xsd:nonNegativeInteger ;
            nif:endIndex "7"^^xsd:nonNegativeInteger ;
            nif:anchorOf "dog"^^xsd:string ;
            nif:referenceContext <http://abc.com/doc#char=0,27> .
            itsrdf:taClassRef dbo:Animal ;
  • 16. Maintainer validation. Lemon: Python script; 24 tests for structural criteria; too slow on big datasets; poor reporting. NIF: SPARQL queries; 11 tests for common errors; not complete.
  • 17. Built on previous work: "Test-driven evaluation of linked data quality", Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri, in WWW 2014. Horizontal, multi-domain data quality assessment; massive detection of errors for five large-scale LOD data sets; 291 vocabularies, independent of their domain or purpose. New contributions: relation to OWL reasoners; Test Driven Data Engineering Ontology; domain-specific validation; quickly improving existing validation options provided by maintainers.
  • 18. Test-Driven Data Development Methodology. Test case: a data constraint that involves one or more triples. Test suite: a set of test cases for testing a dataset. Status: Success, Fail, Timeout (complexity) or Error (e.g. network). Fail: error, warning or notice. RDF: basis for both data and schema. Unified model facilitates automatic test case generation. SPARQL serves as the test case definition language.
  • 19. Example test case: a nif:RFC5147String should never have a nif:beginIndex greater than nif:endIndex. Test cases are written in SPARQL:
        SELECT ?s WHERE {
          ?s nif:beginIndex ?v1 .
          ?s nif:endIndex ?v2 .
          FILTER ( ?v1 > ?v2 ) }
      We query for errors. Success: query returns an empty result set. Fail: query returns results; every result we get is a violation instance. Timeout / Error: needs further investigation of SPARQL engine capabilities, query syntax or query complexity.
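The success/fail reading of a test case on slide 19 can be illustrated without a SPARQL engine. The following is a minimal sketch in plain Python, assuming triples are held as `(subject, predicate, object)` tuples with prefixed names as strings; the helper names `begin_after_end` and `status` and the `ex:` subjects are hypothetical and not part of RDFUnit.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, object]  # (subject, predicate, object); illustrative only


def begin_after_end(triples: Iterable[Triple]) -> List[str]:
    """Return subjects whose nif:beginIndex is greater than their nif:endIndex."""
    begin, end = {}, {}
    for s, p, o in triples:
        if p == "nif:beginIndex":
            begin[s] = o
        elif p == "nif:endIndex":
            end[s] = o
    # Every returned subject corresponds to one violation instance.
    return sorted(s for s in begin if s in end and begin[s] > end[s])


def status(violations: List[str]) -> str:
    """Empty result set means Success; any result means Fail."""
    return "Success" if not violations else "Fail"


# Hypothetical sample data: ex:b has its indices the wrong way round.
data = [
    ("ex:a", "nif:beginIndex", 2), ("ex:a", "nif:endIndex", 7),
    ("ex:b", "nif:beginIndex", 9), ("ex:b", "nif:endIndex", 4),
]
violations = begin_after_end(data)
print(status(violations), violations)
```

Timeout and Error are deliberately left out of this sketch, since they depend on the engine rather than on the data.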
  • 20. Patterns & Bindings. Data Quality Test Patterns (DQTP): abstract patterns, which can be further refined into concrete data quality test cases using test pattern bindings. Existing library of 20 patterns.
        SELECT ?s WHERE {
          ?s %%P1%% ?v1 .
          ?s %%P2%% ?v2 .
          FILTER ( ?v1 %%OP%% ?v2 ) }
      Bindings: mapping of variables to valid pattern replacements.
        P1 = nif:beginIndex
        P2 = nif:endIndex
        OP = >
      Resulting test case:
        SELECT ?s WHERE {
          ?s nif:beginIndex ?v1 .
          ?s nif:endIndex ?v2 .
          FILTER ( ?v1 > ?v2 ) }
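The pattern-plus-binding step on slide 20 amounts to placeholder substitution. A minimal sketch in Python, assuming placeholders are written `%%NAME%%` as on the slide; the `instantiate` function is hypothetical and only illustrates the idea, not how RDFUnit implements it.

```python
import re

# Pattern text as shown on the slide, with %%...%% placeholders.
PATTERN = (
    "SELECT ?s WHERE { ?s %%P1%% ?v1 . ?s %%P2%% ?v2 . "
    "FILTER ( ?v1 %%OP%% ?v2 ) }"
)


def instantiate(pattern: str, bindings: dict) -> str:
    """Replace every %%NAME%% placeholder with its bound value."""
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            # Refuse to emit a half-instantiated query.
            raise KeyError(f"no binding for placeholder {name}")
        return bindings[name]

    return re.sub(r"%%(\w+)%%", substitute, pattern)


query = instantiate(
    PATTERN, {"P1": "nif:beginIndex", "P2": "nif:endIndex", "OP": ">"}
)
print(query)
```

Raising on a missing binding is a design choice of this sketch: a query with a leftover placeholder would otherwise surface later as a syntax error on the SPARQL engine.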
  • 21. Test Auto Generators (TAGs). RDF(S) & OWL (partial) support. Query the schema for supported axioms:
        SELECT DISTINCT ?T1 ?T2 WHERE {
          ?T1 owl:disjointWith ?T2 . }
      For every result, a binding to a pattern is generated & a test case instantiated. Supported axioms at the moment: RDFS: domain & range. OWL: minCardinality, maxCardinality, cardinality, functionalProperty, InverseFunctionalProperty, disjointClass, propertyDisjointWith, AsymmetricProperty and deprecated.
  • 22. Test Case Elicitation Workflow
  • 23. TD(D)D vs Reasoners. SPARQL test cases detect a subset of the validation errors detectable by an OWL reasoner: limited by SPARQL endpoint reasoning support & limitations of the OWL-to-SPARQL translation. SPARQL test cases detect validation errors not expressible in OWL. OWL reasoning is often not feasible on large datasets. Datasets are already deployed and accessible via SPARQL endpoints. Pattern library: a more user-friendly approach for building validation rules compared to modelling OWL axioms; requires familiarity; non-common validations require manual SPARQL test cases.
  • 24. Data Engineering Ontology. Input / output entirely in RDF. Model the methodology in OWL: test suites, test cases, patterns, auto generators. Strict, to serve as a validation layer. Four different levels of error reporting: simple test case report (success, fail) / enriched with counts; violation instance reporting / enriched with annotations. Reuse dcterms, prov, spin, rlog.
  • 25. Data Engineering Ontology - Definition & Generation
  • 26. Data Engineering Ontology - Result Representation
  • 27. Lemon & NIF test case elicitation. The RDFUnit Suite implements our methodology. Run on the Lemon & NIF ontologies. TAGs could not yet handle some complex owl:Restrictions: owl:unionOf, owl:allValuesFrom, owl:someValuesFrom, owl:hasSelf and some rdfs:subPropertyOf cases. Manual test cases for constraints not captured in OWL. Test cases per type (columns: Total, Domain, Range, Datatype, Card., Disj., Func., I. Func., Manual): Lemon 182 40 34 1 29 64 3 1 10; NIF 96 42 24 4 6 10 10
  • 28. Example of manual Lemon test case. lemon:narrower denotes that one sense of a word is narrower than the other and must never be symmetric or contain cycles:
        SELECT DISTINCT ?s WHERE {
          ?s lemon:narrower+ ?narrower .
          ?narrower lemon:narrower+ ?s . }
      lemon:language must not have a language tag (RDF 1.1 to the rescue):
        SELECT DISTINCT ?s WHERE {
          ?s lemon:language ?v1 .
          FILTER ( lang(?v1) != "" ) }
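The first query on slide 28 selects every subject that can reach itself through one or more lemon:narrower steps. A minimal sketch of the same check in Python, assuming the narrower statements are available as `(subject, object)` pairs; the function name and the sample sense identifiers are hypothetical.

```python
from collections import defaultdict


def subjects_in_narrower_cycles(edges):
    """Return every node that reaches itself via one or more narrower steps."""
    graph = defaultdict(set)
    for subject, narrower in edges:
        graph[subject].add(narrower)

    def reaches_itself(start):
        # Iterative depth-first search over the narrower relation.
        stack, seen = list(graph[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return False

    return sorted(node for node in list(graph) if reaches_itself(node))


# Hypothetical data: a -> b -> c -> a is a cycle; d and y only hang off it.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("x", "y")]
print(subjects_in_narrower_cycles(edges))
```

A symmetric pair such as `("p", "q"), ("q", "p")` is simply a cycle of length two, so the same check covers the "never symmetric" part of the constraint.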
  • 29. Example of manual NIF test case. Ensure that the nif:beginIndex & nif:endIndex indices are correct:
        SELECT DISTINCT ?s WHERE {
          ?s nif:anchorOf ?anchorOf ;
             nif:beginIndex ?beginIndex ;
             nif:endIndex ?endIndex ;
             nif:referenceContext [ nif:isString ?referenceString ] .
          BIND (SUBSTR(?referenceString, ?beginIndex, (?endIndex - ?beginIndex)) AS ?test) .
          FILTER ( str(?test) != str(?anchorOf) ) . }
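The idea behind the test on slide 29 is that the anchor text must equal the slice of the context string delimited by the two indices. A minimal sketch in Python, under the assumption that offsets are zero-based with an exclusive end, which is what Python slicing uses; the function name and the sample annotations are hypothetical.

```python
def mismatched_anchors(context, annotations):
    """Return the (begin, end, anchor) triples whose slice differs from the anchor."""
    return [
        (begin, end, anchor)
        for begin, end, anchor in annotations
        if context[begin:end] != anchor
    ]


context = "My dog likes pizza"
# Hypothetical annotations, assuming zero-based, end-exclusive offsets.
annotations = [
    (3, 6, "dog"),      # context[3:6] is "dog"
    (13, 18, "pizza"),  # context[13:18] is "pizza"
    (0, 2, "dog"),      # context[0:2] is "My", so this one is a violation
]
print(mismatched_anchors(context, annotations))
```

When porting such a check between languages, the indexing convention needs attention: SPARQL's SUBSTR counts positions from 1, whereas Python slices count from 0.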
  • 30. Evaluation Datasets
        Name | Description | Ontology | Type
        lemon datasets
        LemonUby Wiktionary EN | Conversion of the English Wiktionary into UBY-LMF model | lemon, UBY-LMF | Dictionary
        LemonUby Wiktionary DE | Conversion of the German Wiktionary into UBY-LMF model | lemon, UBY-LMF | Dictionary
        LemonUby Wordnet | Conversion of the Princeton WordNet 3.0 into UBY-LMF model | lemon, UBY-LMF | WordNet
        DBpedia Wiktionary | Conversion of the English Wiktionary into lemon | lemon | Dictionary
        QHL | Multilingual translation graph from more than 50 lexicons | lemon | Dictionary
        NIF datasets
        Wikilinks | Sample of 60976 randomly selected phrases linked to Wikipedia articles | NIF | NER
        DBpedia Spotlight dataset | 58 manually NE annotated natural language sentences | NIF | NER
        KORE 50 evaluation dataset | 50 NE annotated natural language sentences from the AIDA corpus | NIF | NER
        News-100 | 100 manually annotated German news articles | NIF | NER
        RSS-500 | 500 manually annotated sentences from 1,457 RSS feeds | NIF | NER
        Reuters-128 | 128 news articles manually curated | NIF | NER
  • 31. Evaluation results (SC = success, FL = fail, TO = timeout, ER = error)
        Dataset      Size  SC   FL  TO  ER  Auto Errors  Man Errors  MWarn  MInfo
        WiktDBp      60M   177  5   -   -   3.746.103    7.521.791   -      3.582.837
        WktEN        8M    168  14  -   -   752.018      394.766     -      633.270
        WktDE        2M    170  12  -   -   273.109      66.268      -      155.598
        Wordnet      4M    166  16  -   -   257.228      36          -      257.204
        QHL          3M    170  11  -   1   433.118      538.933     -      538.016
        Wikilinks    0.6M  91   4   -   1   141.528      21.246      -      -
        News-100     13K   91   2   -   3   3.510        -           -      -
        RSS-500      10K   91   2   -   3   3.000        -           -      -
        Reuters-128  7K    91   2   -   3   2.016        -           -      -
        Spotlight    3K    92   3   -   1   662          68          -      -
        KORE50       2K    89   6   -   1   301          55          -      -
  • 32. Conclusion. Extended a previously introduced methodology for test-driven quality assessment. Data engineering ontology. Devised 277 test cases for NLP datasets using the Lemon and NIF vocabularies. Revealed a substantial number of errors for Lemon & NIF datasets. Future directions: extend the test cases to more NLP ontologies (MARL, NERD, ITSRDF); automatic dependencies between test cases; wrap RDFUnit for NLP services (integrated in NIF).
  • 33. Thank you! Dimitris Kontokostas. With kind support of John McCrae (Lemon model). http://rdfunit.aksw.org http://github.com/AKSW/RDFUnit #eswc2014kontokostas