NLP Data Cleansing Based on Linguistic Ontology Constraints

Slides for the following paper: NLP Data Cleansing Based on Linguistic Ontology Constraints

Abstract: Linked Data comprises an unprecedented volume of structured data on the Web and is being adopted by an increasing number of domains. However, the varying quality of published data forms a barrier to further adoption, especially for Linked Data consumers. In this paper, we extend a previously developed methodology for Linked Data quality assessment, which is inspired by test-driven software development. Specifically, we enrich it with ontological support and different levels of result reporting, and describe how the method is applied in the Natural Language Processing (NLP) area. Compared to other domains, such as biology, NLP is a late Linked Data adopter. However, it has seen a steep rise of activity in the creation of data and ontologies, and quality assessment has become an important need for NLP datasets. In our study, we analysed 11 datasets using the lemon and NIF vocabularies with 277 test cases and point out common quality issues.

NLP Data Cleansing Based on Linguistic Ontology Constraints

  1. NLP Data Cleansing Based on Linguistic Ontology Constraints. Dimitris Kontokostas (1, 3), Martin Brümmer (1), Sebastian Hellmann (1, 3), Jens Lehmann (1), Lazaros Ioannidis (2). Affiliations: (1) AKSW, University of Leipzig; (2) Aristotle University of Thessaloniki; (3) DBpedia Association. 2014-05-27. Kontokostas et al. (ESWC2014) NLP Data Cleansing 2014-05-27 1 / 33
  2. LOD Cloud (2011)
  3. LOD Cloud (2011)
  4. Linguistic Communities
  5. Linguistic workshops & conferences
  6. Linguistic workshops & conferences
  7. Linguistic LOD Cloud (LLOD Cloud)
  8. Problem definition
      • Linguistic (related) data: purpose-driven definition
      • Increasing data, ontologies and vocabularies
      • Newcomers → hard to understand the ontologies / follow updates
      • Validation is essential
          • Many different pipelines (parsing, annotation, disambiguation, etc.)
          • Errors are propagated
          • Partially provided by maintainers (incomplete)
      • Focus on Lemon and NIF (proof of concept)
  9. Lemon - Lexicon Model for Ontologies
      • Models lexicon and machine-readable dictionaries
      • RDF-native form
      • Linguistically sound structure (LMF)
      • Separation of the lexicon and ontology layers
      • Linking to data categories → arbitrarily complex linguistic description
      • Principle of least power: the less expressive the language, the more reusable the data
      • http://lemon-model.net/
  10. Lemon - Example
      :lexicon a lemon:Lexicon ;
          lemon:entry :Pizza , :Tortilla .
      :Pizza a lemon:LexicalEntry ;
          lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .
      :Tortilla a lemon:LexicalEntry ;
          lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .
  11. Lemon - Example (Correct)
      :lexicon a lemon:Lexicon ;
          lemon:language "en" ;
          lemon:entry :Pizza , :Tortilla .
      :Pizza a lemon:LexicalEntry ;
          lemon:canonicalForm [ lemon:writtenRep "Pizza"@en ] ;
          lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .
      :Tortilla a lemon:LexicalEntry ;
          lemon:canonicalForm [ lemon:writtenRep "Tortilla"@en ] ;
          lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .
  12. NIF - NLP Interchange Format
      • RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations
      • In a nutshell:
          • Logical formalisation of strings and annotations
          • Builds on existing standards, e.g. RDF, LAF/GrAF, RFC 5147
          • Reuse of the RDF tool stack
          • Decreases development cost for integration
      • Integrated in: DBpedia Spotlight, Stanford Core NLP, OpenNLP, RDFace, Validator, CoNLL converter, ...
  13. NIF - Overview
  14. NIF - Example
      <http://abc.com/doc#char=0,17>
          a nif:Context ;
          a nif:RFC147String ;
          nif:beginIndex 0 ;
          nif:endIndex 17 ;
          nif:isString "My dog likes pizza" .
      <http://abc.com/doc#char=2,7>
          a nif:RFC5147String ;
          nif:anchorOf "dog" ;
          nif:referenceContext <http://abc.com/doc#char=0,17> .
          itsrdf:taClassRef dbo:Animal ;
  15. NIF - Example (Correct)
      <http://abc.com/doc#char=0,18>
          a nif:Context ;
          a nif:RFC5147String ;
          nif:beginIndex "0"^^xsd:nonNegativeInteger ;
          nif:endIndex "18"^^xsd:nonNegativeInteger ;
          nif:isString "My dog likes pizza"^^xsd:string .
      <http://abc.com/doc#char=2,7>
          a nif:RFC5147String ;
          nif:beginIndex "2"^^xsd:nonNegativeInteger ;
          nif:endIndex "7"^^xsd:nonNegativeInteger ;
          nif:anchorOf "dog"^^xsd:string ;
          nif:referenceContext <http://abc.com/doc#char=0,18> ;
          itsrdf:taClassRef dbo:Animal .
  16. Maintainer validation
      • Lemon: Python script
          • 24 tests for structural criteria
          • too slow on big datasets
          • poor reporting
      • NIF: SPARQL queries
          • 11 tests for common errors
          • not complete
  17. Built on previous work
      • Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri. In WWW 2014.
          • Horizontal, multi-domain data quality assessment
          • Massive detection of errors for five large-scale LOD datasets
          • 291 vocabularies, independent of their domain or purpose
      • New contributions:
          • Relation to OWL reasoners
          • Test Driven Data Engineering Ontology
          • Domain-specific validation
          • Quickly improving existing validation options provided by maintainers
  18. Test-Driven Data Development Methodology
      • Test case: a data constraint that involves one or more triples
      • Test suite: a set of test cases for testing a dataset
      • Status: Success, Fail, Timeout (complexity) or Error (e.g. network)
      • Fail: Error, warning or notice
      • RDF: basis for both data and schema
      • Unified model facilitates automatic test case generation
      • SPARQL serves as the test case definition language
  19. Example test case
      • A nif:RFC5147String should never have a nif:beginIndex greater than nif:endIndex
      • Test cases are written in SPARQL:
            SELECT ?s WHERE {
              ?s nif:beginIndex ?v1 .
              ?s nif:endIndex ?v2 .
              FILTER ( ?v1 > ?v2 ) }
      • We query for errors
          • Success: query returns an empty result set
          • Fail: query returns results; every result is a violation instance
          • Timeout / Error: needs further investigation of SPARQL engine capabilities, query syntax or query complexity
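The status logic on this slide can be illustrated without a SPARQL engine. The following is a minimal sketch in Python, not RDFUnit's implementation: the names `Status`, `begin_after_end` and `run_test_case` are hypothetical, and the dataset is assumed to be a plain dictionary mapping a string URI to its `(beginIndex, endIndex)` pair.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "Success"
    FAIL = "Fail"
    TIMEOUT = "Timeout"
    ERROR = "Error"


def begin_after_end(strings):
    """Stand-in for the SPARQL query: return every subject whose
    beginIndex is greater than its endIndex (the violations)."""
    return [s for s, (begin, end) in strings.items() if begin > end]


def run_test_case(query, data):
    """Run one test case and map the outcome to a status.

    An empty result means Success; any result row is a violation
    instance and means Fail; exceptions map to Timeout or Error.
    """
    try:
        violations = query(data)
    except TimeoutError:
        return Status.TIMEOUT, []
    except Exception:
        return Status.ERROR, []
    return (Status.FAIL if violations else Status.SUCCESS), violations


# Hypothetical data: the second string has its indices swapped.
data = {"doc#char=0,18": (0, 18), "doc#char=7,2": (7, 2)}
status, violations = run_test_case(begin_after_end, data)
print(status.value, violations)
```

Because the query selects errors rather than valid data, the list of violations doubles as the detailed report for a failed test case.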
  20. Patterns & Bindings
      • Data Quality Test Patterns (DQTP): abstract patterns, which can be further refined into concrete data quality test cases using test pattern bindings
      • Existing library of 20 patterns
            SELECT ?s WHERE {
              ?s %%P1%% ?v1 .
              ?s %%P2%% ?v2 .
              FILTER ( ?v1 %%OP%% ?v2 ) }
      • Bindings: mapping of variables to valid pattern replacements
            P1 = nif:beginIndex
            P2 = nif:endIndex
            OP = >
      • Resulting test case:
            SELECT ?s WHERE {
              ?s nif:beginIndex ?v1 .
              ?s nif:endIndex ?v2 .
              FILTER ( ?v1 > ?v2 ) }
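Turning a pattern into a concrete test case is plain placeholder substitution. A minimal sketch in Python, assuming the `%%NAME%%` placeholder syntax shown on the slide; the function name `instantiate` is hypothetical and this is not RDFUnit's own code.

```python
import re

# The pattern from the slide, with %%...%% placeholders.
PATTERN = (
    "SELECT ?s WHERE { ?s %%P1%% ?v1 . ?s %%P2%% ?v2 . "
    "FILTER ( ?v1 %%OP%% ?v2 ) }"
)


def instantiate(pattern, bindings):
    """Replace every %%NAME%% placeholder with its bound value.

    Raises KeyError if the pattern uses a placeholder that has no binding,
    so an incomplete binding never yields a half-filled query.
    """
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no binding for placeholder {name}")
        return bindings[name]

    return re.sub(r"%%(\w+)%%", substitute, pattern)


query = instantiate(
    PATTERN,
    {"P1": "nif:beginIndex", "P2": "nif:endIndex", "OP": ">"},
)
print(query)
```

Failing on a missing binding is a design choice of this sketch: a query that still contains `%%OP%%` would be rejected by the SPARQL engine anyway, so it is cheaper to catch it before the query is sent.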
  21. Test Auto Generators (TAGs)
      • RDF(S) & OWL (partial) support
      • Query the schema for supported axioms:
            SELECT DISTINCT ?T1 ?T2 WHERE {
              ?T1 owl:disjointWith ?T2 . }
      • For every result, a binding to a pattern is generated and a test case is instantiated
      • Supported axioms at the moment:
          • RDFS: domain & range
          • OWL: minCardinality, maxCardinality, cardinality, functionalProperty, InverseFunctionalProperty, disjointClass, propertyDisjointWith, AsymmetricProperty and deprecated
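The generation step can be sketched as a loop over the rows returned by the schema query. The Python below is an illustration only: the disjointness pattern text, the function name `generate_disjointness_tests` and the `ex:` class names are assumptions made for the example, not taken from the slides or from RDFUnit.

```python
# Hypothetical pattern: a resource typed with two disjoint classes is a violation.
DISJOINT_PATTERN = "SELECT ?s WHERE { ?s a %%T1%% . ?s a %%T2%% . }"


def generate_disjointness_tests(axioms):
    """Create one concrete test query per (T1, T2) disjointness axiom.

    `axioms` stands for the result rows of the schema query
    SELECT DISTINCT ?T1 ?T2 WHERE { ?T1 owl:disjointWith ?T2 . }
    """
    tests = []
    for t1, t2 in axioms:
        query = DISJOINT_PATTERN.replace("%%T1%%", t1).replace("%%T2%%", t2)
        tests.append(query)
    return tests


# Hypothetical schema query results.
axioms = [("ex:Word", "ex:Phrase"), ("ex:Noun", "ex:Verb")]
tests = generate_disjointness_tests(axioms)
for test in tests:
    print(test)
```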
  22. Test Case Elicitation Workflow
  23. TD(D)D vs Reasoners
      • SPARQL test cases detect a subset of the validation errors detectable by an OWL reasoner
          • Limited by SPARQL endpoint reasoning support and by limitations of the OWL-to-SPARQL translation
      • SPARQL test cases detect validation errors not expressible in OWL
      • OWL reasoning is often not feasible on large datasets
      • Datasets are already deployed and accessible via SPARQL endpoints
      • The pattern library is a more user-friendly approach for building validation rules than modelling OWL axioms
          • requires familiarity
          • non-common validations require manual SPARQL test cases
  24. Data Engineering Ontology
      • Input / output entirely in RDF
      • Model the methodology in OWL: test suites, test cases, patterns, auto generators
      • Strict, to serve as a validation layer
      • Four different levels of error reporting
          • simple test case report (success, fail) / enriched with counts
          • violation instance reporting / enriched with annotations
      • Reuse dcterms, prov, spin, rlog
  25. Data Engineering Ontology - Definition Generation
  26. Data Engineering Ontology - Result Representation
  27. Lemon & NIF test case elicitation
      • RDFUnit Suite implements our methodology
      • Run on the Lemon and NIF ontologies
      • TAGs could not yet handle some complex owl:Restrictions
          • owl:unionOf, owl:allValuesFrom, owl:someValuesFrom, owl:hasSelf and some rdfs:subPropertyOf cases
      • Manual test cases for constraints not captured in OWL
      • Test cases per type (columns: Total, Domain, Range, Datatype, Card., Disj., Func., I. Func., Manual)
          • Lemon: Total 182, Domain 40, Range 34, Datatype 1, Card. 29, Disj. 64, Func. 3, I. Func. 1, Manual 10
          • NIF: Total 96, then 42, 24, 4, 6, 10, 10 in column order (only six of the eight type columns carry a value)
  28. Example of manual Lemon test cases
      • lemon:narrower denotes that one sense of a word is narrower than the other, and must never be symmetric or contain cycles
            SELECT DISTINCT ?s WHERE {
              ?s lemon:narrower+ ?narrower .
              ?narrower lemon:narrower+ ?s . }
      • lemon:language must not have a language tag (RDF 1.1 to the rescue)
            SELECT DISTINCT ?s WHERE {
              ?s lemon:language ?v1 .
              FILTER ( lang(?v1) != "" ) }
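The first query reports every sense that can reach itself by following lemon:narrower one or more times, which covers both symmetric pairs and longer cycles. A minimal Python sketch of the same check over an in-memory edge list; the function name `senses_in_narrower_cycle` and the sense identifiers are hypothetical.

```python
def senses_in_narrower_cycle(edges):
    """Return, sorted, every sense that reaches itself via one or more
    narrower links. `edges` is an iterable of (sense, narrower_sense) pairs."""
    graph = {}
    for sense, narrower in edges:
        graph.setdefault(sense, set()).add(narrower)

    def reachable(start):
        # Depth-first search over narrower links, excluding the zero-length path.
        seen, stack = set(), list(graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen

    return sorted(s for s in graph if s in reachable(s))


# Hypothetical senses: a -> b -> c -> a is a cycle, d -> e is fine.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
violations = senses_in_narrower_cycle(edges)
print(violations)
```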
  29. Example of manual NIF test case
      • Ensure that nif:beginIndex and nif:endIndex are correct
            SELECT DISTINCT ?s WHERE {
              ?s nif:anchorOf ?anchorOf ;
                 nif:beginIndex ?beginIndex ;
                 nif:endIndex ?endIndex ;
                 nif:referenceContext [ nif:isString ?referenceString ] .
              BIND ( SUBSTR(?referenceString, ?beginIndex, (?endIndex - ?beginIndex)) AS ?test ) .
              FILTER ( str(?test) != str(?anchorOf) ) . }
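The idea behind this test is to cut the annotated span out of the context string and compare it with the stored anchor text. A minimal Python sketch, assuming zero-based offsets with an exclusive end index (Python slice semantics); SPARQL's SUBSTR counts positions from 1, so the query above may treat the offsets differently. The function name `mismatched_anchors` and the example sentence are hypothetical.

```python
def mismatched_anchors(context, annotations):
    """Return the URIs whose anchor text differs from the span
    context[begin:end]. `annotations` holds (uri, begin, end, anchor) tuples."""
    return [
        uri
        for uri, begin, end, anchor in annotations
        if context[begin:end] != anchor
    ]


# Hypothetical context and annotations; the last one has a wrong end index.
context = "Pizza is tasty"
annotations = [
    ("doc#char=0,5", 0, 5, "Pizza"),
    ("doc#char=9,14", 9, 14, "tasty"),
    ("doc#char=6,9", 6, 9, "is"),
]
violations = mismatched_anchors(context, annotations)
print(violations)
```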
  30. Evaluation Datasets
      Name | Description | Ontology | Type
      lemon datasets
      LemonUby Wiktionary EN | Conversion of the English Wiktionary into UBY-LMF model | lemon, UBY-LMF | Dictionary
      LemonUby Wiktionary DE | Conversion of the German Wiktionary into UBY-LMF model | lemon, UBY-LMF | Dictionary
      LemonUby Wordnet | Conversion of the Princeton WordNet 3.0 into UBY-LMF model | lemon, UBY-LMF | WordNet
      DBpedia Wiktionary | Conversion of the English Wiktionary into lemon | lemon | Dictionary
      QHL | Multilingual translation graph from more than 50 lexicons | lemon | Dictionary
      NIF datasets
      Wikilinks | Sample of 60976 randomly selected phrases linked to Wikipedia articles | NIF | NER
      DBpedia Spotlight dataset | 58 manually NE annotated natural language sentences | NIF | NER
      KORE 50 evaluation dataset | 50 NE annotated natural language sentences from the AIDA corpus | NIF | NER
      News-100 | 100 manually annotated German news articles | NIF | NER
      RSS-500 | 500 manually annotated sentences from 1,457 RSS feeds | NIF | NER
      Reuters-128 | 128 news articles manually curated | NIF | NER
  31. Evaluation results (SC = success, FL = fail, TO = timeout, ER = error)
      Dataset | Size | SC | FL | TO | ER | Auto Errors | Man Errors | MWarn | MInfo
      WiktDBp | 60M | 177 | 5 | - | - | 3.746.103 | 7.521.791 | - | 3.582.837
      WktEN | 8M | 168 | 14 | - | - | 752.018 | 394.766 | - | 633.270
      WktDE | 2M | 170 | 12 | - | - | 273.109 | 66.268 | - | 155.598
      Wordnet | 4M | 166 | 16 | - | - | 257.228 | 36 | - | 257.204
      QHL | 3M | 170 | 11 | - | 1 | 433.118 | 538.933 | - | 538.016
      Wikilinks | 0.6M | 91 | 4 | - | 1 | 141.528 | 21.246 | - | -
      News-100 | 13K | 91 | 2 | - | 3 | 3.510 | - | - | -
      RSS-500 | 10K | 91 | 2 | - | 3 | 3.000 | - | - | -
      Reuters-128 | 7K | 91 | 2 | - | 3 | 2.016 | - | - | -
      Spotlight | 3K | 92 | 3 | - | 1 | 662 | 68 | - | -
      KORE50 | 2K | 89 | 6 | - | 1 | 301 | 55 | - | -
  32. Conclusion
      • Extended a previously introduced methodology for test-driven quality assessment
      • Data engineering ontology
      • Devised 277 test cases for NLP datasets using the Lemon and NIF vocabularies
      • Revealed a substantial number of errors in Lemon and NIF datasets
      • Future directions
          • extend the test cases to more NLP ontologies (MARL, NERD, ITSRDF)
          • automatic dependencies between test cases
          • wrap RDFUnit for NLP services (integrated in NIF)
  33. Thank you!
      • Dimitris Kontokostas
      • With kind support of John McCrae (Lemon model)
      • http://rdfunit.aksw.org
      • http://github.com/AKSW/RDFUnit
      • #eswc2014kontokostas