
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)


  1. Validating JSON, XML and CSV data with SHACL-like constraints
     Péter Király, GWDG (Göttingen), pkiraly@gwdg.de
     Deutsche Initiative für Netzwerkinformation e.V., Kompetenzzentrum Interoperable Metadaten (KIM) Workshop, 2022-05-02
     https://github.com/pkiraly/metadata-qa-api
  2. Shapes Constraint Language (SHACL)
     A language for validating RDF graphs against a set of conditions (themselves expressed as RDF graphs).

         ex:PersonShape
             a sh:NodeShape ;
             sh:targetClass ex:Person ;    # checks persons
             sh:property [
                 sh:path ex:ssn ;          # checks social security nr.
                 sh:maxCount 1 ;
                 sh:datatype xsd:string ;
                 sh:pattern "^\\d{3}-\\d{2}-\\d{4}$" ;
             ] .
  3. Metadata Quality Assessment Framework (MQAF) API
     ★ open source software for metadata quality assessment
     ★ quality dimensions: completeness, multilinguality, uniqueness, etc.
     ★ extensions: Europeana, MARC, Deutsche Digitale Bibliothek
     ★ Java API + command line interface (in progress)
     ★ reads XML, JSON, CSV, MARC
     ★ highly configurable
     ★ adaptable to different metadata schemas
  4. RDF agnostic SHACL tests (a subset of SHACL)
     Cardinality: minCount <number>, maxCount <number>
     Value range: minExclusive <number>, minInclusive <number>, maxExclusive <number>, maxInclusive <number>
     String: minLength <number>, maxLength <number>, hasValue <String>, in [String1, ..., StringN], pattern <regular expression>, minWords <number>, maxWords <number>
     Comparison of properties: equals <field label>, disjoint <field label>, lessThan <field label>, lessThanOrEquals <field label>
     Logical operators: and [<rule1>, ..., <ruleN>], or [<rule1>, ..., <ruleN>], not [<rule1>, ..., <ruleN>]
  5. MQAF API’s SHACL tests
     Cardinality: minCount <number>, maxCount <number>
     Value range: minExclusive <number>, minInclusive <number>, maxExclusive <number>, maxInclusive <number>
     String: minLength <number>, maxLength <number>, hasValue <String>, in [String1, ..., StringN], pattern <regular expression>, minWords <number>, maxWords <number>
     Comparison of properties: equals <field label>, disjoint <field label>, lessThan <field label>, lessThanOrEquals <field label>
     Logical operators: and [<rule1>, ..., <ruleN>], or [<rule1>, ..., <ruleN>], not [<rule1>, ..., <ruleN>]
     Extras: contentType [type1, ..., typeN], unique <boolean>, dependencies [id1, id2, ..., idN], dimension [criteria...] (min/max + Width/Height/Shortside/Longside)
     Properties: id, description, failureScore, successScore, hidden, skip
  6. abstracting the address of data element
     XML, JSON, CSV and MARC21 all have addressable data elements (branches), each with its own addressing language:

         format    addressing language
         XML       XPath
         JSON      JSONPath
         CSV       column names
         MARC21    MARCSpec
  7. abstracting data element retrieval
     [Diagram: XML, JSON, CSV and MARC21 → data element selector → uniform data structure. The selector consults the schema definition: "May I get the title?" "Title’s address is //head/title".]
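The selector idea on slides 6 and 7 can be illustrated with a small Python sketch. This is not the MQAF implementation: the class names are hypothetical, the JSON selector only understands top-level keys (a stand-in for JSONPath), and the XML selector relies on the limited XPath subset of Python's `xml.etree.ElementTree`. The point is only that each format gets its own address while the caller always receives a list of values.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


class XmlSelector:
    """Hypothetical selector: addresses use ElementTree's limited XPath subset."""

    def __init__(self, text):
        self.root = ET.fromstring(text)

    def get(self, address):
        return [element.text for element in self.root.findall(address)]


class JsonSelector:
    """Hypothetical selector: addresses are top-level keys, not full JSONPath."""

    def __init__(self, text):
        self.record = json.loads(text)

    def get(self, address):
        value = self.record.get(address)
        if value is None:
            return []
        return value if isinstance(value, list) else [value]


class CsvSelector:
    """Hypothetical selector: addresses are column names; the first row is the header."""

    def __init__(self, text):
        self.row = next(csv.DictReader(io.StringIO(text)))

    def get(self, address):
        value = self.row.get(address)
        return [value] if value else []


# The same logical field ("title") with a format-specific address.
sources = [
    (XmlSelector("<html><head><title>Faust</title></head></html>"), ".//head/title"),
    (JsonSelector('{"title": "Faust"}'), "title"),
    (CsvSelector("title,url\nFaust,https://example.org/1\n"), "title"),
]
titles = [selector.get(address) for selector, address in sources]
print(titles)
```

Whatever the input format, the validation code downstream only sees lists of strings, which is what makes the constraints format-agnostic.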
  8. schema definition

     Java API

         Schema schema = new BaseSchema()
           .setFormat(Format.CSV)
           .addField(
             new JsonBranch("title", "title")
               .setRule(
                 new Rule()
                   .withDisjoint("description")))
           .addField(
             new JsonBranch("url", "url")
               .setExtractable(true)
               .setRule(
                 new Rule()
                   .withMinCount(1)
                   .withMaxCount(1)
                   .withPattern("^https?://.*$")));

     YAML configuration file

         format: csv
         fields:
         - name: title
           rules:
             disjoint: description
         - name: url
           extractable: true
           rules:
             minCount: 1
             maxCount: 1
             pattern: ^https?://.*$

     JSON configuration file

         {
           "format": "csv",
           "fields": [
             {
               "name": "title",
               "rules": [
                 {"disjoint": "description"}
               ]
             },
             {
               "name": "url",
               "extractable": true,
               "rules": [
                 {"minCount": 1, "maxCount": 1, "pattern": "^https?://.*$"}
               ]
             }
           ]
         }
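To show how little structure such a configuration needs, the following Python sketch parses the JSON variant from slide 8 (written with straight quotes and closed brackets) using only the standard library and lists every constraint per field. The helper name `list_constraints` is hypothetical; the real library reads this file through its own Java classes.

```python
import json

# The JSON configuration from slide 8, embedded as a string.
CONFIG = """
{
  "format": "csv",
  "fields": [
    {"name": "title", "rules": [{"disjoint": "description"}]},
    {"name": "url", "extractable": true,
     "rules": [{"minCount": 1, "maxCount": 1, "pattern": "^https?://.*$"}]}
  ]
}
"""


def list_constraints(config_text):
    """Return (field name, constraint name, argument) triples from a configuration."""
    config = json.loads(config_text)
    triples = []
    for field in config["fields"]:
        for rule in field.get("rules", []):
            for constraint, argument in rule.items():
                triples.append((field["name"], constraint, argument))
    return triples


constraints = list_constraints(CONFIG)
for triple in constraints:
    print(triple)
```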
  9. one and only one data element instance

         - name: about
           path: $.['about']
           rules:
           - minCount: 1
           - maxCount: 1
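The cardinality test on slide 9 amounts to counting the instances a selector returns. A minimal Python sketch, with a hypothetical function name and the assumption that the field's values arrive as a list:

```python
def check_cardinality(values, min_count=None, max_count=None):
    """Check the number of instances of a data element against optional bounds."""
    count = len(values)
    if min_count is not None and count < min_count:
        return False
    if max_count is not None and count > max_count:
        return False
    return True


# "One and only one" is minCount 1 combined with maxCount 1.
exactly_one = {"min_count": 1, "max_count": 1}
print(check_cardinality(["record-1"], **exactly_one))
print(check_cardinality([], **exactly_one))
print(check_cardinality(["a", "b"], **exactly_one))
```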
  10. numeric value constraints

      1.0 <= price <= 2.0

          - name: price
            path: $.['price']
            rules:
            - and:
              - minInclusive: 1.0
              - maxInclusive: 2.0

      1.0 < price < 2.0

          - name: price
            path: $.['price']
            rules:
            - and:
              - minExclusive: 1.0
              - maxExclusive: 2.0
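The difference between the inclusive and exclusive bounds on slide 10 only shows at the boundary values. A short Python sketch (hypothetical function name, not the library's code) makes that explicit:

```python
def in_range(value, min_inclusive=None, max_inclusive=None,
             min_exclusive=None, max_exclusive=None):
    """Check a number against any combination of the four range constraints."""
    if min_inclusive is not None and value < min_inclusive:
        return False
    if max_inclusive is not None and value > max_inclusive:
        return False
    if min_exclusive is not None and value <= min_exclusive:
        return False
    if max_exclusive is not None and value >= max_exclusive:
        return False
    return True


# The boundary value 1.0 passes the inclusive test but fails the exclusive one.
print(in_range(1.0, min_inclusive=1.0, max_inclusive=2.0))
print(in_range(1.0, min_exclusive=1.0, max_exclusive=2.0))
```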
  11. string constraints / length

      length(about) >= 1

          - name: about
            path: $.['about']
            rules:
            - minLength: 1

      5 >= length(about) >= 3

          - name: about
            path: $.['about']
            rules:
            - and:
              - minLength: 3
              - maxLength: 5
  12. string constraints / fixed values

      status == "published"

          - name: status
            path: $.['status']
            rules:
            - hasValue: published

      type == "dataverse" or type == "dataset" or type == "file"

          - name: type
            path: $.['type']
            rules:
            - in: [dataverse, dataset, file]
  13. string constraints / pattern

      thumbnail is an image or PDF file

          - name: thumbnail
            path: oai:record/dc:identifier[@type='binary']
            rules:
            - pattern: ^https?://.*\.(jpe?g|pdf|png|tiff?|gif)$
  14. string constraints / number of words

      nr_words(about) >= 2

          - name: about
            path: $.['about']
            rules:
            - minWords: 2
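The string constraints on slides 11 to 14 can be read as simple predicates over a single value. The Python sketch below is an illustration under assumptions, not the library's logic: `check_string` is a hypothetical name, `pattern` is applied with `re.search` (so anchoring is left to the pattern itself), and words are counted by splitting on whitespace.

```python
import re


def check_string(value, rule):
    """Return True if the value satisfies every constraint in the rule mapping."""
    checks = {
        "minLength": lambda arg: len(value) >= arg,
        "maxLength": lambda arg: len(value) <= arg,
        "hasValue": lambda arg: value == arg,
        "in": lambda arg: value in arg,
        # Assumption: search semantics; the pattern carries its own ^ and $.
        "pattern": lambda arg: re.search(arg, value) is not None,
        # Assumption: a word is any run of non-whitespace characters.
        "minWords": lambda arg: len(value.split()) >= arg,
        "maxWords": lambda arg: len(value.split()) <= arg,
    }
    return all(checks[name](arg) for name, arg in rule.items())


print(check_string("abcd", {"minLength": 3, "maxLength": 5}))
print(check_string("file", {"in": ["dataverse", "dataset", "file"]}))
print(check_string("a single sentence", {"minWords": 2}))
```

Passing a constraint name that is not in the table raises `KeyError`, which is a deliberate choice here so that typos in a rule do not pass silently.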
  15. comparisons of data elements

      id == isbn

          fields:
          - name: id
            path: $.['id']
            rules:
            - equals: isbn
          - name: isbn
            path: $.['isbn']
  16. comparisons of data elements

      title != description

          fields:
          - name: title
            path: $.['title']
            rules:
            - disjoint: description
          - name: description
            path: $.['description']
  17. comparisons of data elements

      startingPage <= endingPage

          - name: startingPage
            path: startingPage
            rules:
            - lessThanOrEquals: endingPage
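The comparisons on slides 15 to 17 relate the values of two fields within one record. The Python sketch below assumes SHACL-style semantics over value sets (`equals`: same set of values; `disjoint`: no shared value; `lessThanOrEquals`: every value of the first field is at most every value of the second). Whether MQAF follows exactly these semantics for multi-valued fields is an assumption; the function names and the sample record are hypothetical.

```python
def equals(record, field, other):
    """Both fields hold the same set of values."""
    return set(record[field]) == set(record[other])


def disjoint(record, field, other):
    """The two fields share no value."""
    return set(record[field]).isdisjoint(record[other])


def less_than_or_equals(record, field, other):
    """Every value of the first field is <= every value of the second."""
    return all(a <= b for a in record[field] for b in record[other])


# Hypothetical record: each field maps to the list of values found for it.
record = {
    "id": ["978-3-16-148410-0"],
    "isbn": ["978-3-16-148410-0"],
    "title": ["Faust"],
    "description": ["A tragedy in two parts"],
    "startingPage": [10],
    "endingPage": [25],
}

print(equals(record, "id", "isbn"))
print(disjoint(record, "title", "description"))
print(less_than_or_equals(record, "startingPage", "endingPage"))
```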
  18. logical operations

      and

          - name: id
            path: oai:record/dc:identifier
            rules:
            - and:
              - minCount: 1
              - maxCount: 1
              - minLength: 1
  19. logical operations

      or

          - name: thumbnail
            path: oai:record/dc:identifier
            rules:
            - or:
              - pattern: ^.*\.(jpe?g|png)$
              - contentType:
                - image/jpeg
                - image/png
  20. logical operations

      not

          - name: title
            path: $.['title']
            rules:
            - not:
              - equals: description
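The nesting of `and`, `or` and `not` on slides 18 to 20 suggests a recursive evaluation. The Python sketch below is one way to read it, not the library's implementation: `evaluate` and `make_leaf_checker` are hypothetical names, only three leaf constraints are supported, and `not` over a list is assumed to pass when none of the listed rules passes.

```python
def evaluate(rule, check_leaf):
    """Recursively evaluate a rule mapping that may nest and / or / not."""
    results = []
    for name, arg in rule.items():
        if name == "and":
            results.append(all(evaluate(sub, check_leaf) for sub in arg))
        elif name == "or":
            results.append(any(evaluate(sub, check_leaf) for sub in arg))
        elif name == "not":
            # Assumption: "not" passes only if no listed rule passes.
            results.append(not any(evaluate(sub, check_leaf) for sub in arg))
        else:
            results.append(check_leaf(name, arg))
    return all(results)


def make_leaf_checker(values):
    """Build a checker for a few leaf constraints over one field's values."""
    def check_leaf(name, arg):
        if name == "minCount":
            return len(values) >= arg
        if name == "maxCount":
            return len(values) <= arg
        if name == "minLength":
            return all(len(value) >= arg for value in values)
        raise ValueError(f"unsupported constraint: {name}")
    return check_leaf


# The "and" rule from slide 18: exactly one non-empty identifier.
id_rule = {"and": [{"minCount": 1}, {"maxCount": 1}, {"minLength": 1}]}
print(evaluate(id_rule, make_leaf_checker(["oai:example:1"])))
print(evaluate(id_rule, make_leaf_checker(["a", "b"])))
```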
  21. extras

      content type

          - name: thumbnail
            path: oai:record/dc:identifier[@type='binary']
            rules:
            - contentType: [image/jpeg, image/png, …]
  22. extras

      unique value

          - name: id
            path: oai:record/dc:identifier
            rules:
            - unique: true
  23. extras

      only if other test has been passed

          - name: url
            path: oai:record/dc:identifier[@type='URL']
            rules:
            - id: Q-4.4
              description: Both a media file and a link to an object are referenced in context.
              dependencies: [Q-3.0, Q-4.0]
  24. extras

      image dimensions

          - name: thumbnail
            path: oai:record/dc:identifier[@type='binary']
            rules:
            - id: 3.1
              dimension:
                minWidth: 200
                minHeight: 200
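The `dependencies` extra ("only if other test has been passed") can be sketched as a gate in front of each rule. The Python example below makes two assumptions that the slides do not state: rules are listed after the rules they depend on, and a rule whose dependencies did not all pass is reported as NA. The `check` key holding a callable is a hypothetical stand-in for the real test.

```python
def run_rules(rules):
    """Run rules in order, skipping any rule whose dependencies did not all pass."""
    statuses = {}
    for rule in rules:
        dependencies = rule.get("dependencies", [])
        if all(statuses.get(dep) == "PASSED" for dep in dependencies):
            statuses[rule["id"]] = "PASSED" if rule["check"]() else "FAILED"
        else:
            # Assumption: an unmet dependency yields NA rather than FAILED.
            statuses[rule["id"]] = "NA"
    return statuses


# Hypothetical rule set using the identifiers from the slide.
rules = [
    {"id": "Q-3.0", "check": lambda: True},
    {"id": "Q-4.0", "check": lambda: False},
    {"id": "Q-4.4", "dependencies": ["Q-3.0", "Q-4.0"], "check": lambda: True},
]
statuses = run_rules(rules)
print(statuses)
```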
  25. other properties
      id: identifier, used in the output and in internal references
      description: explains what the rule checks
      failureScore: a numerical score assigned if the test fails
      successScore: a numerical score assigned if the test passes
      hidden: run the test, but hide it from the output
      skip: do not run the test for now (for debugging purposes)
  26. raw output
      ★ for each test:
        ○ status: PASSED, FAILED, NA (if the data element is not available)
        ○ score: the value of successScore (if passed), failureScore (if failed) or 0
      ★ total score
      The output can be CSV, JSON or Java objects (configurable).
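The relation between status and score on slide 26 can be written down in a few lines. This Python sketch uses hypothetical function names and sample values, and assumes that the total score is the plain sum of the per-test scores.

```python
def score(status, success_score=0, failure_score=0):
    """Map a test status to its score: successScore, failureScore, or 0 for NA."""
    if status == "PASSED":
        return success_score
    if status == "FAILED":
        return failure_score
    return 0


def total_score(results):
    """Sum the scores of (rule id, status, successScore, failureScore) tuples."""
    return sum(score(status, success, failure)
               for _rule_id, status, success, failure in results)


# Hypothetical results for one record.
results = [
    ("Q-1.1", "PASSED", 3, -3),
    ("Q-1.2", "FAILED", 2, -1),
    ("Q-1.3", "NA", 5, -5),
]
print(total_score(results))  # 3 + (-1) + 0 = 2
```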
  27. visualization for metadata managers / single record
  28. aggregation
  29. status and scores
  30. workflow
      1. ingest
      2. measure records
      3. aggregate
      4. report
      5. evaluate with experts
      [Diagram labels: catalogue, quality assessment tool, improve records]
  31. research partners
      early adopters and contributors
      ★ Miel Vander Sande (meemoo, Belgium)
      ★ Richard Palmer (Victoria and Albert Museum, Great Britain)
      Deutsche Digitale Bibliothek
      ★ Francesca Schulze
      ★ Cosmina Berta
      ★ Stefanie Rühle
      ★ Claudia Effenberger
      ★ Letitia-Venetia Mölck
      special thanks
      ★ Juliane Stiller
