SHACL: Shaping the Big Ball of Data Mud

Richard Cyganiak, Principal Software Engineer at TopQuadrant
Nov. 19, 2016

Editor's Notes

  1. It’s amazing how many people have done incredible work. Massive effort shown in this pic. But there is some hype. Quite a few datasets are the output of a sloppy conversion script, thrown into a SPARQL store with some haphazard links to DBpedia. Their publishers run a handful of SPARQL queries as sanity checks, but do no in-depth quality control at all. Lots of data quality issues. Querying within a dataset can be hard enough; across datasets it is often impossible. If one dataset (e.g., DBpedia) changes, links break and often are never fixed.
  2. The talk will be about validation and SHACL, but I’d like to start by setting the scene. Where is the Semantic Web on the hype cycle? Arguably, it went over the bump twice already: with a focus on logic/AI around 2000, and a focus on Linked Data around 2010. I helped to fan the flames of the second hype. The base standards are no longer that exciting today. Overblown expectations have cooled off. It’s no longer expected to change the world. Getting stable and mature. Specific applications can be elsewhere on the cycle. See “Enterprise Taxonomy and Ontology Management”. That’s actually what TQ does.
  3. If you work with these technologies, life is pretty good these days, and still getting better. Maturing standards and tool support. And today we really understand what the technologies are good at, and what not.
  4. “Maps poorly to programming languages”: property names are not simple identifiers; every property can be multivalued; you need navigability along incoming and outgoing arcs; ordering is difficult. Semi-structured data is important in big data.
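     To make the mismatch concrete, here is a small hypothetical Turtle snippet (the ex: namespace and all names are invented). Property names are full IRIs rather than simple identifiers, and nothing stops a property from being multivalued, so it cannot simply become a scalar field on a class:

        @prefix ex: <http://example.org/ns#> .

        ex:alice
            ex:name  "Alice" ;                  # looks like a scalar field...
            ex:email "alice@example.org" ,
                     "alice.s@example.org" ;    # ...but any property may be multivalued
            ex:knows ex:bob .                   # the outgoing arc is easy to navigate; the
                                                # incoming arc (who knows Alice?) is not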
  5. We know where it works and where it doesn’t. It’s productive in a number of niches. RDF is good at dealing with Variety. (But not good enough: contextual validation, fuzzy/statistical matching for the semi-structured stuff.) Variety tends to make logic approaches difficult: there is no single global truth, so less OWL, more SPARQL.
  6. Tim Berners-Lee deconstructing a bag of crisps: the perfect metaphor for the strengths of the SW. Different information co-exists on the packaging:
     - the plain English “potato chips”
     - the nutrition information on the back, standardized by the U.S. Food and Drug Administration
     - some allergy information that many people don’t pay any attention to, but those with allergies read very carefully
     - the UPC code that can be read by any retail checkout machine in the world
     - some numbers on the bottom edge of the package that make no sense to him whatsoever
     Mixing and matching of different vocabularies, standardised by different organisations, intended for different consumers. Partial understanding. Once you have agreed on an identifier for a thing *and a location for data about it*, different data producers and consumers can use it without stepping on each others’ toes.
  7. The two main open source implementations of the technology stack, Jena and Sesame, are now at the Apache Foundation and at the Eclipse Foundation—big, established, mature, enterprisey organisations.
  8. So life is pretty good. Maturing technology stack, clearly understood strengths and weaknesses, productive niches, improving tools. But… we never solved validation. That’s kind of surprising. After all, each of these technologies has aspects that address these needs. Let’s review them one by one.
  9. Every class and property has a URI. The URI references an ontology that defines the term. So each triple describes itself, right? One of the major strengths, right? No. Actually, most of the meaning is just not given in the ontology. Too much of the meaning is implicit, or just written down in text somewhere and cannot be automatically checked. Let me give examples.
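     A hypothetical illustration (the ex: terms are invented): the triples below are perfectly well-formed, and ex:birthDate may even be defined in an ontology, but nothing machine-readable says the value must be a date, must be unique, or must lie in the past. Those expectations live only in prose:

        @prefix ex: <http://example.org/ns#> .

        # Well-formed RDF, and every term can resolve to an ontology...
        ex:alice ex:birthDate "next Tuesday" .   # ...yet nothing flags this
        ex:alice ex:birthDate "1985-02-30" .     # nor this second, malformed value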
  10. Arguably the most important ontology in existence. These are examples of things they want to validate in a tool for webmasters; they came out of the workshop that kicked off the Data Shapes WG. See https://www.w3.org/2001/sw/wiki/images/0/00/SimpleApplication-SpecificConstraintsforRDFModels.pdf and https://www.w3.org/TR/shacl-ucr/#uc23-schema.org-constraints
  11. DC is widely used. It’s easy enough to agree on calling a title “dc:title” and an author “dc:creator”, but different orgs have widely differing views on what constitutes a complete metadata record. DC application profiles emerged as a response. DC developed its own way to represent them. Not a standard, and not used apart from the DC community.
  12. I’ve been involved. We wrote constraints in prose, and added SPARQL queries to make them more formal/explicit. And yes, people can copy-paste them. But still no way of just running all of them automatically against a published dataset! And no error reporting: just true/false.
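     As a sketch of the pattern (a simplified paraphrase of one of the Data Cube integrity constraints, not the spec’s exact text): an ASK query can detect that some observation lacks a dataset, but it answers only true or false; it cannot say which observation is broken, or why:

        PREFIX qb: <http://purl.org/linked-data/cube#>

        # Returns true if a violation exists, but not where, or why.
        ASK WHERE {
            ?obs a qb:Observation .
            FILTER NOT EXISTS { ?obs qb:dataSet ?dataset }
        }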
  13. I’ve been involved. Mapping files are written in RDF; goal was to be very clear about what constitutes a valid mapping file. This is semi-formal. Surely this should be representable in some standard machine-readable way?
  14. Read/write Linked Data. Applications want to put constraints on the kind of data they can receive. Address book application wants to say that there should be an address in the RDF you PUT/POST. But completely punted on saying how to achieve it. “machine-readable ones facilitate better client interaction”—no shit!
  15. So, lots of initiatives that are serious about using SW in an interoperable and robust way end up just putting constraints in prose text, where it should really be in a machine-processable form. Same problem everywhere! But we have RDF Schema. SCHEMA!
  16. RDF Schema sounds analogous to XML Schema, but they really do very different things: XML Schema validates documents and rejects invalid ones, while RDFS definitions drive inference and never reject data.
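     A small hypothetical sketch of that difference (ex: invented): an rdfs:range declaration does not reject an unexpected value; under RDFS semantics it licenses an inference instead:

        @prefix ex:   <http://example.org/ns#> .
        @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

        ex:author rdfs:range ex:Person .

        # Looks like a type error...
        ex:book1 ex:author "Jane Doe" .

        # ...but no RDFS processor complains. A reasoner simply concludes
        # that the value denotes an ex:Person.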
  17. So RDFS is just not powerful enough. But OWL surely gets us there?
  18. Clark&Parsia. Use OWL syntax, but switch to a semantics based on CW and UNA. This works pretty well! But can be a bit confusing—if you find some OWL, what semantics is intended? And OWL, while expressive, lacks some things that one would like to have in validation.
  19. We saw the Data Cube example where SPARQL was used to query the graph to see if it’s complete. Isn’t that enough to solve all validation issues?
  20. SPIN is a technology introduced by TQ. A bunch of things (rules written in SPARQL, templated SPARQL queries, defining custom SPARQL functions, etc.) We have used this for years and it actually works very well.
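     A minimal sketch of the idea, assuming SPIN’s textual query form (the ex: class and property are invented): spin:constraint attaches a query to a class, ?this binds to each instance being checked, and an ASK constraint that evaluates to true signals a violation:

        @prefix ex:   <http://example.org/ns#> .
        @prefix spin: <http://spinrdf.org/spin#> .
        @prefix sp:   <http://spinrdf.org/sp#> .

        ex:Person spin:constraint [
            a sp:Ask ;
            sp:text """
                # Violation if this person has no email
                ASK WHERE {
                    FILTER NOT EXISTS { ?this ex:email ?email }
                }
            """
        ] .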
  21. Custom syntax. Somewhere between SPARQL, regular expressions, and grammar parsing. “regex for graphs.” Pretty cool. Concise. Needs new parsers.
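     For flavour, a hypothetical shape in the ShEx compact syntax (all names invented): each conforming node needs exactly one string name and one or more email IRIs:

        PREFIX ex:  <http://example.org/ns#>
        PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

        ex:PersonShape {
            ex:name  xsd:string ;   # exactly one by default
            ex:email IRI+           # one or more
        }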
  22. So, several good solutions around—but none has enough mindshare to take over. Meet at W3C, make a standard with the best aspects of each. (Or with the worst aspects of each—fingers crossed.)
  23. Some features and aspects are still highly controversial.
  24. When a violation occurs, the result is not just “false”. It’s a structure with info. You can process it in various ways. Just display it? Attach it to the right form field based on sh:path? Just count the violations per type in a large dataset? Different behaviour for different severities?
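     A sketch of what that structure looks like (the shape and data are invented; the report vocabulary is standard SHACL): validating ex:alice against the shape below produces a report whose sh:result node carries the focus node, path, severity, and a message that a UI can route to the right place:

        @prefix ex: <http://example.org/ns#> .
        @prefix sh: <http://www.w3.org/ns/shacl#> .

        # The shape: every ex:Person needs at least one email.
        ex:PersonShape
            a sh:NodeShape ;
            sh:targetClass ex:Person ;
            sh:property [
                sh:path ex:email ;
                sh:minCount 1 ;
                sh:message "A person must have an email address." ;
            ] .

        # The kind of report a validator returns for  ex:alice a ex:Person .
        [] a sh:ValidationReport ;
            sh:conforms false ;
            sh:result [
                a sh:ValidationResult ;
                sh:focusNode ex:alice ;
                sh:resultPath ex:email ;
                sh:resultSeverity sh:Violation ;
                sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
                sh:resultMessage "A person must have an email address." ;
            ] .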
  25. It’s still early days. Mostly individuals and organisations that are active in the working group.
  26. SHACL is getting really important to our products. We have made a major contribution: TQ’s Holger Knublauch is one of the editors of the spec. TBC is an SW IDE, a workbench for SW professionals. At its heart is a schema/ontology editor. It supports editing of SHACL constraints through a nice UI.
  27. EVN is a taxonomy and ontology management platform; EDG is a data governance solution. SHACL allows our customers to add custom constraints over their own data models. Very powerful.
  28. Note the suggestions for fixing the problem. This goes beyond standard SHACL, but it’s an obvious addition and very cool.
  29. Not sure how well maintained it is, or whether it follows the spec.
  30. Semantic Web Company. Nice application that shows using SHACL for bulk validation. Automated repair—somewhat similar to our suggestions extension.
  31. Dimitris Kontokostas (SHACL spec editor) and team at the University of Leipzig. Organising entire test suites, expressed originally in SPARQL but now with SHACL support, for data quality of large datasets. Used in the context of DBpedia.
  32. So, how do the parts of the stack fit together? High-level view. Let’s run with the metaphor that anyone can say anything about anything. First we should note: Just because you can say anything about anything doesn’t mean you should! RDF is triples. But we also call them RDF statements. Each triple is a statement of some fact.