Since its beginning, the Semantic Web research community has understood how crucial it is to equip practitioners with methods to transform non-RDF resources into RDF. Proposals focus on either engineering content transformations or accessing non-RDF resources with SPARQL. Existing solutions require users to learn specific mapping languages (e.g. RML), to know how to query and manipulate a variety of source formats (e.g. XPath, JsonPath), or to combine multiple languages (e.g. SPARQL Generate). In this paper, we explore an alternative solution and contribute a general-purpose meta-model for converting non-RDF resources into RDF: Facade-X. Our approach can be implemented by overriding the SERVICE operator and does not require extending the SPARQL syntax. We compare our approach with the state-of-the-art methods RML and SPARQL Generate and show that our solution has lower learning demands and cognitive complexity and is cheaper to implement and maintain, while having comparable extensibility and efficiency.
Facade-X: an opinionated approach to SPARQL Anything
1. Facade-X
an opinionated approach to SPARQL Anything
Enrico Daga, Luigi Asprino, Paul Mulholland, Aldo Gangemi
Semantics Conference, Amsterdam, 6/9/2021
https://arxiv.org/pdf/2106.02361.pdf
This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement GA101004746.
The communication reflects only the author’s view and the Research Executive Agency is not
responsible for any use that may be made of the information it contains.
2. Playing the soundtrack of our history
Preserving musical heritage
through knowledge graphs
Managing musical heritage collections
through knowledge graphs
Studying musical heritage through
(interlinked) knowledge graphs
https://spice-h2020.eu/ https://polifonia-project.eu/
4. Background
• A recent survey on KG benchmarking says data integration is the dominant
use case for KGs - [Atkin, 2021, in Lassila et al, 2021]
• The Semantic Web has always been concerned with methods to “lift” legacy content to
RDF:
• Targeting specific types/formats: Direct mapping, Tarql, Any23,
JSON2RDF, CSV2RDF, COW, SPARQL Micro-services [Michel, 2019]
• Mapping languages of several types (RML, ShExML, …): high
learning demands. [Dimou, 2014] [García-González, 2020]
• Extending SPARQL with custom features: SPARQL Generate, high
learning demands, difficult to extend to other formats. [Lefrançois,
2017]
• Cognitive complexity: solutions transfer data source complexity to the user
(e.g. know XPath for XML, JsonPath for JSON, …)
• End-user development [Lieberman, 2006]. Many SPARQL users fall into
the category of end-user developer. In a recent survey [Warren et al, 2018]
42% of SPARQL users are from non-IT areas, including social sciences and
the humanities, business and economics, and biomedical, engineering or
physical sciences.
5. Knowledge Graph Construction from structured resources
Iterative process:
• Observe: the resource (e.g. a CSV file)
• Design mappings to a target ontology
• Transform: execute the mappings
• Observe: compare / evaluate
Trial and error approach, many iterations
6. Knowledge Graph Construction, revisited
KG construction is a twofold job:
• perform a syntax/meta-model conversion (e.g. CSV to RDF)
• project semantics onto the data (applying a domain ontology)
A better model of the user cognitive process:
• Observe: the resource (e.g. a CSV file)
• Reengineering Design: what syntax/meta-model do I want?
• Remodelling Design: what semantics can we project?
• Transform: execute the mappings
• Observe: compare / evaluate
Trial and error approach, many iterations
7. Knowledge Graph Construction
an opinionated approach
• Reengineering: what syntax/meta-model do we
want?
• We cannot know what structure our user
wants but we know the meta-model: RDF
• Remodelling: what semantics do we project?
• SPARQL is great for projecting semantics
(change namespaces, create entities from
literals, adding types, sophisticated
relationships, composite structures, …)
• Can we use just SPARQL to do all of it?
@enridaga
8. Concept
Facade Design Pattern
From Object Oriented Programming
A single abstraction on different, alternative interfaces
https://en.wikipedia.org/wiki/Facade_pattern
An RDF facade?
• A common RDF structure over diverse
formats
• Focusing on the meta-model (data structure)
• Leaving domain semantics as is!
• apply the least possible “ontological
commitment”
• Problem Space: CSV, JSON, HTML, XML,
Binary (JPEG, PNG, …),Text
• Solution space: RDFS
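The idea of a single RDF-like structure over diverse formats can be sketched in a few lines of Python. This is only an illustration of the concept, not SPARQL Anything's implementation: the `root` node name, the `xyz:` prefix for keys, the `rdf:_N` slot naming and all function names are assumptions made for this sketch. It flattens nested containers into (subject, predicate, object) triples, so that a JSON array of objects and a CSV table with the same content yield the same structure.

```python
import csv
import io
import json

ROOT = "root"  # hypothetical name for the top-level container


def flatten(value, node, triples, counter):
    """Recursively turn nested dicts/lists into (subject, predicate, object) triples."""
    if isinstance(value, dict):
        # Named slots: one predicate per key (illustrative "xyz:" prefix).
        items = [(f"xyz:{key}", child) for key, child in value.items()]
    else:
        # Ordered slots: rdf:_1, rdf:_2, ... in the style of RDF containers.
        items = [(f"rdf:_{i}", child) for i, child in enumerate(value, start=1)]
    for predicate, child in items:
        if isinstance(child, (dict, list)):
            counter[0] += 1
            child_node = f"_:b{counter[0]}"  # fresh blank-node label for a nested container
            triples.append((node, predicate, child_node))
            flatten(child, child_node, triples, counter)
        else:
            triples.append((node, predicate, child))
    return triples


def json_to_facade(text):
    """Facade for a JSON document whose top level is an object or an array."""
    return flatten(json.loads(text), ROOT, [], [0])


def csv_to_facade(text):
    """Facade for a CSV table with a header row: a list of rows, each a keyed container."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return flatten(rows, ROOT, [], [0])


if __name__ == "__main__":
    print(json_to_facade('[{"title": "A"}]'))
    print(csv_to_facade("title\nA\n"))
```

Both calls in the usage example produce the same list of triples, which is the point of the facade: the consumer sees one data structure regardless of the source format, and no domain semantics has been added yet.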
17. Current features v0.3.0 (some not in the paper!)
• XML, JSON, CSV, HTML, Excel, Text, Binary, EXIF, File System, Archives
• Query templates / parameter queries (BASIL variables)
• Fully customisable HTTP requests, with authentication
• Support for pagination
• Helper functions for sequences: fx:anySlot, fx:before, fx:after, …
• Mix and nest SERVICE clauses (thanks to SPARQL)
• Use SPARQL Results Sets as input for parametric queries
• Combine multiple SPARQL queries in pipelines
• 100% open source, Apache Licence 2.0
• Implemented on top of Apache Jena ARQ
https://sparql-anything.cc/
19. Evaluation
• Qualitative discussion of solutions vs requirements (see the paper for details)
• Quantitative analysis of the cognitive complexity
• Quantitative analysis of performance, to assess practicability
• Experiments with real-world open data in the cultural heritage domain
• RML (Java RML Mapper): composing format specific languages in
declarative mappings
• SPARQL Generate: composing format specific languages extending
SPARQL
• SPARQL Anything: naive, in-memory implementation
• https://github.com/SPARQL-Anything/sparql.anything/tree/main/experiment
20. Transform several sources having heterogeneous formats
• + Spreadsheet
• + File system
• + Archives
• + EXIF metadata
Embedding binary data in literals with base64 encoding
Full fledged HTTP client to query Web APIs
WARN! No support for RDB yet
21. Low learning demands:
• No new language has to be learned as data can be queried using SPARQL 1.1
Meaningful abstraction:
• No need to know the technicalities of source formats
• No need to know JsonPath, XPath, …
• Resources can be accessed as if they were RDF
Explorability:
• With SPARQL Generate and RML, the user needs to commit to a particular mapping or
transformation of the source data into RDF.
• Facade-X enables the user to avoid prematurely committing to a mapping (Observe!).
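A query in this style might look like the sketch below, wrapped in a small Python helper so the source location can be swapped. Several details are assumptions not stated on the slides: the `x-sparql-anything:` IRI scheme with a `location=` option and the `xyz:` namespace follow our recollection of the project documentation and should be checked against the version in use; `artworks.json`, the `title` key and the `ex:` namespace are hypothetical.

```python
def facade_x_query(location: str) -> str:
    """Build a plain SPARQL 1.1 CONSTRUCT query over a non-RDF resource.

    The only unusual part is the SERVICE IRI, which (by assumption) tells the
    engine to expose the resource at `location` as RDF.
    """
    return f"""
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX ex:  <http://example.org/>
PREFIX dct: <http://purl.org/dc/terms/>

CONSTRUCT {{
  ?artwork a ex:Artwork ;
           dct:title ?title .
}}
WHERE {{
  SERVICE <x-sparql-anything:location={location}> {{
    ?artwork xyz:title ?title .
  }}
}}
""".strip()


if __name__ == "__main__":
    print(facade_x_query("artworks.json"))
```

The pattern inside SERVICE only observes the source structure, while the CONSTRUCT template projects the domain semantics. Changing the target vocabulary means editing the template, not learning a format-specific path language.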
22. Low (cognitive) complexity
• One measure of complexity is the number of
(distinct) items or variables (Halford et al. 2004;
Warren et al. 2015).
• 8 competency questions (CQs) (vs SPARQL Generate)
• What are the titles of the artworks attributed to “ANONIMO”?
• What are the titles of the artworks created in 1935?
• …
• 4 transformations (vs RML and SPARQL Generate)
• Avg distinct tokens:
• SPARQL Anything: ~18
• SPARQL Generate: ~25 (∼39.72% more)
• RML: ~45 (∼150% more)
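As a rough illustration of the metric, the sketch below counts distinct tokens in a query and computes a relative increase. The tokenisation rule (variables, prefixed names, words, and single punctuation marks) is an assumption made here and may differ from the one used in the paper. The averages on the slide are rounded, so recomputing percentages from them only approximates the reported figures.

```python
import re

# Assumed tokenisation: optional ?/$ sigil, then a word that may continue
# with ":name" parts (prefixed names), or any single punctuation character.
TOKEN = re.compile(r"[?$]?\w+(?::\w+)*|[^\w\s]")


def distinct_tokens(source: str) -> set:
    """Return the set of distinct tokens in a query or mapping."""
    return set(TOKEN.findall(source))


def relative_increase(baseline: float, other: float) -> float:
    """Percentage by which `other` exceeds `baseline`."""
    return (other - baseline) / baseline * 100


if __name__ == "__main__":
    query = "SELECT ?t WHERE { ?s xyz:title ?t }"
    print(len(distinct_tokens(query)))  # ?t appears twice but counts once
    print(relative_increase(18, 45))    # rounded RML average vs SPARQL Anything
```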
23. Practicable and sustainable
• Quantitative analysis of performance, to assess practicability
• In-Memory implementation (Naive)
• Execution time of q1-q12
• AVG on 10 executions
• Comparable to RML Mapper and SPARQL Generate on files
up to 1M JSON objects (~5M triples)
• In-Memory implementation scales linearly
• The approach is practicable
• Research on performance as future work
• Lines of Java code to maintain: SPARQL Generate 12280
(core module); RML Mapper 7951; SPARQL Anything: 3842
(all transformers) — v0.2.0 (v0.3.0 has ~11k)
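Averaging execution time over repeated runs can be sketched as follows. This is a generic timing harness written for illustration, not the project's experiment code; `average_runtime` and its parameters are hypothetical names.

```python
import time
from statistics import mean


def average_runtime(task, runs=10):
    """Run `task` `runs` times and return the mean wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return mean(durations)


if __name__ == "__main__":
    # Stand-in workload; in an experiment this would execute one query end to end.
    print(average_runtime(lambda: sum(range(100_000))))
```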
24. Integrates with a typical SW workflow
• While we cannot assume that Semantic Web experts have knowledge of RML,
XPath, and SPARQL Generate, we can definitely expect knowledge of SPARQL
Adaptable and extendable
• Facade-X can be extended, if needed (although we didn’t need to so far)
• SPARQL Anything is easy to extend to more formats:
• New transformers just need to produce the facade
• No major changes to the user experience (new format-specific options)
• Changes to the adapters don’t require changing the user-facing code
• In contrast to RML / SPARQL Generate which need to extend the user-facing
toolkit
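The extension argument can be illustrated with a small plug-in registry. This is a sketch under assumptions, not the actual architecture of SPARQL Anything: the registry, the decorator, and both toy transformers are hypothetical. The point it shows is that the user-facing function never changes when a new format is added, because every transformer emits the same triple structure.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Transformer = Callable[[str], List[Triple]]

# Hypothetical registry: file extension -> transformer producing the facade.
TRANSFORMERS: Dict[str, Transformer] = {}


def register(extension: str):
    """Decorator that plugs a transformer in for a file extension."""
    def decorator(fn: Transformer) -> Transformer:
        TRANSFORMERS[extension] = fn
        return fn
    return decorator


@register("txt")
def text_lines(content: str) -> List[Triple]:
    """Expose each line of a text file as an ordered slot of the root."""
    return [("root", f"rdf:_{i}", line)
            for i, line in enumerate(content.splitlines(), start=1)]


@register("commas")
def comma_separated(content: str) -> List[Triple]:
    """A newly added toy format: it only has to emit the same structure."""
    return [("root", f"rdf:_{i}", item.strip())
            for i, item in enumerate(content.split(","), start=1)]


def select_values(filename: str, content: str, predicate: str) -> List[str]:
    """User-facing lookup: format-agnostic, untouched when formats are added."""
    extension = filename.rsplit(".", 1)[-1].lower()
    triples = TRANSFORMERS[extension](content)
    return [obj for (_, pred, obj) in triples if pred == predicate]


if __name__ == "__main__":
    print(select_values("notes.txt", "first\nsecond", "rdf:_2"))
    print(select_values("tags.commas", "jazz, opera", "rdf:_1"))
```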
25. Future work
• Study formal properties of Facade-X, e.g. prove generality, semantics of meta-model mappings, …
• User study comparing Mapping Language vs Facade based approach
• Methodology for developers to connect new data sources (e.g. GraphViz DOT, MIDI, …)
• Current approach is limited to serialised resources (aka files)
• In-memory transformation before query execution
• A triple filtering strategy can reduce memory requirements significantly
• Study strategies to cope with very large files (e.g. slicing)
• Study query-rewriting strategies, eventually rewriting mappings into efficient, iterator-based
transformers (mapping translation [Corcho 2020])
• Relational Database, No-SQL (e.g. mongoDB)
• Reuse existing approaches (e.g. Ontop) but hide complexity from the user - avoid the need for
configuration.
26. Get in touch!
SPARQL Anything is under active development!
https://sparql-anything.cc
GitHub: https://github.com/SPARQL-Anything/sparql.anything
enrico.daga@open.ac.uk
@enridaga
www.enridaga.net
27. References
• Daga, E., Asprino, L., Mulholland, P., Gangemi, A.: Facade-X: an opinionated approach to SPARQL Anything. In: SEMANTiCS 2021: 17th International Conference on Semantic Systems (2021)
• Atkin, M., Deely, T., Scharffe, F.: Knowledge Graph Benchmarking Report 2021 (version 2.0). Zenodo, http://doi.org/10.5281/zenodo.4950097 (June 2021)
• Lassila, O., Schmidt, M., Bebee, B., Bechberger, D., Broekema, W., Khandelwal, A., Lawrence, K., Sharda, R., Thompson, B.: Graph? Yes! Which one? Help! In: 1st Squaring the circle on knowledge graphs workshop, Semantics (2021)
• Daga, E., Meroño-Peñuela, A., Motta, E.: Sequential linked data: the state of affairs. Semantic Web (2021)
• Warren, P., Mulholland, P.: Using SPARQL: the practitioners’ viewpoint. In: European Knowledge Acquisition Workshop. pp. 485–500. Springer (2018)
• Corcho, O., Priyatna, F., Chaves-Fraga, D.: Towards a new generation of ontology based data access. Semantic Web 11(1), 153–160 (2020)
• Michel, F., Faron-Zucker, C., Corby, O., Gandon, F.: Enabling automatic discovery and querying of web APIs at web scale using linked data standards. In: Companion Proceedings of The 2019 World Wide Web Conference. pp. 883–892 (2019)
• Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: RML: a generic language for integrated RDF mappings of heterogeneous data. In: 7th Workshop on Linked Data on the Web (2014)
• García-González, H., Boneva, I., Staworko, S., Labra-Gayo, J.E., Lovelle, J.M.C.: ShExML: improving the usability of heterogeneous data mapping languages for first-time users. PeerJ Computer Science 6, e318 (2020)
• Ko, A.J., Abraham, R., Beckwith, L., Blackwell, A., Burnett, M., Erwig, M., Scaffidi, C., Lawrance, J., Lieberman, H., Myers, B., et al.: The state of the art in end-user software engineering. ACM Computing Surveys (CSUR) 43(3), 1–44 (2011)
• Lefrançois, M., Zimmermann, A., Bakerally, N.: A SPARQL extension for generating RDF from heterogeneous formats. In: European Semantic Web Conference. pp. 35–50. Springer (2017)
• Lieberman, H., Paternò, F., Klann, M., Wulf, V.: End-user development: An emerging paradigm. In: End user development, pp. 1–8. Springer (2006)
• Cyganiak, R.: Tarql (SPARQL for tables): Turn CSV into RDF using SPARQL syntax. Technical Report, http://tarql.github.io (2015)