The SPARQL Anything project

The SPARQL Anything project
Enrico Daga and Luigi Asprino
The Web Conference - Developers Track
22/04/2021 - online @enridaga

Background
• Semantic Web developers always concerned with methods to
“lift” legacy content to RDF:
• Targeting specific types/formats: SPARQL Microservices
[Michel, 2019], Tarql, Any23, JSON2RDF, CSV2RDF
• Mapping languages, several types of (e.g. RML,
ShexML): high learning demands. [Dimou, 2014]
[García-González, 2020]
• SPARQL Generate: learning demands, difficult to extend
to other formats. [Lefrançois, 2017]
• Solutions transfer data source complexity to the user (e.g.
know XPath for XML, JsonPath for JSON, …)
• End-user development [Lieberman, 2006]. Many SPARQL
users fall into the category of end-user developer. In a recent
survey, 42% SPARQL users are from non-IT areas,
including social sciences and the humanities, business and
economics, and biomedical, engineering or physical sciences.

SPICE
Social Cohesion, Participation and Inclusion
through Cultural Engagement
Polifonia
Digital Harmoniser of Musical Cultural Heritage
-
Cultural Heritage Knowledge Graphs
-
Sources in different formats
x
Multiple / unknown ontologies
=
Duplication of effort!!!
https://spice.kmi.open.ac.uk/
http://spice-h2020.eu
https://polifonia-project.eu/
This project has received funding from the European
Union’s Horizon 2020 research and innovation
programme

Knowledge Graph Construction
Composite process:
• Observe: the data source (e.g. a CSV file)
• Map: develop mappings to a target ontology
• Triplify: run the mappings and evaluate the result
• (many iterations)
KG construction is a twofold job:
• perform a syntax/structure conversion (e.g. from CSV to RDF)
• project semantics onto the data (applying a domain ontology)

Concept
… twofold job:
• perform a syntax/structure conversion -> Re-engineering
• We want to solve this problem once and for all
• project semantics onto the data (applying a domain ontology) -> Re-modelling
• We leave this to the end user, powered by SPARQL 1.1
• Approach: design a single RDF facade for any data format
• Re-engineering
• Focus on the syntax and the meta-model (data structure)
• Leave data as much as possible as-it-is!
• apply the least possible “ontological commitment”
https://en.wikipedia.org/wiki/Facade_pattern

An RDF Facade?
Problem Space
• CSV
• JSON
• HTML
• XML
• Binary (JPEG, PNG, …)
• Text
Solution Space
• https://www.w3.org/TR/rdf11-concepts/
• https://www.w3.org/TR/rdf-schema/
rdf:type, rdf:Property, rdfs:label,
rdfs:Resource, rdfs:Class, rdf:Bag,
rdfs:Container, rdf:List, RDF Dataset,
Graph, …
Facade-X: (to be filled by picking and mixing from the solution space)
Ups! We are facing the same old problem … only this time we don’t care about the content
(domain) and we only focus on the format and data structure (meta-model)

CSV
Facade: http://sparql.xyz/facade-x/ns/
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix fx: <http://sparql.xyz/facade-x/ns/>.
@prefix xyz: <http://sparql.xyz/facade-x/data/>.
rdf:Property a rdfs:Class .
rdfs:ContainerMembershipProperty
rdfs:subClassOf rdf:Property .
fx:Root a rdfs:Class .
id,name,gender,dates,yearOfBirth,yearOfDeath,placeOfBirth,placeOfDeath,url
10093,"Abakanowicz, Magdalena",Female,born 1930,1930,,Polska,,http://www.tate.org.uk/art/artists/magdalena-abakanowicz-10093
…
https://github.com/tategallery/collection/blob/master/artist_data.csv
[ a fx:root ;
rdf:_1 [ xyz:dates "born 1930" ;
xyz:gender "Female" ;
xyz:id "10093" ;
xyz:name "Abakanowicz, Magdalena" ;
xyz:placeOfBirth "Polska" ;
xyz:placeOfDeath "" ;
xyz:url "http://www.tate.org.uk/art/artists/magdalena-
abakanowicz-10093" ;
xyz:yearOfBirth "1930" ;
xyz:yearOfDeath ""
] ;
csv.headers=true|false
[ a fx:root ;
rdf:_1 [ rdf:_1 "id" ;
rdf:_2 "name" ;
rdf:_3 "gender" ;
rdf:_4 "dates" ;
rdf:_5 "yearOfBirth" ;
rdf:_6 "yearOfDeath" ;
rdf:_7 "placeOfBirth" ;
rdf:_8 “placeOfDeath" ;
rdf:_9 "url"
] ;
CSV
JSON
HTML
XML
Binary (JPEG, PNG, …)
Text
@enridaga

JSON
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
xsd:int a rdfs:Datatype.
xsd:string a rdfs:Datatype.
xsd:boolean a rdfs:Datatype.
xsd:decimal a rdfs:Datatype.
xsd:float a rdfs:Datatype.
xsd:double a rdfs:Datatype.
https://github.com/tategallery/collection/artworks/t/023/t02319-9205.json
[ a fx:root ;
xyz:acno "T02319" ;
xyz:acquisitionYear "1978"^^<http://www.w3.org/2001/XMLSchema#int> ;
xyz:all_artists "Kazimir Malevich" ;
xyz:catalogueGroup [] ;
xyz:classification "painting" ;
xyz:contributorCount "1"^^<http://www.w3.org/2001/XMLSchema#int> ;
…
{
"acno": "T02319",
"acquisitionYear": 1978,
"all_artists": "Kazimir Malevich",
"catalogueGroup": {},
"classification": "painting",
"contributorCount": 1,
"contributors": [
{
CSV
JSON
HTML
XML
Text

DOM (HTML, XML, …)
rdf:type rdf:type rdf:Property
https://imma.ie/artists/
[ a fx:root , xhtml:div ;
xhtml:id “az-group” ;
rdf:_1 [ a xhtml:div ;
rdf:_1 [ a xhtml:h4 ;
rdf:_1 "A" ;
<https://html.spec.whatwg.org/#innerHTML>
"A" ;
<https://html.spec.whatwg.org/#innerText>
"A"
] ;
…
html.selector=#az-group
@prefix xhtml: <http://www.w3.org/1999/xhtml#> .
CSV
JSON
HTML
XML
Text

Binary and Text
xsd:base64Binary a rdfs:Datatype.
rdf:type df:type rdf:Property
[ <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1> “/9j/
4AAQSkZJRgABAQEASABIAAD/
4QmsRXhpZgAASUkqAAgAAAALAA8BAgAGAAAAkgAAABABAgAOAAAAmAAAABIBAw
ABAAAAAQAAABoBBQABAAAApgAAABsBBQABAAAArgAAACgBAwABAAAAAgAAADEB
AgALAAAAtgAAADIBAgAUAAAAwgAAABMCAwABAAAAAgAAAGmHBAABAAAA1gAAAC
WIBAABAAAA0gMAAOQDAABDYW5vbgBDYW5vbiBFT1MgNDBEAEgAAAABAAAASAAA
AAEAAABHSU1QIDIuNC41AAAyMDA4OjA3OjMxIDEwOjM4OjExAB4Am…”^^<http
://www.w3.org/2001/XMLSchema#base64Binary> ] .
bin.encoding # BASE64
txt.regex # tokenise into a sequence
CSV
JSON
HTML
XML
Text
https://imma.ie/collection/freeing-the-voice/
Hello World! [ <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1> "Hello World!" ] .

https://
github.com/
SPARQL-
Anything/
showcase-tate
Assumption: SPARQL
1.1 CONSTRUCT
queries will be enough
to design mappings (the
re-modelling phase)

https://github.com/
SPARQL-Anything/
showcase-imma
https://imma.ie/collection/freeing-the-memory/

Preliminary feedback
• From 27 users, diverse SPARQL expertise
• Essential or very important
• the system should minimise the languages or syntaxes needed
• mappings should be easy to read and interpret
• the system must be easy to learn for a Semantic Web practitioner
• the system is able to support new types of data sources without changes to the mapping language
• How easy is this code to understand
(comparing equivalent mappings)?
• (a) RML
• (b) SPARQL Generate
• (c) SPARQL Anything

Benefits
• Transform / Query resources having heterogeneous formats
• Low learning demands (plain SPARQL 1.1)
• Minimise complexity of the mappings
• A single+consistent abstraction for any data format
• Enable data exploration in the absence of a domain ontology
• Integrate with a typical Semantic Web engineering workflow
• Flexible and adaptable (Facade-X can be extended, if needed)
• Easy to extend:
• new transformers just need to return the facade
• no major changes to the user experience

Challenges
• No commitment on the internal machinery! (It is a gift and a curse …)
• Current version v0.1.1 (we started Nov 2020):
• implemented on top of Apache Jena ARQ
• limited to files
• loads the triples in-memory and then performs the query
• A triple filtering strategy reduces in-memory dataset
• Very large files require very large memory
• Next: to develop strategies to cope with large resources (e.g. slicing)
• Next: to develop query-rewriting strategies, eventually rewriting mappings into efficient,
iterator-based transformers (mapping translation [Corcho 2020])
• Next: Relational Database, No-SQL (e.g. mongoDB)
• Reuse existing approaches (e.g. OBDA) but hide complexity to the user

Get in touch!
SPARQL Anything is under active development
https://github.com/SPARQL-Anything/sparql.anything
enrico.daga@open.ac.uk
@enridaga
www.enridaga.net

References
• Daga, E., Asprino, L., Mulholland, P., Gangemi, A.: Facade-x: an opinionated approach to sparql anything (submitted). In: SEMANTiCS 2021:
17th International Conference on Semantic Systems (2021)
• Daga, E., Meroño-Peñuela, A., Motta, E.: Sequential linked data: the state of affairs. Semantic Web (2021)
• Warren, P., Mulholland, P.: Using sparql–the practitioners’ viewpoint. In: European Knowledge Acquisition Workshop. pp. 485–500. Springer
(2018)
• Corcho, O., Priyatna, F., Chaves-Fraga, D.: Towards a new generation of ontology based data access. Semantic Web 11(1), 153–160 (2020)
• Michel, F., Faron-Zucker, C., Corby, O., Gandon, F.: Enabling automatic discovery and querying of web apis at web scale using linked data
standards. In: Companion Proceedings of The 2019 World Wide Web Conference. pp. 883–892 (2019)
• Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: Rml: a generic language for integrated rdf mappings
of heterogeneous data. In: 7th Workshop on Linked Data on the Web (2014)
• García-González, H., Boneva, I., Staworko, S., Labra-Gayo, J.E., Lovelle, J.M.C.: Shexml: improving the usability of heterogeneous data
mapping languages for firsttime users. PeerJ Computer Science 6, e318 (2020)
• Ko, A.J., Abraham, R., Beckwith, L., Blackwell, A., Burnett, M., Erwig, M., Scaffidi, C., Lawrance, J., Lieberman, H., Myers, B., et al.: The state
of the art in enduser software engineering. ACM Computing Surveys (CSUR) 43(3), 1–44 (2011)
• Lefrançois, M., Zimmermann, A., Bakerally, N.: A sparql extension for generating rdf from heterogeneous formats. In: European Semantic Web
Conference. pp. 35– 50. Springer (2017)
• Lieberman, H., Paternò, F., Klann, M., Wulf, V.: End-user development: An emerging paradigm. In: End user development, pp. 1–8. Springer
(2006)
• Cyganiak, Richard. Tarql (sparql for tables): Turn csv into rdf using sparql syntax. Technical Report, 2015. http://tarql. github. io, 2015.

The SPARQL Anything project

Recommended

Recommended

More Related Content

What's hot

What's hot (12)

Similar to The SPARQL Anything project

Similar to The SPARQL Anything project (20)

More from Enrico Daga

More from Enrico Daga (16)

Recently uploaded

Recently uploaded (20)

The SPARQL Anything project