Optimizing SPARQL Queries with SHACL

SIRIUS SEMINAR
Ratan Bahadur Thapa
PhD candidate at SIRIUS (IFI)
University of Oslo
October 19, 2023

About
Full paper: https://www.duo.uio.no/handle/10852/103167

RDF
▶ Standard for web data
▶ W3C Rec. since 1999
▶ RDF 1.0, 2004 https://www.w3.org/TR/rdf-primer/
▶ RDF 1.1, 2014 https://www.w3.org/TR/rdf11-concepts/
▶ W3C working draft for RDF 1.2, 2023 https://www.w3.org/TR/rdf12-concepts/

RDF Syntax
▶ IRIs to reference resources on the web
▶ Statements as nodes and arcs in a graph, in the form of triples "(Subject, Predicate, Object)". E.g.,
"Mona Lisa has a creator whose value is Leonardo da Vinci"
Subject: https://en.wikipedia.org/wiki/Mona_Lisa
Predicate: http://purl.org/dc/terms/creator
Object: https://en.wikipedia.org/wiki/Leonardo_da_Vinci

RDF Graph
▶ Composed of triples ”(Subject, Predicate, Object)”
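
A minimal Python sketch of the triple model, assuming IRIs are modelled as plain strings and a graph as a set of 3-tuples (a simplification: real RDF terms also include blank nodes and literals). The helper name `objects` is illustrative only.

```python
# A triple is (subject, predicate, object); IRIs are modelled as plain strings.
MONA_LISA = "https://en.wikipedia.org/wiki/Mona_Lisa"
CREATOR = "http://purl.org/dc/terms/creator"
DA_VINCI = "https://en.wikipedia.org/wiki/Leonardo_da_Vinci"

# An RDF graph is a set of triples, so a repeated statement collapses.
graph = {(MONA_LISA, CREATOR, DA_VINCI)}
graph.add((MONA_LISA, CREATOR, DA_VINCI))  # adding the same triple again changes nothing


def objects(graph, subject, predicate):
    """Return every o such that (subject, predicate, o) is in the graph."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}
```

Calling `objects(graph, MONA_LISA, CREATOR)` follows the single arc of the example from the painting to its creator.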

RDF: Syntactic shortcuts
Turtle Syntax:
BASE <http://example.org>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: . . .
PREFIX wd: . . .
<bob#me>
  a foaf:Person;
  foaf:knows <alice#me>;
  schema:birthDate "1990-07-04"^^xsd:date;
  foaf:topic_interest wd:Q12418.
wd:Q12418 dcterms:title "Mona Lisa";
. . .

RDF: Constraints?
W3C defines RDF as an "assertional logic," where each triple
expresses a simple proposition.
▶ This logical framework imposes a strict monotonic discipline
on the language, preventing the expression of closed-world
assumptions, local default preferences, and other commonly
used non-monotonic constructs.

SHACL
▶ Constraint language for RDF
▶ W3C Rec. since July 2017
Other constraint languages:
▶ SPIN - SPARQL Syntax, (2009) 2011
https://www.w3.org/submissions/2011/SUBM-spin-sparql-20110222/
▶ IBM Resource Shape 2.0, 2014 https://www.w3.org/submissions/shapes/
▶ Shape Expressions Language 2.0, 2017, http://shex.io/shex-semantics-20170713/

SHACL
▶ relies on the notion of "shapes"
e.g.,
:EmployeeNode a sh:NodeShape;
sh:targetClass :Employee;
sh:property [ sh:path :hasAddress;
  sh:nodeKind sh:Literal;
  sh:maxCount 1; sh:minCount 1;
  sh:datatype xsd:string;
  dash:uniqueValueForClass :Employee ].

SHACL Shape
▶ relies on the notion of "shapes"
e.g.,
:EmployeeNode a sh:NodeShape ;        # shape name
sh:targetClass :Employee ;            # target definition
sh:property [ sh:path :hasAddress ;   # constraint definition
  sh:nodeKind sh:Literal ;
  sh:maxCount 1;
  sh:minCount 1;
  dash:uniqueValueForClass :Employee ].

SHACL: Constraint Validation
Consider the RDF graph below, validated against the :EmployeeNode shape shown earlier; both are written in Turtle syntax:
:Ida a :Employee;
:hasID "001"^^xsd:int;
:hasAddress "Oslo".
:Ingrid a :Employee;
:hasAddress "Bergen".

Acquiring target nodes:
:Ida a :Employee;
:hasAddress "Oslo".

Checking compliance of target nodes against constraints: VALID
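
The two validation steps can be sketched in Python as follows. This is a simplified illustration, not a SHACL processor: it only covers sh:targetClass, sh:minCount, sh:maxCount and sh:nodeKind sh:Literal, and it ignores dash:uniqueValueForClass and datatypes. The `Literal` marker class and the function names are assumptions made for this example.

```python
RDF_TYPE = "rdf:type"


class Literal(str):
    """Marks a value as an RDF literal rather than an IRI (illustrative only)."""


# The example data graph as (subject, predicate, object) triples.
graph = {
    (":Ida", RDF_TYPE, ":Employee"),
    (":Ida", ":hasID", Literal("001")),
    (":Ida", ":hasAddress", Literal("Oslo")),
    (":Ingrid", RDF_TYPE, ":Employee"),
    (":Ingrid", ":hasAddress", Literal("Bergen")),
}


def target_nodes(graph, target_class):
    """Step 1: acquire target nodes, i.e. all instances of the target class."""
    return {s for (s, p, o) in graph if p == RDF_TYPE and o == target_class}


def conforms(graph, node, path, min_count, max_count):
    """Step 2: check the cardinality and literal-ness of the values of path."""
    values = [o for (s, p, o) in graph if s == node and p == path]
    return (min_count <= len(values) <= max_count
            and all(isinstance(v, Literal) for v in values))


targets = target_nodes(graph, ":Employee")
valid = all(conforms(graph, n, ":hasAddress", 1, 1) for n in targets)
```

Under these assumptions both :Ida and :Ingrid are target nodes, and each has exactly one literal address, which matches the VALID outcome on the slide.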

SHACL: Propagated Constraint Validation
. . .
sh:node :AddressNode ].
:AddressNode a sh:NodeShape;
sh:property [ sh:path :telephone;
  sh:maxCount 1 ];
sh:property [ sh:path :locatedIn;
  sh:hasValue :NorthernNorway ].

SHACL: Propagated Constraint - ”Recursion”?
. . .
sh:node :AddressNode ];
sh:property [ sh:path :knows;
  sh:minCount 1;
  sh:node :EmployeeNode ].

SHACL: Propagated Constraint - ”Recursion”?
From https://www.w3.org/TR/shacl/, on recursion:
▶ Not 100% formal semantics
▶ Validation explicitly left undefined

SHACL: Propagated Constraint - ”Recursion”
Contributions:
▶ Abstract syntax of SHACL core
▶ Semantics for recursive SHACL
▶ Validation algorithms and tractable fragments

SHACL: Abstract Syntax
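Let $\mathbf{S}$, $\mathbf{C}$ and $\mathbf{P}$ be countably infinite and mutually disjoint sets of shape, class and property names.
Shape target $\tau_s$ and constraint $\phi_s$ are expressions defined by the grammar
$\tau_s := \texttt{sh:targetClass}\ C \mid \texttt{sh:targetSubjectOf}\ P \mid \texttt{sh:targetObjectOf}\ P$
$\phi_s := {\geq_n}\,\alpha.\beta \mid {\leq_n}\,\alpha.\beta \mid \triangleright_{\tau_s}\,\alpha \mid \alpha_1 = \alpha_2 \mid \phi_s \wedge \phi_s$
$\beta := \top \mid C \mid s' \mid \neg\beta$
where $\alpha, \alpha_1, \alpha_2 \in \mathbf{P} \cup \{P^- \mid P \in \mathbf{P}\}$, $C \in \mathbf{C}$ and $s, s' \in \mathbf{S}$.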

SHACL: Abstract Syntax
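Shape target $\tau_s$ and constraint $\phi_s$ are expressions defined by the grammar:
$\tau_s := \texttt{sh:targetClass}\ C \mid \texttt{sh:targetSubjectOf}\ P \mid \texttt{sh:targetObjectOf}\ P$
$\tau_s := C \mid P \mid P^-$ (i.e., short syntax)
$\beta := \top \mid C \mid s' \mid \neg\beta$
A shape in abstract syntax:
$\langle \mathit{Employee}, \tau_{\mathit{Employee}}, \phi_{\mathit{Employee}} \rangle$ with $\tau_{\mathit{Employee}} = \texttt{:Employee}$ and
$\phi_{\mathit{Employee}} = (=_1 \mathit{hasAddress}.\top) \wedge (\triangleright_{\tau_{\mathit{Employee}}}\,\mathit{hasAddress})$.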

Once the context is clear, we simply write:
$\langle \mathit{Employee}, \texttt{:Employee}, (=_1 \mathit{hasAddress}.\top) \wedge (\triangleright_{\tau_{\mathit{Employee}}}\,\mathit{hasAddress}) \rangle$
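
A hedged Python sketch of how the abstract Employee shape could be represented and checked. It assumes that the uniqueness constraint (▷) has the same reading as dash:uniqueValueForClass, namely that no two distinct target nodes share a value of the property; the `Shape` class, its field names and `validates` are hypothetical and cover only this one shape form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    name: str               # shape name s
    target_class: str       # target tau_s, restricted here to the class case
    exactly_one: str        # property P of the constraint (=1 P.T)
    unique_for_target: str  # property alpha of the uniqueness constraint


def values(graph, node, prop):
    return {o for (s, p, o) in graph if s == node and p == prop}


def validates(graph, shape):
    """Check every target node against both conjuncts of the constraint."""
    targets = {s for (s, p, o) in graph
               if p == "rdf:type" and o == shape.target_class}
    owner = {}  # value -> the target node that first used it
    for node in sorted(targets):
        if len(values(graph, node, shape.exactly_one)) != 1:
            return False  # violates (=1 P.T)
        for v in values(graph, node, shape.unique_for_target):
            if owner.setdefault(v, node) != node:
                return False  # value shared with another target node
    return True


employee = Shape("Employee", ":Employee", ":hasAddress", ":hasAddress")
graph = {
    (":Ida", "rdf:type", ":Employee"), (":Ida", ":hasAddress", "Oslo"),
    (":Ingrid", "rdf:type", ":Employee"), (":Ingrid", ":hasAddress", "Bergen"),
}
```

Under these assumptions the two-employee graph validates, while adding a third employee with the address "Oslo", or one with no address at all, does not.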

SPARQL
▶ Query language for RDF
▶ W3C Rec. since January 2008
▶ SPARQL 1.1, 2013 https://www.w3.org/TR/sparql11-query/
▶ W3C working draft for SPARQL 1.2, 2023 https://www.w3.org/TR/sparql12-update/
▶ W3C community draft for RDF∗ and SPARQL∗, 2021 https://www.w3.org/2021/12/rdf-star.html

SPARQL: Query Variables?
▶ Queries need variables, and SPARQL variables are bound to RDF terms
▶ E.g., ?title, ?author, ?published
▶ As in SQL, variables are queried via a SELECT statement
▶ E.g., SELECT ?title ?author ?published
A SELECT statement returns the query result as a table:
?title               ?author                  ?published
Games of no chance   Richard J. Nowakowski    1999
Calculated Bets      Steven S. Skiena         2001
▶ Bag semantics

SPARQL: Evaluation?
▶ Consider a SPARQL query:
SELECT DISTINCT ?article ?author ?affiliation
WHERE {
?article rdf:type :Article;
dc:creator ?author .
?author dc:affiliated ?affiliation .
FILTER (contains (?affiliation, "University of Oslo"))
}

SPARQL: Bottom-up
Basic Graph Pattern (BGP) Matching
?article rdf:type :Article; dc:creator ?author .
?author dc:affiliated ?affiliation .
Intermediate Operators: FILTER
contains (?affiliation, "University of Oslo")
Intermediate Operators: PROJECTION
?article ?author ?affiliation
Intermediate Operators: DISTINCT
Final Query Result
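
The bottom-up pipeline can be sketched in Python over a tiny in-memory graph. This is a naive illustration under stated assumptions: the data (`:a1`, `:ann`, `:bob`) is invented for the example, terms are plain strings, and `match_pattern` and `eval_bgp` are hypothetical helper names, not part of any SPARQL engine.

```python
def match_pattern(graph, pattern, binding):
    """Extend one binding with every triple of the graph matching the pattern."""
    for triple in graph:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must agree with any value it is already bound to.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def eval_bgp(graph, patterns):
    """BGP matching: join the triple patterns one at a time."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [m for b in solutions for m in match_pattern(graph, pattern, b)]
    return solutions


graph = {
    (":a1", "rdf:type", ":Article"),
    (":a1", "dc:creator", ":ann"),
    (":a1", "dc:creator", ":bob"),
    (":ann", "dc:affiliated", "University of Oslo"),
    (":bob", "dc:affiliated", "University of Bergen"),
}
bgp = [("?article", "rdf:type", ":Article"),
       ("?article", "dc:creator", "?author"),
       ("?author", "dc:affiliated", "?affiliation")]

solutions = eval_bgp(graph, bgp)                                    # BGP matching
solutions = [m for m in solutions
             if "University of Oslo" in m["?affiliation"]]          # FILTER
rows = [(m["?article"], m["?author"], m["?affiliation"])
        for m in solutions]                                         # PROJECTION
result = list(dict.fromkeys(rows))                                  # DISTINCT
```

Each stage consumes the output of the previous one, mirroring the order BGP matching, FILTER, PROJECTION, DISTINCT listed above.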

SPARQL Algebra
A SPARQL query is a graph pattern $P$ defined by the grammar
$P := B \mid \mathrm{Filter}_F(P) \mid \mathrm{Union}(P_1, P_2) \mid \mathrm{Join}(P_1, P_2) \mid \mathrm{Minus}(P_1, P_2) \mid \mathrm{Diff}_F(P_1, P_2) \mid \mathrm{Opt}_F(P_1, P_2) \mid \mathrm{Proj}_L(P) \mid \mathrm{Dist}(P)$
E.g., consider a nested SPARQL query that retrieves the names of employees and their office addresses:
SELECT ?y ?z WHERE { ?x :hasName ?y .
  { SELECT ?x ?z WHERE { ?x :hasOffice ?y . ?y :hasAddress ?z } } }
In SPARQL algebra:
$\mathrm{Proj}_{yz}(\mathrm{Join}(\mathit{hasName}(x, y), \mathrm{Proj}_{xz}(\mathrm{Join}(\mathit{hasOffice}(x, y), \mathit{hasAddress}(y, z)))))$

Upon simplification (where possible, though it is never required), we get:
$\mathrm{Proj}_{yz}(\mathit{hasName}(x, y)\ \mathit{hasOffice}(x, n)\ \mathit{hasAddress}(n, z))$.

SPARQL Algebra: some notions on query evaluation?
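The semantics of graph patterns is defined in terms of (solution) mappings, i.e., partial functions
$\mu : \mathbf{V} \to \mathbf{T}$ with (possibly empty) domain $\mathrm{dom}(\mu)$,
where $\mathbf{T} = \mathbf{I} \cup \mathbf{B} \cup \mathbf{L}$ is the set of RDF terms (IRIs, blank nodes and literals), and $\mathbf{V}$ is a countably infinite set of variables disjoint from $\mathbf{T}$.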

SPARQL Algebra: some notions on query evaluation?
Partial functions,
$\mu : \mathbf{V} \to \mathbf{T}$
Let
▶ $\mu|_{L}$ be the restriction of mapping $\mu$ to $L \subseteq \mathbf{V}$
▶ $\mu|_{\bar{L}}$ be the restriction of mapping $\mu$ to $\mathbf{V} \setminus L$
Evaluation of a SPARQL query $Q$ over an RDF graph $G$, denoted by $Q^G$, returns a multiset (i.e., bag) of mappings.
▶ $Q^G|_{\bar{X}}$ is the multiset of mappings $\mu \in Q^G$ restricted to $\mathbf{V} \setminus X$, i.e.,
$|\mu, Q^G|_{\bar{X}}| = \sum_{\mu = \mu'|_{\bar{X}}} |\mu', Q^G|$
▶ The support of the multiset $Q^G$, denoted by $\mathrm{sup}(Q^G)$, is
$\mathrm{sup}(Q^G) = \{\mu \mid |\mu, Q^G| > 0\}$
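
A small Python sketch of these two notions, assuming a bag of mappings is modelled as a `collections.Counter` whose keys are hashable mappings. The helper names `mapping`, `restrict_away` and `support` are illustrative, not taken from the talk.

```python
from collections import Counter


def mapping(**bindings):
    """A solution mapping as a hashable frozenset of (variable, term) pairs."""
    return frozenset(bindings.items())


def restrict_away(bag, dropped):
    """Restrict every mapping to the variables outside `dropped`.

    Mappings that coincide after the restriction have their multiplicities summed.
    """
    out = Counter()
    for mu, count in bag.items():
        out[frozenset((v, t) for v, t in mu if v not in dropped)] += count
    return out


def support(bag):
    """The set of mappings that occur with multiplicity greater than zero."""
    return {mu for mu, count in bag.items() if count > 0}


bag = Counter({mapping(x=":Ida", y="Oslo"): 2,
               mapping(x=":Ida", y="Bergen"): 1})
restricted = restrict_away(bag, {"y"})
```

Here the two mappings agree once `y` is dropped, so the restricted bag holds the single mapping for `x` with the summed multiplicity, and its support is a one-element set.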

Next
SPARQL query optimization with SHACL

Optimization: Problem Statement
Let $S$ be a set of SHACL shapes, and $Q$ a SPARQL query.
Our goal is to find optimal $S$-equivalent queries $Q'$ of the original query $Q$ s.t.
$Q \equiv_S Q'$ iff $\forall G.\ G \models S \Rightarrow Q^G = {Q'}^G$

Optimization: Rewriting Rules
Let $U$ and $V$ be two graph patterns, and $S$ a set of SHACL shapes.
$U \equiv_S V$ iff $\forall G.\ G \models S \Rightarrow U^G = V^G$
$U \equiv_{S,y} V$ iff $\forall G.\ G \models S \Rightarrow U^G|_{\bar{y}} = V^G|_{\bar{y}}$
$U \cong_{S,y} V$ iff $\forall G.\ G \models S \Rightarrow \mathrm{sup}(U^G|_{\bar{y}}) = \mathrm{sup}(V^G|_{\bar{y}})$
We then propose a set of query rewriting rules based on these equivalences that:
1. reduce OPTIONAL patterns to JOIN patterns
2. remove redundant JOIN patterns
3. eliminate the DISTINCT operator, etc.

An Example of Query Rewriting
Consider a SPARQL query,
$\mathrm{Dist}(\mathrm{Proj}_{xy}(\mathrm{Opt}_{\top}(\mathit{Employee}(x), \mathit{hasAddress}(x, y))))$
over graph $G$,
:Ida a :Employee;
:hasAddress "Oslo".
:Yacob a :Employee;
. . .
:Nils a :Employee;
. . .
. . .

Assume $G$ satisfies the shape
$\langle \mathit{Employee}, \texttt{:Employee}, (=_1 \mathit{hasAddress}.\top) \wedge (\triangleright_{\tau_{\mathit{Employee}}}\,\mathit{hasAddress}) \rangle$

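Consider the query over $G$,
$\mathrm{Dist}(\mathrm{Proj}_{xy}(\mathrm{Opt}_{\top}(\mathit{Employee}(x), \mathit{hasAddress}(x, y))))$.
▶ Since $G$ satisfies the conjunct $(=_1 \mathit{hasAddress}.\top)$ of $\phi_{\mathit{Employee}}$, the "Opt pattern" can be reduced to a "Join pattern":
$\mathrm{Dist}(\mathrm{Proj}_{xy}(\mathrm{Join}(\mathit{Employee}(x), \mathit{hasAddress}(x, y))))$.
▶ Since $G$ satisfies the conjunct $(\triangleright_{\tau_{\mathit{Employee}}}\,\mathit{hasAddress})$ of $\phi_{\mathit{Employee}}$, "Dist" can be removed:
$\mathrm{Proj}_{xy}(\mathrm{Join}(\mathit{Employee}(x), \mathit{hasAddress}(x, y)))$.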

Result: $\equiv_S$-equivalent queries
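
The rewriting can be checked on a concrete graph with a short Python sketch. It assumes two employees with invented addresses (the slide elides the data for :Yacob), models a query result as a `Counter` of rows, and uses `None` for an unbound variable; the function names are illustrative only.

```python
from collections import Counter

TYPE = "rdf:type"


def employees(graph):
    return sorted(s for (s, p, o) in graph if p == TYPE and o == ":Employee")


def addresses(graph, x):
    return sorted(o for (s, p, o) in graph if s == x and p == ":hasAddress")


def original_query(graph):
    """Dist(Proj_xy(Opt(Employee(x), hasAddress(x, y)))): left join, then dedupe."""
    rows = []
    for x in employees(graph):
        ys = addresses(graph, x)
        if ys:
            rows.extend((x, y) for y in ys)
        else:
            rows.append((x, None))  # OPTIONAL keeps x and leaves y unbound
    return Counter(set(rows))       # Dist: every row gets multiplicity 1


def rewritten_query(graph):
    """Proj_xy(Join(Employee(x), hasAddress(x, y))): inner join, no dedupe."""
    return Counter((x, y) for x in employees(graph) for y in addresses(graph, x))


conforming = {
    (":Ida", TYPE, ":Employee"), (":Ida", ":hasAddress", "Oslo"),
    (":Yacob", TYPE, ":Employee"), (":Yacob", ":hasAddress", "Bergen"),
}
# An employee without an address violates (=1 hasAddress.T).
violating = conforming | {(":Nils", TYPE, ":Employee")}
```

On the conforming graph both functions return the same bag; on the violating graph they differ, because the OPTIONAL version still returns :Nils with an unbound address. This illustrates why the equivalence is only claimed for graphs that satisfy the shape.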

Optimization: Example of Rewriting Rules
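Lemma
Let $\langle s, \tau_s, \phi_s \rangle \in S$ with $(\geq_n P.\top) \in \phi_s$ s.t. $n \geq 1$, and $P$ a graph pattern s.t. $T \blacktriangleleft P$ (below, a bare $P$ is the graph pattern, and $P(x, y)$ is the triple pattern over the property $P$ of the constraint). If $y \notin \mathrm{var}(P)$, then
1. $\mathrm{Opt}_F(P, P(x, y)) \equiv_S \mathrm{Filter}_F(\mathrm{Join}(P, P(x, y)))$
2. $\mathrm{Join}(P, P(x, y)) \cong_{S,y} P$
where
$T = \begin{cases} C(x), & \text{if } \tau_s = C, \\ R(x, z), & \text{if } \tau_s = \exists R, \\ R^-(x, z), & \text{if } \tau_s = \exists R^-. \end{cases}$
Corollary
If $y \notin \mathrm{var}(P \cup F)$, then
1. $\mathrm{Filter}_F(\mathrm{Join}(P, P(x, y))) \cong_{S,y} \mathrm{Filter}_F(P)$
2. $\mathrm{Opt}_F(P, P(x, y)) \cong_{S,y} \mathrm{Filter}_F(P)$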

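Corollary
$\mathrm{Filter}_F(\mathrm{Join}(P, P(x, y))) \cong_{S,y} \mathrm{Filter}_F(P)$.
$Q = \mathrm{Dist}(\mathrm{Proj}_{xy}(\mathrm{Filter}_{\mathrm{regex}(y, \text{"Smith"})}(\mathrm{Join}(\mathit{Student}(x)\ \mathit{lastName}(x, y), \mathit{hasAddress}(x, z)))))$
Consider $Q$ over $G \models \langle \mathit{Student}, \texttt{:Student}, (\geq_n \mathit{hasAddress}.\top) \rangle$.
Then, by the Corollary, we can reduce query $Q$ to:
$\mathrm{Dist}(\mathrm{Proj}_{xy}(\mathrm{Filter}_{\mathrm{regex}(y, \text{"Smith"})}(\mathit{Student}(x)\ \mathit{lastName}(x, y))))$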
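
A Python sketch of this join elimination on invented data (students `:s1` to `:s3` are assumptions for the example). Results are modelled as sets because both queries end in Dist; `pairs`, `q_original` and `q_reduced` are illustrative names.

```python
import re


def pairs(graph, prop):
    """All (subject, object) pairs of triples with the given predicate."""
    return [(s, o) for (s, p, o) in graph if p == prop]


def students(graph):
    return {s for (s, o) in pairs(graph, "rdf:type") if o == ":Student"}


def q_original(graph):
    """Query Q: the join with hasAddress(x, z) is still present."""
    return {(x, y)
            for (x, y) in pairs(graph, ":lastName") if x in students(graph)
            for (x2, z) in pairs(graph, ":hasAddress") if x2 == x
            if re.search("Smith", y)}


def q_reduced(graph):
    """Reduced query: the hasAddress join has been dropped."""
    return {(x, y)
            for (x, y) in pairs(graph, ":lastName") if x in students(graph)
            if re.search("Smith", y)}


conforming = {
    (":s1", "rdf:type", ":Student"), (":s1", ":lastName", "Smith"),
    (":s1", ":hasAddress", "Oslo"), (":s1", ":hasAddress", "Bergen"),
    (":s2", "rdf:type", ":Student"), (":s2", ":lastName", "Jones"),
    (":s2", ":hasAddress", "Oslo"),
}
# A student with no address violates the (>= n hasAddress.T) constraint for n >= 1.
violating = conforming | {(":s3", "rdf:type", ":Student"),
                          (":s3", ":lastName", "Smithson")}
```

When every student has at least one address, the join only repeats rows that Dist removes again, so both queries agree. On the violating graph the reduced query additionally returns `:s3`, so the rewriting is only sound under the shape.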

Property of Query Rewriting Rules
▶ Propagation to Larger Queries
▶ Confluent Reduction

Property of Query Rewriting Rules: Propagation
Definition
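Let $Q$ be a SPARQL query, and let $P$ and $U$ be two graph patterns.
Then, we write $U \mathrel{\underset{\sim}{\lhd}} Q$ if $\mathrm{Dist}(\mathrm{Proj}_X(P)) \unlhd Q$ and $U \unlhd P$.
Theorem
Let $Q$ be a SPARQL query and $S$ a SHACL document. Let $U$ and $V$ be two graph patterns. Then,
1. $Q \equiv_S Q_{U \mapsto V}$ if $U \equiv_S V$
2. $\mathrm{Proj}_X(Q) \equiv_S \mathrm{Proj}_X(Q)_{U \mapsto V}$ if $U \unlhd Q$, $U \equiv_{S,y} V$ and $y \notin \mathrm{var}(\mathrm{Proj}_X(Q) \setminus U)$
3. $Q \equiv_S Q_{U \mapsto V}$ if $U \mathrel{\underset{\sim}{\lhd}} Q$, $U \cong_{S,y} V$ and $y \notin \mathrm{var}(Q \setminus U)$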

Property of Query Rewriting Rules: Confluent Reduction
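Consider the SPARQL query,
$\mathrm{Dist}(\mathrm{Proj}_{xy}(\mathit{employeeID}(x, y)\ \mathit{hiredBy}(x, k)\ \mathit{insuredBy}(x, z)))$
over a graph $G$ s.t. $G \models \langle \exists \mathit{employeeID}, \tau_{\exists \mathit{employeeID}}, \phi_{\exists \mathit{employeeID}} \rangle$ with
$\{(\triangleright_{\exists \mathit{employeeID}}\,\mathit{employeeID}), (=_1 \mathit{insuredBy}.\top), (\mathit{hiredBy} = \mathit{insuredBy})\} \subseteq \phi_{\exists \mathit{employeeID}}$.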

Subsequently,
$(=_1 \mathit{insuredBy}.\top) \wedge (\mathit{hiredBy} = \mathit{insuredBy}) \longrightarrow (=_1 \mathit{hiredBy}.\top)$
$(=_1 \mathit{hiredBy}.\top) \longrightarrow (\geq_1 \mathit{hiredBy}.\top)$
$(=_1 \mathit{insuredBy}.\top) \longrightarrow (\geq_1 \mathit{insuredBy}.\top)$
We need to take care of all explicit and implicit rewriting rules.

Then, the query is subject to the following rewriting rules:
1. $\cong_{S,y}$ - "Join" optimization based on $(\geq_1 \mathit{insuredBy}.\top)$
2. $\cong_{S,y}$ - "Join" optimization based on $(\geq_1 \mathit{hiredBy}.\top)$
3. $\cong_{S,y}$ - "Join" optimization based on $(\mathit{hiredBy} = \mathit{insuredBy})$
4. $\equiv_{S,y}$ - "Join" optimization based on $(=_1 \mathit{insuredBy}.\top)$
5. $\equiv_{S,y}$ - "Join" optimization based on $(=_1 \mathit{hiredBy}.\top)$
6. $\equiv_S$ - "Dist" optimization based on $(\triangleright_{\exists \mathit{employeeID}}\,\mathit{employeeID})$

Regardless of the sequence in which these rewrites are applied, we get:
$\mathrm{Proj}_{xy}(\mathit{employeeID}(x, y))$

... As rewriting optimizations are generalized in the form of lemmas and their consequences, we state the confluence results as follows:
Theorem
Query rewriting defined by Lemmas 1 to 6 is a confluent reduction.
Theorem
Query rewriting defined by Lemmas 1 to 7 is a confluent reduction iff
$\phi' = \begin{cases} \top, & \text{if } P = T, \\ \bigwedge_{i=1}^{n} (=_1 P_i.\top), & \text{if } P = (T\ P_1(x, z_1) \ldots P_i(x, z_i) \ldots P_n(x, z_n)) \end{cases}$
in Lemma 7.

Other or future work?
▶ Extension to SPARQL Property Path Queries
▶ Optimization of Ontology-Mediated Query Answering
