Over the last two decades, the amount of data which has been created, published and managed using Semantic Web standards and especially via Resource Description Framework (RDF) has been increasing.
As a result, efficient processing of such big RDF datasets has become challenging.
Indeed, these processes require, both efficient storage strategies and query-processing engines, to be able to scale in terms of data size.
In this study, we propose a scalable approach to evaluate SPARQL queries over distributed RDF datasets using a semantic-based partition and is implemented inside the state-of-the-art RDF processing framework: SANSA.
An evaluation of the performance of our approach in processing large-scale RDF datasets is also presented.
The preliminary results of the conducted experiments show that our approach can scale horizontally and perform well as compared with the previous Hadoop-based system.
It is also comparable with the in-memory SPARQL query evaluators when there is less shuffling involved.
Cancer cell metabolism: special Reference to Lactate Pathway
Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation - SEMANTiCS 2019 talk
1. Towards A Scalable Semantic-based Distributed
Approach for SPARQL query evaluation
Gezim Sejdiu, Damien Graux, Imran Khan, Ioanna Lytra, Hajira Jabeen, and Jens Lehmann
2. 2 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
❖ Introduction
❖ Approach
➢ System Architecture Overview
➢ Distributed Algorithm Description
❖ Evaluation
❖ Conclusion and Future work
Outline
3. 3 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Introduction
❖ Over the last years, the size of the Semantic Web has increased and
several large-scale datasets were published
4. 4 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Introduction
Source: LOD-Cloud (http://lod-cloud.net/ )
❖ Over the last years, the size of the Semantic Web has increased and
several large-scale datasets were published
➢ As of August 2018
5. 5 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Introduction
Source: LOD-Cloud (http://lod-cloud.net/ )
❖ Over the last years, the size of the Semantic Web has increased and
several large-scale datasets were published
➢ Based on LOD Stats (http://lodstats.aksw.org/ )
~10, 000 datasets
Openly available online
using Semantic Web
standards
6. 6 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Introduction
Source: LOD-Cloud (http://lod-cloud.net/ )
❖ Over the last years, the size of the Semantic Web has increased and
several large-scale datasets were published
➢ Based on LOD Stats (http://lodstats.aksw.org/ )
~10, 000 datasets
Openly available online
using Semantic Web
standards
many datasets
RDFized and kept private
(e.g. Supply chain,
manufacture, ethereum
dataset, etc.)
7. 7 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Introduction
Source: LOD-Cloud (http://lod-cloud.net/ )
❖ Dealing with such amount of data makes many tasks hard to be
solved on single machines
8. 8 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Introduction
Source: LOD-Cloud (http://lod-cloud.net/ )
❖ Dealing with such amount of data makes many tasks hard to be
solved on single machines
Supply
Chain Mng.
Find relevant
information
about
motorcycle part
9. 9 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Introduction
Source: LOD-Cloud (http://lod-cloud.net/ )
❖ Dealing with such amount of data makes many tasks hard to be
solved on single machines
Media
Publishing
Discover what
others say
about news
stories
Supply
Chain Mng.
Find relevant
information
about
motorcycle part
10. 10 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Introduction
Source: LOD-Cloud (http://lod-cloud.net/ )
❖ Dealing with such amount of data makes many tasks hard to be
solved on single machines
Media
Publishing
Discover what
others say
about news
stories
Exploring
Tourism
Exploring the
neighborhood
(events, point
of interests,
etc)
Supply
Chain Mng.
Find relevant
information
about
motorcycle part
11. 11 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Introduction
Source: LOD-Cloud (http://lod-cloud.net/ )
❖ Dealing with such amount of data makes many tasks hard to be
solved on single machines
Media
Publishing
Discover what
others say
about news
stories
Exploring
Tourism
Exploring the
neighborhood
(events, point
of interests,
etc)
Entity
Linking
Which datasets
are good
candidates for
interlinking?
Supply
Chain Mng.
Find relevant
information
about
motorcycle part
12. 12 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Introduction
Source: LOD-Cloud (http://lod-cloud.net/ )
❖ Dealing with such amount of data makes many tasks hard to be
solved on single machines
Media
Publishing
Discover what
others say
about news
stories
Exploring
Tourism
Exploring the
neighborhood
(events, point
of interests,
etc)
Entity
Linking
Which datasets
are good
candidates for
interlinking?
Supply
Chain Mng.
Find relevant
information
about
motorcycle part
These processes require, both efficient storage
strategies and query processing engine.
13. 13 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
SANSA Overview
❖ SANSA [1] is a processing data flow engine that provides data
distribution, and fault tolerance for distributed computations over
RDF large-scale datasets
❖ SANSA includes several libraries:
➢ Read / Write RDF / OWL library
➢ Querying library
➢ Inference library
➢ ML- Machine Learning core library
BigDataEurope
Inference
Knowledge Distribution &
Representation
DeployCoreAPIs&Libraries
Local Cluster
Standalone Resource manager
Querying
Machine Learning
14. 14 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
Query Layer
Semantic
map map
Distributed Data
Structures
Results
RDFData
SELECT ?p WHERE {
?p :owns ?c .
?c :madeIn ?Ingolstadt .
}
SPARQL
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview
15. 15 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
RDFData
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
16. 16 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
System Architecture Overview (workflow)
SANSA Engine
RDF Layer
Data Ingestion
RDFData
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
17. 17 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
RDFData
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
18. 18 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
RDFData
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
System Architecture Overview (workflow)
19. 19 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
RDFData
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
System Architecture Overview (workflow)
20. 20 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
RDFData
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
System Architecture Overview (workflow)
21. 21 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
RDFData
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
System Architecture Overview (workflow)
22. 22 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
RDFData
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
23. 23 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
RDFData
SELECT ?p WHERE {
?p :owns ?c .
?c :madeIn ?Ingolstadt .
}
SPARQL
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
24. 24 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
Query Layer
Semantic
map map
RDFData
SELECT ?p WHERE {
?p :owns ?c .
?c :madeIn ?Ingolstadt .
}
SPARQL
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
25. 25 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
Query Layer
Semantic
map map
RDFData
SELECT ?p WHERE {
?p :owns ?c .
?c :madeIn ?Ingolstadt .
}
SPARQL
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
#1
select (p=:owns)
26. 26 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
Query Layer
Semantic
map map
RDFData
SELECT ?p WHERE {
?p :owns ?c .
?c :madeIn ?Ingolstadt .
}
SPARQL
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
#1
select (p=:owns)
#2
select (p=:madeIn,
o=:Ingolstadt)
27. 27 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
Query Layer
Semantic
map map
RDFData
SELECT ?p WHERE {
?p :owns ?c .
?c :madeIn ?Ingolstadt .
}
SPARQL
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
#1
select (p=:owns)
#2
select (p=:madeIn,
o=:Ingolstadt)
#3
join (?c)
28. 28 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
Query Layer
Semantic
map map
RDFData
SELECT ?p WHERE {
?p :owns ?c .
?c :madeIn ?Ingolstadt .
}
SPARQL
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
#1
select (p=:owns)
#2
select (p=:madeIn,
o=:Ingolstadt)
#3
join (?c)
#4
project (?p)
29. 29 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
Query Layer
Semantic
map map
Distributed Data
Structures
RDFData
SELECT ?p WHERE {
?p :owns ?c .
?c :madeIn ?Ingolstadt .
}
SPARQL
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
30. 30 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Approach
SANSA Engine
RDF Layer
Data Ingestion
Partitioning
Query Layer
Semantic
map map
Distributed Data
Structures
Results
RDFData
SELECT ?p WHERE {
?p :owns ?c .
?c :madeIn ?Ingolstadt .
}
SPARQL
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memeberOf Volkswagen
Ingolstadt :cityOf Germany
System Architecture Overview (workflow)
31. 31 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
❖ Experimental Setup
➢ Cluster configuration
■ 6 machines (1 master, 5 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 Cores),
128 GB RAM, 12 TB SATA RAID-5, Spark-2.4.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8
➢ Datasets (all in nt format)
➢ Distributed SPARQL query evaluators we compare with:
■ SHARD [2], SPARQLGX-SDE [3], and Sparklify [4]
Evaluation
LUBM WatDiv
1K 2K 3K 10M 100M
#nr. of triples 138,280,374 276,349,040 414,493,296 10,916,457 108,997,714
size (GB) 24 49 70 1.5 15
34. 34 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Runtime (s) (mean)
Queries SHARD SPARQLGX-SDE SANSA.Sparklify SANSA.Semantic (ours)
C3 n/a 38.79 72.94 90.48
F3 n/a 38.41 74.69 n/a
L3 n/a 21.05 73.16 72.84
S3 n/a 26.27 70.1 79.7
C3 n/a 181.51 96.59 300.82
F3 n/a 162.86 91.2 n/a
L3 n/a 84.09 82.17 189.89
S3 n/a 123.6 93.02 176.2
Q1 774.93 103.74 103.57 226.21
Q2 fail fail 3348.51 329.69
Q3 772.55 126.31 107.25 235.31
Q4 988.28 182.52 111.89 294.8
Q5 771.69 101.05 100.37 226.21
Q6 fail 73.05 100.72 207.06
Q7 fail 160.94 113.03 277.08
Q8 fail 179.56 114.83 309.39
Q9 fail 204.62 114.25 326.29
Q10 780.05 106.26 110.18 232.72
Q11 783.2 112.23 105.13 231.36
Q12 fail 159.65 105.86 283.53
Q13 778.16 100.06 90.87 220.28
Q14 688.44 74.64 100.58 204.43
Performance analysis on large-scale RDF datasets
Evaluation
Watdiv-10MWatdiv-100MLUBM-1K
We can see that our
approach
(SANSA.Semantic)
evaluates most of the
queries as opposed to
SHARD. SHARD system
fails to evaluate most of
the LUBM queries and
its parser does not
support Watdiv queries.
35. 35 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Runtime (s) (mean)
Queries SHARD SPARQLGX-SDE SANSA.Sparklify SANSA.Semantic (ours)
C3 n/a 38.79 72.94 90.48
F3 n/a 38.41 74.69 n/a
L3 n/a 21.05 73.16 72.84
S3 n/a 26.27 70.1 79.7
C3 n/a 181.51 96.59 300.82
F3 n/a 162.86 91.2 n/a
L3 n/a 84.09 82.17 189.89
S3 n/a 123.6 93.02 176.2
Q1 774.93 103.74 103.57 226.21
Q2 fail fail 3348.51 329.69
Q3 772.55 126.31 107.25 235.31
Q4 988.28 182.52 111.89 294.8
Q5 771.69 101.05 100.37 226.21
Q6 fail 73.05 100.72 207.06
Q7 fail 160.94 113.03 277.08
Q8 fail 179.56 114.83 309.39
Q9 fail 204.62 114.25 326.29
Q10 780.05 106.26 110.18 232.72
Q11 783.2 112.23 105.13 231.36
Q12 fail 159.65 105.86 283.53
Q13 778.16 100.06 90.87 220.28
Q14 688.44 74.64 100.58 204.43
Performance analysis on large-scale RDF datasets
Evaluation
Watdiv-10MWatdiv-100MLUBM-1K
On the other hand,
SPARQLGX-SDE
performs better than
both Sparklify and our
approach, when the
size of the dataset is
considerably small (e.g.
less than 25GB). This
behavior is due to the
large partitioning
overhead for Sparklify
and our approach.
36. 36 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Runtime (s) (mean)
Queries SHARD SPARQLGX-SDE SANSA.Sparklify SANSA.Semantic (ours)
C3 n/a 38.79 72.94 90.48
F3 n/a 38.41 74.69 n/a
L3 n/a 21.05 73.16 72.84
S3 n/a 26.27 70.1 79.7
C3 n/a 181.51 96.59 300.82
F3 n/a 162.86 91.2 n/a
L3 n/a 84.09 82.17 189.89
S3 n/a 123.6 93.02 176.2
Q1 774.93 103.74 103.57 226.21
Q2 fail fail 3348.51 329.69
Q3 772.55 126.31 107.25 235.31
Q4 988.28 182.52 111.89 294.8
Q5 771.69 101.05 100.37 226.21
Q6 fail 73.05 100.72 207.06
Q7 fail 160.94 113.03 277.08
Q8 fail 179.56 114.83 309.39
Q9 fail 204.62 114.25 326.29
Q10 780.05 106.26 110.18 232.72
Q11 783.2 112.23 105.13 231.36
Q12 fail 159.65 105.86 283.53
Q13 778.16 100.06 90.87 220.28
Q14 688.44 74.64 100.58 204.43
Performance analysis on large-scale RDF datasets
Evaluation
Watdiv-10MWatdiv-100MLUBM-1K
However, Sparklify
performs better
compared to
SPARQLGX-SDE when
the size of the dataset
increases (see
Watdiv-100M results)
and the queries involve
more joins (see
LUBM-1K results). This
is due to the Spark SQL
optimizer and Sparqlify
self-joins optimizers.
37. 37 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Evaluation
Sizeup analysis (on LUBM dataset)
38. 38 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Evaluation
Sizeup analysis (on LUBM dataset)
Query execution
time for our
approach grows
linearly when the
size of the datasets
increases.
SHARD suffers from
the expensive
overhead of
MapReduce joins
which impact its
performance.
39. 39 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Evaluation
Sizeup analysis (on LUBM dataset)
Sparqlify and
SPARQLGX-SDE
performs better
when the size of the
datasets increases
as compared with
our approach and
SHARD.
40. 40 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Evaluation
Node scalability (on LUBM-1K)
41. 41 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Evaluation
Node scalability (on LUBM-1K)
We see that as the
number of nodes
increases, the
runtime cost of our
query engine
decreases linearly as
compared with the
SHARD, which keeps
staying constant.
42. 42 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Evaluation
Node scalability (on LUBM-1K)
SPARQLGX-SDE and
Sparklify perform
better when the
number of nodes
increases compared
to our approach and
SHARD
43. 43 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
❖ Processing large-scale RDF datasets:
➢ data-intensive and computing-intensive
➢ challenge to develop fast and efficient engine that can handle large scale
RDF datasets
❖ A scalable semantic-based query engine (integrated into the larger
SANSA framework) for efficient evaluation of SPARQL queries over
distributed RDF datasets using the Spark framework
❖ As a future work we plan to expand our parser and add statistics to
the query engine while evaluating queries
Conclusion
45. 45 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
[1]. J. Lehmann, G. Sejdiu, L. B¨uhmann, P. Westphal, C. Stadler, I. Ermilov, S. Bin, N. Chakraborty, M. Saleem,
A.-C. N. Ngonga, and H. Jabeen. Distributed Semantic Analytics using the SANSA Stack. In Proceedings of 16th
International Semantic Web Conference - Resources Track (ISWC’2017), 2017.
[2]. K. Rohloff and R. E. Schantz. High-performance, massively scalable distributed systems using the
mapreduce software framework: The shard triple-store. In Programming Support Innovations for Emerging
Distributed Applications, PSI EtA ’10, pages 4:1–4:5, New York, NY, USA, 2010. ACM.
[3]. D. Graux, L. Jachiet, P. Genev`es, and N. Layaida. Sparqlgx: Efficient distributed evaluation of sparql with
apache spark. In P. Groth, E. Simperl, A. Gray, M. Sabou, M. Kr¨otzsch, F. Lecue, F. Fl¨ock, and Y. Gil, editors,
The Semantic Web – ISWC 2016, pages 80–87, Cham, 2016.
[4]. C. Stadler, G. Sejdiu, D. Graux, and J. Lehmann. Sparklify: A Scalable Software Component for Efficient
evaluation of SPARQL queries over distributed RDF datasets. In Proceedings of 18th International Semantic
Web Conference (ISWC’2019), 2019.
References
46. 46 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Backup Slides
47. 47 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn
Evaluation
Overall analysis of queries on LUBM-1K dataset (cluser mode)