Efficient Distributed
In-Memory Processing of
RDF Datasets
Gezim Sejdiu
PhD Colloquium, Bonn 29.09.2020
Supervisor: Prof. Dr. Jens Lehmann
Introduction
Large-Scale RDF Dataset Statistics
Quality Assessment of RDF Datasets at Scale
Scalable RDF Querying
Use Cases and Applications
Conclusion & Future Directions
Outline
2
Introduction
Get me there!
3
No single definition
Extremely large data sets that may be analysed computationally to
reveal patterns, trends, and associations, especially relating to human
behaviour and interactions
Big data is a term for data sets that are so large or complex that
traditional data processing application software is inadequate to deal
with them
What is Big Data?
4
Its relevance is increasing drastically, and Big Data Analytics is an
emerging field to explore
Why is ‘Big Data’ so important?
5
https://trends.google.com/trends/explore?date=all&q=%22big%20data%22
6
7
© Sorpresa meme on Memegen
Big Data Europe (BDE) Platform
8
https://github.com/big-data-europe
Layers of the BDE platform:
- Support Layer: Init Daemon, GUIs, Monitor
- App Layer: Traffic Forecast, Satellite Image Analysis, Real-time Stream Monitoring, ...
- Platform Layer: Spark, Flink, Kafka, ...; Semantic Layer: Ontario, SANSA, Semagrow
- Data Layer: Hadoop, NoSQL Stores (Cassandra, Elasticsearch, ...), RDF Store
- Resource Management Layer (Swarm)
- Hardware Layer: Premises or Cloud (AWS, GCP, MS Azure, …)
Fast and general-purpose cluster computing engine
Apache Spark
9
Spark Core Engine (RDD)
- Core APIs & Libraries: Spark SQL & DataFrames, Spark Streaming, MLlib (Machine Learning), GraphX (Graph processing)
- Deploy: Local (single JVM), Cluster (Standalone, Mesos, YARN), Containers (docker-compose)
Allows for massive parallel processing of
collections of records
- RDD - Resilient Distributed Dataset
- DataFrame - Conceptually a table
- Dataset - Unified access to data as objects
and/or tables
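As a concrete illustration of the three abstractions listed above, here is a minimal, self-contained Scala sketch (not taken from the slides); the Person record, the sample values and the local[*] master are invented for illustration only.

import org.apache.spark.sql.SparkSession

object SparkAbstractions {
  // Hypothetical record type used only for this illustration
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-abstractions-sketch")
      .master("local[*]")               // local, single-JVM deploy mode
      .getOrCreate()
    import spark.implicits._

    // RDD: a resilient distributed collection of records
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

    // DataFrame: conceptually a table with named columns
    val df = rdd.toDF()
    df.filter($"age" > 26).show()

    // Dataset: typed, object-oriented access to the same table
    val ds = df.as[Person]
    ds.map(_.name).show()

    spark.stop()
  }
}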
Heterogeneity aka Variety
Key Observation From BDE
10
Figure: a single customer's data footprint spread across domains: banking and finance (VISA, Chase, SAP, IBM), purchases (Nordstrom, Amazon, Lowe's), entertainment (Netflix, Hulu, NFL Network), gaming (Zynga, Xbox 360), social media (Facebook, Pinterest, Twitter), and our known history
Modelling entities and their relationships
The RDF (Resource Description Framework) model
Knowledge Graphs
11
Figure: an example knowledge graph: DPDHL has full name "Deutsche Post DHL Group", industry Logistics, label "Logistik" and headquarters PostTower, which is located in Bonn
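To make the triple model concrete, here is a small, hedged Scala sketch (not from the slides) that encodes the DPDHL facts above as an in-memory RDD of subject-predicate-object records; the simplified Triple case class and the property names (fullName, locatedIn, ...) are placeholders rather than actual vocabulary terms.

import org.apache.spark.sql.SparkSession

object KnowledgeGraphExample {
  // Simplified triple representation used only for this sketch
  case class Triple(s: String, p: String, o: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kg-sketch").master("local[*]").getOrCreate()

    // The DPDHL facts from the slide, encoded as plain triples
    val triples = spark.sparkContext.parallelize(Seq(
      Triple("DPDHL", "fullName", "Deutsche Post DHL Group"),
      Triple("DPDHL", "industry", "Logistics"),
      Triple("DPDHL", "label", "Logistik"),
      Triple("DPDHL", "headquarters", "PostTower"),
      Triple("PostTower", "locatedIn", "Bonn")
    ))

    // Example: follow two hops -- DPDHL's headquarters, and where it is located
    val hq = triples.filter(t => t.s == "DPDHL" && t.p == "headquarters").map(_.o).first()
    val city = triples.filter(t => t.s == hq && t.p == "locatedIn").map(_.o).first()
    println(s"DPDHL is headquartered in $hq, located in $city")

    spark.stop()
  }
}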
Modelling entities and their relationships
Analysis: finding the underlying structure of the graph, e.g. to predict
unknown relationships
Examples: Google Knowledge Graph, DBpedia, Facebook, YAGO,
Twitter, LinkedIn, MS Academic Graph, IBM Graph, WikiData
Knowledge Graphs
12
Knowledge Graphs are everywhere
13
Entity Search and Summarization
Discovering Related Entities
Tasks that are hard to solve on single machines (>1 TB memory
consumption):
- Querying and processing LinkedGeoData
- Dataset statistics and quality assessment of the LOD Cloud
- Vandalism and outlier detection in DBpedia
- Inference on life science data (e.g. UniProt, EggNOG, StringDB)
- Clustering of DBpedia data
- Large-scale enrichment and link prediction for e.g. DBpedia →
LinkedGeoData
Why Distributed RDF Data Processing?
14
Main Research Question
Is it possible to process large-scale RDF
datasets efficiently and effectively?
15
RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
RQ2: Can we scale RDF dataset quality assessment horizontally?
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
Research Questions
16
RC1: A Scalable Distributed Approach for Computation of RDF Dataset
Statistics
RC2: A Scalable Framework for Quality Assessment of RDF Datasets
RC3: A Scalable Framework for SPARQL Evaluation of Large RDF Data
Contributions
17
SANSA
Scalable Semantic Analytics Stack
18
SANSA [1] is a processing data flow engine that provides data
distribution and fault tolerance for distributed computation over
large-scale RDF datasets
SANSA includes several libraries:
- Read / Write RDF / OWL library
- Querying library
- Inference library
- ML library
SANSA
19
Figure: the SANSA stack: layers for Knowledge Distribution & Representation, Querying, Inference and Machine Learning, exposed as Core APIs & Libraries and deployable locally or on a cluster (standalone resource manager); developed within BigDataEurope
RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
RQ2: Can we scale RDF dataset quality assessment horizontally?
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
Research Questions
20
Large-Scale RDF Dataset
Statistics
A Scalable Distributed Approach for
Computation of RDF Dataset Statistics [2]
21
To obtain an overview of the Web of Data, it is important to gather
statistical information describing the characteristics of the internal
structure of datasets
This process is both data-intensive and computing-intensive, and it is a
challenge to develop fast and efficient algorithms that can handle
large-scale RDF datasets
There is no existing approach for RDF that computes these statistical
criteria and scales to large datasets
Motivation
22
A statistical criterion C is a triple C = (F, D, P), where:
- F is a SPARQL filter condition
- D is a derived dataset from the main dataset (RDD of triples) after
applying F
- P is a post-processing operation on the data structure D
RDDs are in-memory collections of records that can be operated on in
parallel on large clusters
- We use RDDs to represent RDF triples (a minimal sketch follows below)
Approach
23
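The sketch below is a simplified illustration, not the actual DistLODStats code: it shows how one criterion, counting how often each class is used, maps onto C = (F, D, P) over an RDD of triples. The Triple case class and the example data are assumptions made for the sketch.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object UsedClassesCriterion {
  case class Triple(s: String, p: String, o: String)   // simplified triple type for this sketch

  val RdfType = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

  // C = (F, D, P) for a "class usage count" criterion
  def classUsageCount(triples: RDD[Triple]): RDD[(String, Long)] = {
    val filtered = triples.filter(_.p == RdfType)        // F: filter condition
    val derived  = filtered.map(t => (t.o, 1L))          // D: derived dataset
    derived.reduceByKey(_ + _)                           // P: post-processing (count per class)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("criterion-sketch").master("local[*]").getOrCreate()
    val triples = spark.sparkContext.parallelize(Seq(
      Triple("ex:Bonn", RdfType, "ex:City"),
      Triple("ex:Berlin", RdfType, "ex:City"),
      Triple("ex:DPDHL", RdfType, "ex:Company")
    ))
    classUsageCount(triples).collect().foreach(println)  // (ex:City,2), (ex:Company,1)
    spark.stop()
  }
}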
Architecture Overview
24
Experimental Setup
- Cluster configuration
- 6 machines (1 master, 5 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
(32 Cores), 128 GB RAM, 12 TB SATA RAID-5
- Spark-2.2.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8
- Datasets (all in nt format)
Evaluation
25
                 LinkedGeoData   DBpedia_en    DBpedia_de    DBpedia_fr    BSBM_2GB    BSBM_20GB   BSBM_200GB
#nr. of triples  1,292,933,812   812,545,486   336,714,883   340,849,556   8,289,484   81,980,472  817,774,057
size (GB)        191.17          114.4         48.6          49.77         2           20          200
Distributed Processing on Large-Scale Datasets
* e) = c) / d) - 1
Evaluation
26
Runtime (in hours)
                LODStats                  DistLODStats
                a) files    b) bigfile    c) local    d) cluster    e) speedup ratio
LinkedGeoData   n/a         n/a           36.65       4.37          7.4x
DBpedia_en      24.63       fail          25.34       2.97          7.6x
DBpedia_de      n/a         n/a           10.34       1.2           7.3x
DBpedia_fr      n/a         n/a           10.49       1.27          7.3x
Performance evaluation of DistLODStats
Evaluation
27
Figures: node scalability (BSBM-50GB) and sizeup scalability
RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
RQ2: Can we scale RDF dataset quality assessment horizontally?
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
Research Questions
28
Quality Assessment of RDF
Datasets at Scale
A Scalable Framework for Quality Assessment of
RDF Datasets [3]
29
Assessing data quality is of paramount importance to judge its fitness
for a particular use case
Existing solutions cannot evaluate data quality metrics on medium- or
large-scale datasets
→ This is actually where they are most important
Motivation
30
Quality Assessment Pattern (QAP)
- A reusable template to implement and design scalable quality
metrics
Approach
31
QualityMetric (QM) := Action | (QM OP Action)
OP := ∗ | − | / | +
Action := Count(Transformation)
Transformation := Rule(Filter) | (Transformation BOP Transformation)
Filter := getPredicates∼?p | getSubjects∼?s | getObjects∼?o | getDistinct(Filter)
          | Filter or Filter | Filter && Filter
Rule := isURI(Filter) | isIRI(Filter) | isInternal(Filter) | isLiteral(Filter)
        | !isBroken(Filter) | hasPredicateP | hasLicenceAssociated(Filter)
        | hasLicenceIndications(Filter) | isExternal(Filter) | hasType(Filter)
        | isLabeled(Filter)
BOP := ∩ | ∪
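As an illustration only (this is not the framework's actual API), the following Scala sketch expresses a simple "ratio of literal objects" metric in QAP terms: a Transformation that filters triples, an Action that counts them, and a metric combining two actions with the '/' operator. The Triple case class with an oIsLiteral flag is an assumption of the sketch.

import org.apache.spark.rdd.RDD

object QapSketch {
  case class Triple(s: String, p: String, o: String, oIsLiteral: Boolean)  // simplified for this sketch

  // Transformation := Rule(Filter): keep triples whose object is a literal
  def literalObjects(triples: RDD[Triple]): RDD[Triple] = triples.filter(_.oIsLiteral)

  // Action := Count(Transformation)
  def count(transformed: RDD[Triple]): Double = transformed.count().toDouble

  // QualityMetric := Action OP Action, here a ratio (OP = '/')
  def literalRatio(triples: RDD[Triple]): Double = {
    val a1 = count(literalObjects(triples))
    val a2 = count(triples)
    if (a2 == 0) 0.0 else a1 / a2
  }
}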
Architecture Overview
32
Definition
● Define quality dimensions
● Define quality metrics, thresholds and other configurations
Figure: the SANSA engine ingests the RDF data into distributed data structures, runs the QAP-based quality assessment, and exposes the results for analysis via SANSA-Notebooks and the Data Quality Vocabulary (DQV)
Experimental Setup
- Cluster configuration
- 7 machines (1 master, 6 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
(32 Cores), 128 GB RAM, 12 TB SATA RAID-5, Spark-2.4.0, Hadoop 2.8.0, Scala
2.11.11 and Java 8
Local mode: single instance of the cluster
- Datasets (all in .nt format)
Evaluation
33
                 LinkedGeoData   DBpedia_en    DBpedia_de    DBpedia_fr    BSBM_2GB    BSBM_20GB   BSBM_200GB
#nr. of triples  1,292,933,812   812,545,486   336,714,883   340,849,556   8,289,484   81,980,472  817,774,057
size (GB)        191.17          114.4         48.6          49.77         2           20          200
Evaluation
34
Runtime (in minutes)
                 Luzzu                     DistQualityAssessment
                 a) single    b) joint     c) local    d) cluster
Large-scale:
LinkedGeoData    Fail         Fail         446.9       7.79
DBpedia_en       Fail         Fail         274.31      1.99
DBpedia_de       Fail         Fail         61.4        0.46
DBpedia_fr       Fail         Fail         195.3       0.38
BSBM_200GB       Fail         Fail         454.46      7.27
Small to medium:
BSBM_0.01GB      2.64         2.65         0.04        0.42
BSBM_0.05GB      16.38        15.39        0.05        0.46
BSBM_0.1GB       40.59        37.94        0.06        0.44
BSBM_0.5GB       459.19       468.64       0.15        0.48
BSBM_1GB         1454.16      1532.95      0.4         0.56
BSBM_2GB         Timeout      Timeout      3.19        0.62
BSBM_10GB        Timeout      Timeout      29.44       0.52
BSBM_20GB        Fail         Fail         34.32       0.75
Performance evaluation of DistQualityAssessment
Evaluation
35
Figures: node scalability (BSBM-200GB) and sizeup scalability
RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
RQ2: Can we scale RDF dataset quality assessment horizontally?
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
Research Questions
36
Scalable RDF Querying
Sparklify: A Scalable Software Component for
Efficient evaluation of SPARQL queries over
distributed RDF datasets* [4]
37
* Joint work with Claus Stadler, a PhD student at the University of Leipzig.
Existing solutions are limited to simple RDF constructs only;
hence, they do not exploit the full potential of the knowledge, i.e. RDF
terms
Can we reuse existing Ontology-Based Data Access (OBDA) tooling to
facilitate running SPARQL queries on RDF kept in Apache Spark?
Motivation
38
Sparklify: Architecture Overview
39
Figure: Sparklify inside the SANSA engine: the RDF layer handles data ingestion and partitioning of the RDF data, the query layer ("Sparklifying") uses Sparqlify views over the distributed data structures, and the results are returned to the user
SELECT ?s ?w WHERE {
?s a dbp:Person .
?s ex:workPage ?w .
}
SPARQL
Prefix dbp:<http://dbpedia.org/ontology/>
Prefix ex:<http://ex.org/>
Create View view_person As
Construct {
?s a dbp:Person .
?s ex:workPage ?w .
}
With
?s = uri('http://mydomain.org/person', ?id)
?w = uri(?work_page)
Constrain
?w prefix "http://my-organization.org/user/"
From
person;
SELECT id, work_page
FROM view_person ;
Query rewriting pipeline: SPARQL query → SPARQL Algebra Expression Tree (AET) → normalize AET → SQL
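A hedged Scala sketch of the underlying idea, not Sparqlify's actual rewriting machinery: triples are partitioned by predicate and registered as Spark SQL tables, and the basic graph pattern from the SPARQL query above is answered with a SQL join. The view names, the simplified Triple type and the sample data are assumptions of the sketch.

import org.apache.spark.sql.SparkSession

object SparqlToSqlSketch {
  case class Triple(s: String, p: String, o: String)  // simplified triple type for this sketch

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparql-to-sql-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val triples = Seq(
      Triple("ex:alice", "rdf:type", "dbp:Person"),
      Triple("ex:alice", "ex:workPage", "http://my-organization.org/user/alice"),
      Triple("ex:bonn",  "rdf:type", "dbp:City")
    ).toDS()

    // Partition by predicate and register each partition as a (s, o) table
    triples.filter($"p" === "rdf:type").select($"s", $"o").createOrReplaceTempView("rdf_type")
    triples.filter($"p" === "ex:workPage").select($"s", $"o").createOrReplaceTempView("work_page")

    // The SPARQL BGP { ?s a dbp:Person . ?s ex:workPage ?w } becomes a SQL join
    spark.sql(
      """SELECT t.s AS s, w.o AS w
        |FROM rdf_type t JOIN work_page w ON t.s = w.s
        |WHERE t.o = 'dbp:Person'""".stripMargin).show()

    spark.stop()
  }
}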
Experimental Setup
- Cluster configuration
- 7 nodes (1 master, 6 workers), each with Intel(R) Xeon(R) CPU E5-2620 v4 @
2.10GHz (32 cores), 128 GB RAM, 12 TB SATA RAID-5, connected via a Gigabit network
- Each experiment executed 3 times; results averaged
- Datasets (all in .nt format)
Evaluation
40
                 LUBM-1K       LUBM-5K       LUBM-10K        WatDiv-10M   WatDiv-100M   WatDiv-1B
#nr. of triples  138,280,374   690,895,862   1,381,692,508   10,916,457   108,997,714   1,099,208,068
size (GB)        24            116           232             1.5          15            150
Evaluation
41
Runtime (s) (mean)
              SPARQLGX-SDE    Sparklify
              a) total        b) partitioning   c) querying   d) total
WatDiv-10M
QC            103.24          134.81            61            195.84
QF            157.8           236.06            107.33        349.51
QL            102.51          241.24            134           370.3
QS            131.16          237.12            108.56        346
WatDiv-1B
QC            partial fail    778.62            2043.66       2829.56
QF            6734.68         1295.3            2576.52       3871.97
QL            2575.72         1275.22           610.66        1886.73
QS            4841.85         1290.72           1552.05       2845.3
Evaluation
42
Runtime (s) (mean), LUBM-10K
       SPARQLGX-SDE    Sparklify
       a) total        b) partitioning   c) querying   d) total
Q1     1056.83         627.72            718.11        1346.8
Q2     fail            595.76            fail          n/a
Q3     1038.62         615.95            648.63        1267.37
Q4     2761.11         632.93            1670.18       2303.18
Q5     1026.94         641.53            564.13        1206.67
Q6     537.65          695.74            267.48        963.62
Q7     2080.67         630.44            1331.13       1967.25
Q8     2636.12         639.93            1647.57       2288.48
Q9     3124.52         583.86            2126.03       2711.24
Q10    1002.56         593.68            693.73        1287.71
Q11    1023.32         594.41            522.24        1118.58
Q12    2027.59         576.31            1088.25       1665.87
Q13    1007.39         626.57            6.66          633.26
Q14    526.15          633.39            258.32        891.89
Performance evaluation of Sparklify
Evaluation
43
Figures: node scalability (WatDiv-100M) and sizeup scalability
Sparklify vs SPARQLGX-SDE per query type performance on WatDiv
100M
Evaluation
44
Query types: QS: star pattern, QL: linear pattern, QF: snowflake, QC: complex pattern
Scalable RDF Querying
Towards A Scalable Semantic-based Distributed
Approach for SPARQL query evaluation [5]
45
Are existing solutions more effective, e.g. when using property tables,
which reduces the number of necessary joins and unions?
What happens when not all subjects in a cluster use all properties?
- Wide property tables may be very sparse, containing many NULL
values, and thus impose a large storage overhead
What about a flatter approach, i.e. partitioning into
subject-based groups (e.g. all triples associated with a
unique subject)?
Motivation
46
Semantic-Based: Architecture Overview
47
Figure: the SANSA engine ingests the RDF data, the RDF layer applies semantic-based (subject) partitioning, and the query layer evaluates SPARQL queries with map operations over the distributed data structures to produce results
SELECT ?p WHERE {
  ?p :owns ?c .
  ?c :madeIn :Ingolstadt .
}
SPARQL
Joy :owns Car1
Joy :livesIn Bonn
Car1 :typeOf Car
Car1 :madeBy Audi
Car1 :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memberOf Volkswagen
Ingolstadt :cityOf Germany
Joy :owns Car1 :livesIn Bonn
Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
Bonn :cityOf Germany
Audi :memberOf Volkswagen
Ingolstadt :cityOf Germany
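A minimal Scala sketch (an illustration, not the SANSA.Semantic implementation) of the subject-based grouping shown above: all predicate-object pairs of a subject are concatenated into one line per subject. The Triple case class and the toy data mirror the example.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SemanticPartitionSketch {
  case class Triple(s: String, p: String, o: String)  // simplified triple type for this sketch

  // One line per subject: "<subject> <p1> <o1> <p2> <o2> ..."
  def semanticPartition(triples: RDD[Triple]): RDD[String] =
    triples.map(t => (t.s, s"${t.p} ${t.o}"))
      .reduceByKey((a, b) => s"$a $b")
      .map { case (subject, rest) => s"$subject $rest" }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("semantic-partition-sketch").master("local[*]").getOrCreate()
    val triples = spark.sparkContext.parallelize(Seq(
      Triple("Joy", ":owns", "Car1"), Triple("Joy", ":livesIn", "Bonn"),
      Triple("Car1", ":typeOf", "Car"), Triple("Car1", ":madeBy", "Audi"),
      Triple("Car1", ":madeIn", "Ingolstadt")
    ))
    semanticPartition(triples).collect().foreach(println)
    // e.g. "Joy :owns Car1 :livesIn Bonn" and "Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt"
    spark.stop()
  }
}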
Experimental Setup
- Cluster configuration
- 6 machines (1 master, 5 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
(32 Cores), 128 GB RAM, 12 TB SATA RAID-5, Spark-2.4.0, Hadoop 2.8.0, Scala
2.11.11 and Java 8
- Datasets (all in nt format)
- Distributed SPARQL query evaluators we compare with:
- SHARD, SPARQLGX-SDE, and Sparklify
Evaluation
48
                 LUBM-1K       LUBM-2K       LUBM-3K       WatDiv-10M   WatDiv-100M
#nr. of triples  138,280,374   276,349,040   414,493,296   10,916,457   108,997,714
size (GB)        24            49            70            1.5          15
Evaluation
49
Runtime (s) (mean)
Queries    SHARD    SPARQLGX-SDE    SANSA.Sparklify    SANSA.Semantic
WatDiv-10M
C3         n/a      38.79           72.94              90.48
F3         n/a      38.41           74.69              n/a
L3         n/a      21.05           73.16              72.84
S3         n/a      26.27           70.1               79.7
WatDiv-100M
C3         n/a      181.51          96.59              300.82
F3         n/a      162.86          91.2               n/a
L3         n/a      84.09           82.17              189.89
S3         n/a      123.6           93.02              176.2
Evaluation
50
Runtime (s) (mean), LUBM-1K
Queries    SHARD     SPARQLGX-SDE    SANSA.Sparklify    SANSA.Semantic
Q1         774.93    103.74          103.57             226.21
Q2         fail      fail            3348.51            329.69
Q3         772.55    126.31          107.25             235.31
Q4         988.28    182.52          111.89             294.8
Q5         771.69    101.05          100.37             226.21
Q6         fail      73.05           100.72             207.06
Q7         fail      160.94          113.03             277.08
Q8         fail      179.56          114.83             309.39
Q9         fail      204.62          114.25             326.29
Q10        780.05    106.26          110.18             232.72
Q11        783.2     112.23          105.13             231.36
Q12        fail      159.65          105.86             283.53
Q13        778.16    100.06          90.87              220.28
Q14        688.44    74.64           100.58             204.43
Performance evaluation of Semantic-based approach
Evaluation
51
Figures: node scalability (LUBM-1K) and sizeup scalability
Powered By
Project and Organizations using our proposed
approaches
52
53
<https://aleth.io/>
Blockchain – Alethio
Use Case
Alethio is using SANSA in order to
perform large-scale batch
analytics, e.g. computing the
asset turnover for sets of
accounts, computing attack
pattern frequencies and Opcode
usage statistics. SANSA was run
on a 100 node cluster with 400
cores
<https://www.big-data-europe.eu/>
Big Data Platform –
BDE
Within the BDE platform, the Mu
Swarm Logger service detects
Docker events and converts their
representation to RDF. SANSA is
then used to compute statistics
over these logs; to generate
visualisations of the log
statistics, BDE calls
DistLODStats from
SANSA-Notebooks
<http://slipo.eu/>
Categorizing Areas of Interest (AOI)
SLIPO focuses on designing
efficient pipelines dealing with
large semantic datasets of POIs.
In this project, Sparklify is used
through the SANSA query layer
to refine, filter and select the
relevant POIs which are needed
by the pipelines
10+ more use cases
http://sansa-stack.net/powered-by/
Powered By
The Hubs and Authorities Transaction
Network Analysis
54
Figure: EthOn RDF triples are read from Amazon S3 buckets into the SANSA engine (data ingestion, data partitioning, SPARQL querying); PageRank, connected components and hubs & authorities analyses yield top accounts, hubs & authorities and wallet/exchange behaviour, visualised with Databricks notebooks or SANSA notebooks
More than 18,000,000,000 facts*
*https://medium.com/alethio/ethereum-linked-data-b72e6283812f
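For the graph-analytics part of such a pipeline, a hedged GraphX sketch is shown below; the toy account IDs and transfer edges are invented, and this is not Alethio's actual code.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object TransactionGraphSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tx-graph-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy transaction graph: vertices are accounts, edges are transfers (invented data)
    val accounts = sc.parallelize(Seq((1L, "0xaaa"), (2L, "0xbbb"), (3L, "0xccc")))
    val transfers = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 0.5), Edge(3L, 1L, 2.0)))
    val graph = Graph(accounts, transfers)

    // PageRank highlights influential accounts; connected components group related wallets
    val ranks = graph.pageRank(tol = 0.0001).vertices
    val components = graph.connectedComponents().vertices

    ranks.join(accounts).sortBy(_._2._1, ascending = false).take(3).foreach(println)
    components.join(accounts).take(3).foreach(println)

    spark.stop()
  }
}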
Analyze game performance and customer behaviors at scale
Profiting from Kitties on Ethereum
55
Pipe different clustering algorithms at once
Scalable Integration of Big POI Data
56
Pipeline: RDF POI data → preprocessing → SPARQL filtering → word embedding → semantic clustering and geo clustering

POI_ID   Cat1   Cat2
1        0      1
2        1      0
3        0      1
4        1      1
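A hedged Spark MLlib sketch (not the SLIPO/SANSA pipeline itself) clustering POIs from a one-hot category matrix like the one above with k-means; the column names and the choice of k are assumptions made for the sketch.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object PoiClusteringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("poi-clustering-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // One-hot encoded POI categories, as in the table above
    val pois = Seq((1, 0.0, 1.0), (2, 1.0, 0.0), (3, 0.0, 1.0), (4, 1.0, 1.0))
      .toDF("POI_ID", "Cat1", "Cat2")

    // Assemble the category columns into a feature vector and cluster with k-means
    val features = new VectorAssembler()
      .setInputCols(Array("Cat1", "Cat2"))
      .setOutputCol("features")
      .transform(pois)

    val model = new KMeans().setK(2).setSeed(42L).fit(features)
    model.transform(features).select("POI_ID", "prediction").show()

    spark.stop()
  }
}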
Conclusion and Future
Directions
57
RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
- First algorithm for computing RDF dataset statistics at scale using
Apache Spark
- An analysis of the complexity of the computational steps and the
data exchange between nodes in the cluster
- Integrated the approach into the SANSA framework
- A REST Interface for triggering RDF statistics calculation
Review of the Contributions
58
RQ2: Can we scale RDF dataset quality assessment horizontally?
- A Quality Assessment Pattern (QAP) to characterize scalable quality
metrics
- A distributed (open source) implementation of quality metrics using
Apache Spark
- Analysis of the complexity of the metric evaluation
- An evaluation of our approach, demonstrating empirically its
superiority over a previous centralized approach
- Integrated the approach into the SANSA framework
Review of the Contributions
59
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
- A novel approach for vertical partitioning including RDF terms, and a
scalable query system (Sparklify) using a SPARQL-to-SQL rewriter on
top of Apache Spark
- A scalable semantic-based partitioning and semantic-based query
engine (SANSA.Semantic) on top of Apache Spark
- An evaluation of the proposed approaches against state-of-the-art
engines, demonstrating their performance empirically
- Integrated the approaches into the SANSA framework
Review of the Contributions
60
Large-scale RDF Dataset Statistics
- Our approach is purely batch processing, in which the data chunks
are normally very large; therefore we plan to investigate additional
techniques for lowering the network overhead and I/O footprint, e.g.
HDT compression
- Near real-time computation of RDF dataset statistics using Spark
Streaming
Limitations and Future Directions
61
Assessment of RDF Datasets at Scale
- Intelligent partitioning strategies and dependency analysis in order
to evaluate multiple metrics simultaneously
- Real-time interactive quality assessment of large-scale RDF data
using Spark Streaming
- A declarative plugin using Quality Metric Language (QML), with the
ability to express, customize and enhance quality metrics
- Quality Assessment As a Service
- Quality check over LODStats
Limitations and Future Directions
62
Scalable RDF Querying
- Combine OBDA tools with dictionary encoding of RDF terms as
integers and evaluate the effects
- Extend our parser to support more SPARQL fragments, and add
statistics to the query engine while evaluating queries
- Investigate the re-ordering of BGPs and evaluate the effects on
query execution time
- Consider other management operations (additions, updates,
deletions), e.g. Delta Lake as an alternative storage layer that
brings ACID transactions to RDF data management solutions
Limitations and Future Directions
63
Adaptive Distributed RDF Querying
- Optimize index structures and distribute data based on anticipated
query workloads of particular inference or ML algorithms
Efficient Recommendation System for RDF Partitioners
- A recommender to suggest the “best partitioner” for our SPARQL
query evaluators based on the structure of the data (statistics)
A Powerful Benchmarking Suite
Limitations and Future Directions
64
With the increasing amount of RDF data, processing large-scale RDF
datasets constantly faces new challenges
We have shown the benefits of using distributed computing frameworks
for scalable and efficient processing of RDF datasets
Future research can build upon the contributions presented in this
thesis towards comprehensive scalable processing of RDF datasets
The main contributions of this thesis have been integrated into the
SANSA framework, making an impact on the Semantic Web community
Closing Remarks
65
66
@Gezim_Sejdiu
https://gezimsejdiu.github.io/
That’s all folks
>> SANSA: https://github.com/SANSA-Stack
[1]. Distributed Semantic Analytics using the SANSA Stack. Jens Lehmann; Gezim Sejdiu; Lorenz Bühmann; Patrick
Westphal; Claus Stadler; Ivan Ermilov; Simon Bin; Nilesh Chakraborty; Muhammad Saleem; Axel-Cyrille Ngonga Ngomo;
and Hajira Jabeen. In Proceedings of 16th International Semantic Web Conference - Resources Track (ISWC'2017), 2017.
[2]. DistLODStats: Distributed Computation of RDF Dataset Statistics. Gezim Sejdiu; Ivan Ermilov; Jens Lehmann; and
Mohamed Nadjib-Mami. In Proceedings of 17th International Semantic Web Conference, 2018.
[3]. A Scalable Framework for Quality Assessment of RDF Datasets. Gezim Sejdiu; Anisa Rula; Jens Lehmann; and Hajira
Jabeen. In Proceedings of 18th International Semantic Web Conference, 2019.
[4]. Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets.
Claus Stadler; Gezim Sejdiu; Damien Graux; and Jens Lehmann. In Proceedings of 18th International Semantic Web
Conference, 2019.
[5]. Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation. Gezim Sejdiu; Damien
Graux; Imran Khan; Ioanna Lytra; Hajira Jabeen; and Jens Lehmann. In 15th International Conference on Semantic
Systems (SEMANTiCS), 2019.
References
67
Backup slides
68
SPARQL is a standard query language for retrieving and manipulating
RDF data
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?hq ?location
WHERE {
dbr:Deutsche_Post foaf:name ?name.
dbr:Deutsche_Post dbo:location ?hq.
?hq foaf:name ?location.
}
Querying Knowledge Graphs
69
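A minimal Scala sketch (for illustration, not part of the thesis code) that executes the query above against the public DBpedia SPARQL endpoint using Apache Jena; endpoint availability and the returned bindings are not guaranteed.

import org.apache.jena.query.{QueryExecutionFactory, QueryFactory}

object SparqlQueryExample {
  def main(args: Array[String]): Unit = {
    val queryString =
      """PREFIX dbr: <http://dbpedia.org/resource/>
        |PREFIX dbo: <http://dbpedia.org/ontology/>
        |PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        |SELECT ?name ?hq ?location
        |WHERE {
        |  dbr:Deutsche_Post foaf:name ?name .
        |  dbr:Deutsche_Post dbo:location ?hq .
        |  ?hq foaf:name ?location .
        |}""".stripMargin

    val query = QueryFactory.create(queryString)
    // Remote execution against the public endpoint (assumes network access)
    val qexec = QueryExecutionFactory.sparqlService("https://dbpedia.org/sparql", query)
    try {
      val results = qexec.execSelect()
      while (results.hasNext) {
        val row = results.next()
        println(s"${row.get("name")} | ${row.get("hq")} | ${row.get("location")}")
      }
    } finally qexec.close()
  }
}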
Over the last years, the size of the Semantic Web has increased and
several large-scale datasets were published
> As of March 2019
~10,000 datasets
Openly available online
using Semantic Web standards
+ many datasets
RDFized and kept private
Motivation
70
Source: LOD-Cloud (http://lod-cloud.net/ )
Speedup Ratio and Efficiency of DistLODStats
Evaluation
71
Overall Breakdown of DistLODStats by Criterion Analysis (log scale)
Evaluation
72
STATisfy: A REST Interface for DistLODStats
73
Figure: collaborative analytics services and a marketplace call a REST server (STATisfy), which triggers SANSA DistLODStats on the BigDataEurope cluster (standalone resource manager with a master and workers 1..n)
A QAP consists of transformations and actions
- Transformation: a rule set or a union/intersection of transformations
- Rule: defines conditional criteria for a triple, e.g. isIRI()
- Filter: retrieves a subset of an RDF triple, e.g. getPredicates
- Shortcuts ?s, ?p, ?o are frequently used for filters
- Action: maps a triple set to a numerical value, e.g. count(r)
Quality Assessment Patterns (QAPs)
74
Metric: External Linkage
Transformation τ:
  r_1 = isIRI(?s) ∩ internal(?s) ∩ isIRI(?o) ∩ external(?o)
  r_2 = isIRI(?s) ∩ external(?s) ∩ isIRI(?o) ∩ internal(?o)
  r_3 = r_1 ∪ r_2
Action α:
  α_1 = count(r_3)
  α_2 = count(triples)
  α = α_1 / α_2
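A hedged Scala sketch of the External Linkage metric above, assuming a simplified Triple type with IRI flags and a hypothetical isInternal helper based on the dataset's base URI; this is an illustration, not the DistQualityAssessment implementation.

import org.apache.spark.rdd.RDD

object ExternalLinkageSketch {
  // Simplified triple type; sIsIRI / oIsIRI flag whether the terms are IRIs
  case class Triple(s: String, p: String, o: String, sIsIRI: Boolean, oIsIRI: Boolean)

  // Hypothetical helper: a resource is "internal" if it starts with the dataset's base URI
  def isInternal(term: String, baseUri: String): Boolean = term.startsWith(baseUri)

  def externalLinkage(triples: RDD[Triple], baseUri: String): Double = {
    val r1 = triples.filter(t => t.sIsIRI && isInternal(t.s, baseUri) && t.oIsIRI && !isInternal(t.o, baseUri))
    val r2 = triples.filter(t => t.sIsIRI && !isInternal(t.s, baseUri) && t.oIsIRI && isInternal(t.o, baseUri))
    val a1 = r1.union(r2).count().toDouble      // count(r_3), with r_3 = r_1 ∪ r_2
    val a2 = triples.count().toDouble           // count(triples)
    if (a2 == 0) 0.0 else a1 / a2               // α = α_1 / α_2
  }
}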
Overall analysis of DistQualityAssessment by metric in the cluster mode
(log scale)
Evaluation
75
Overall analysis of queries on LUBM-1K dataset (cluster mode) using
Semantic-based approach
Evaluation
76
