Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases
1. Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases
Simpósio Brasileiro de Banco de Dados (SBBD)
Uberlândia, October 2017
Luiz Henrique Zambom Santana
Prof. Dr. Ronaldo dos Santos Mello
2. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
3. Introduction: Motivation
● Since the Semantic Web proposal in 2001, many advances have been introduced by the W3C
● RDF and SPARQL are currently widespread:
○ Best Buy:
■ http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html
○ Globo.com:
■ https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013
○ US data.gov:
■ https://www.data.gov/developers/semantic-web
5. Introduction: Motivation
● Research problem
○ Storing/querying large RDF graphs
■ No single node can handle the complete graph
■ Native RDF storage cannot scale to current data requirements
■ Inter-partition joins are very costly
● Research hypothesis
○ A workload-aware approach based on distributed polyglot NoSQL persistence could be a good solution
6. Rendezvous
● Triplestore implemented as middleware for storing massive RDF graphs into multiple NoSQL databases
● Novel data partitioning approach
● Fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
● Caching structure that accelerates the query response
7. Introduction: Contributions
● Mapping of RDF to columnar, document, and key/value NoSQL models;
● A workload-aware partitioner based on the current graph structure and, mainly, on the typical application workload;
● A caching schema based on key/value databases for speeding up the query response time;
● An experimental evaluation that compares the current version of our approach against two baselines, including ScalaRDF (Hu et al., 2016), by considering Redis, Apache Cassandra, and MongoDB, the most popular key/value, columnar, and document NoSQL databases, respectively.
8. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
10. Background: NoSQL
● No SQL interface
● No ACID transactions
● Very scalable
● Schemaless
https://db-engines.com/en/ranking
11. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
● Schedule
12. State of the Art - Triplestores
Triplestore | Frag. | Replication | Partitioning | Model | In-memory | Workload-aware
Hexastore (2008) | No | No | No | Native | No | No
SW-Store (2009) | No | No | Vertical | SQL | No | No
CumulusRDF (2011) | No | No | Vertical | Columnar (Cassandra) | No | No
SPOVC (2012) | No | No | Horizontal | Columnar (MonetDB) | No | No
WARP (2013) | Yes | N-hop replication on partition boundary | Hash | Native | No | Dynamic
Rainbow (2015) | No | No | Hash | Polyglot | K/V cache | Static
ScalaRDF (2016) | No | Next-hop | Hash | Polyglot | K/V cache | No
Rendezvous | Yes | N-hop replication on fragments and on partition boundary | Vertical and horizontal | Polyglot | K/V and local cache | Dynamic
(The last row highlights the key differentials of Rendezvous.)
13. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
● Schedule
14. Rendezvous
● Triplestore implemented as middleware for storing massive RDF graphs into multiple NoSQL databases
● Novel data partitioning approach
● Fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
● Caching structure that accelerates the query response
17. Workload awareness
Given the graph:

[Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and predicates p1-p11]

If the following queries are issued:

SELECT ?x WHERE {
  B p2 C .
  C p3 ?x
}

SELECT ?x WHERE {
  F p6 G .
  F p9 L .
  F p8 ?x
}
The Dataset Characterizer records the resulting pattern shapes:

Star-shaped (indexed by the subject/object):
  F → {F p6 G, F p9 L, F p8 ?}
Chain-shaped (indexed by the predicate):
  p3 → {B p2 C, C p3 ?}
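A minimal sketch of how this characterizer state could be kept (class and method names here are illustrative, not the actual Rendezvous implementation):

```java
import java.util.*;

// Illustrative Dataset Characterizer state: star-shaped patterns are
// keyed by subject/object, chain-shaped patterns by predicate.
public class DatasetCharacterizer {
    // e.g. "F" -> {"F p6 G", "F p9 L", "F p8 ?"}
    private final Map<String, Set<String>> starShaped = new HashMap<>();
    // e.g. "p3" -> {"B p2 C", "C p3 ?"}
    private final Map<String, Set<String>> chainShaped = new HashMap<>();

    public void recordStar(String subjectOrObject, String pattern) {
        starShaped.computeIfAbsent(subjectOrObject, k -> new HashSet<>()).add(pattern);
    }

    public void recordChain(String predicate, String pattern) {
        chainShaped.computeIfAbsent(predicate, k -> new HashSet<>()).add(pattern);
    }

    // Used later to decide between star and chain fragmentation.
    public boolean tendsToStar(String node) { return starShaped.containsKey(node); }
    public boolean tendsToChain(String predicate) { return chainShaped.containsKey(predicate); }
}
```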
19. Star Fragmentation (n-hop expansion)
Given the graph and this state of the Dataset Characterizer:

[Figure: example RDF graph, plus the incoming triple F p10 C]

Dataset Characterizer state:
  Chain-shaped: p3 → {B p2 C, C p3 ?}
  Star-shaped: F → {F p6 G, F p9 L, F p8 ?}
F tends to appear in star queries with diameter 1, so we expand the triple F p10 C to a 1-hop fragment.
[Figure: 1-hop fragment around F, connecting B, C, G, H, I, and L via p5-p10]

The triple F p10 C will be stored as part of this fragment.
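A minimal sketch of the 1-hop expansion around F (illustrative; the actual lookup uses the S_PO/O_SP indexes presented later):

```java
import java.util.*;

// Illustrative 1-hop star expansion: given a new triple (F, p10, C),
// collect every stored triple whose subject or object is F.
public class StarExpansion {
    public static List<String[]> oneHopFragment(String node, List<String[]> triples) {
        List<String[]> fragment = new ArrayList<>();
        for (String[] t : triples) {
            if (t[0].equals(node) || t[2].equals(node)) fragment.add(t);
        }
        return fragment;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
            new String[]{"B", "p5", "F"}, new String[]{"F", "p6", "G"},
            new String[]{"F", "p7", "I"}, new String[]{"F", "p8", "H"},
            new String[]{"F", "p9", "L"}, new String[]{"A", "p1", "B"});
        // Expanding the new triple F p10 C around node F yields 5 neighboring triples.
        System.out.println(oneHopFragment("F", triples).size() + " neighboring triples");
    }
}
```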
20. Star Fragmentation (mapping)
With the expanded fragment, we translate it to a JSON document and store it in the document database:

[Figure: the expanded 1-hop fragment around F]

{
  subject: F,
  p6: G,
  p7: I,
  p8: H,
  p10: C,
  p9: L,
  p5: {
    object: B
  }
}
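A minimal sketch of how this document could be persisted with a recent MongoDB Java driver (the database and collection names are illustrative assumptions):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class StarFragmentWriter {
    public static void main(String[] args) {
        // Connect to the document database that hosts the target partition.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> partition =
                client.getDatabase("rendezvous").getCollection("partition2");
            // The star fragment around F, as a single document keyed by its subject.
            Document fragment = new Document("subject", "F")
                .append("p6", "G").append("p7", "I").append("p8", "H")
                .append("p10", "C").append("p9", "L")
                .append("p5", new Document("object", "B"));
            partition.insertOne(fragment);
        }
    }
}
```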
21. Chain Fragmentation (n-hop expansion)
Given the graph and this state of the Dataset Characterizer:

[Figure: example RDF graph, plus the incoming triple C p3 G]

Dataset Characterizer state:
  Chain-shaped: p3 → {B p2 C, C p3 ?}
  Star-shaped: F → {F p6 G, F p9 L, F p8 ?}
p3 tends to appear in chain queries with max diameter 1, so we expand the triple C p3 G to a 1-hop fragment.
[Figure: 1-hop fragment around the p3 chain: B p2 C, C p3 D, C p3 G, F p6 G]

The triple C p3 G will be stored as part of this fragment.
22. Chain Fragmentation (mapping)
With the expanded fragment, we translate it to a set of columnar tables (one table per predicate) and store them in the columnar database:

Table p2 (Subj, Obj): (B, C)
Table p3 (Subj, Obj): (C, D), (C, G)
Table p6 (Subj, Obj): (F, G)
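A minimal sketch of how such predicate tables could be created and populated with the DataStax Java driver 3.x (the keyspace name "rendezvous" is an illustrative assumption):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ChainFragmentWriter {
    public static void main(String[] args) {
        // Connect to the Cassandra node that hosts the target partition.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("rendezvous")) {
            // One table per predicate; the chain join p2 -> p3 then becomes
            // a lookup of p3 rows whose subject equals a p2 object.
            session.execute("CREATE TABLE IF NOT EXISTS p3 "
                + "(subj text, obj text, PRIMARY KEY (subj, obj))");
            session.execute("INSERT INTO p3 (subj, obj) VALUES ('C', 'D')");
            session.execute("INSERT INTO p3 (subj, obj) VALUES ('C', 'G')");
        }
    }
}
```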
24. Indexing
Each new triple is indexed by the subject and by the object:

S_PO index: F → {p10 C}, C → {p3 G}, ...
O_SP index: C → {F p10}, G → {C p3}, ...

These indexes help on triple expansion and solve simple queries like:
SELECT ?x WHERE { F p10 ?x }
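A minimal in-memory sketch of these two indexes (illustrative; in Rendezvous they are persisted in the NoSQL layer):

```java
import java.util.*;

// Illustrative S_PO and O_SP indexes over incoming triples.
public class TripleIndexer {
    // S_PO: subject -> set of "predicate object" remainders
    private final Map<String, Set<String>> spo = new HashMap<>();
    // O_SP: object -> set of "subject predicate" remainders
    private final Map<String, Set<String>> osp = new HashMap<>();

    public void index(String s, String p, String o) {
        spo.computeIfAbsent(s, k -> new HashSet<>()).add(p + " " + o);
        osp.computeIfAbsent(o, k -> new HashSet<>()).add(s + " " + p);
    }

    // Solves simple patterns like: SELECT ?x WHERE { F p10 ?x }
    public Set<String> objectsOf(String s, String p) {
        Set<String> result = new HashSet<>();
        for (String po : spo.getOrDefault(s, Collections.emptySet())) {
            String[] parts = po.split(" ", 2);
            if (parts[0].equals(p)) result.add(parts[1]);
        }
        return result;
    }
}
```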
26. Partitioning
[Figure: example RDF graph split into partitions P1, P2, and P3, hosted by columnar and document databases]

If a graph is bigger than a single server's capabilities, the Rendezvous DBA can create multiple partitions.

Each NoSQL server can hold one or more partitions, and each partition resides in only one server.
27. Partitioning
Rendezvous manages the partitions by recording them in a dictionary (fragments hash):

Fragment | Size | Partitions
(F p10 C) | 2 | {P1, P2}
(C p3 D) | 2 | {P3}
(L p12 H) | 1 | {P2}

P1 elements (S, P, O): (A, p1, B), (F, p10, C), ...
P2 elements (S, P, O): (F, p10, C), (L, p12, H), ...
P3 elements (S, P, O): (C, p3, D), ...

[Figure: the partitioned graph and the NoSQL databases (columnar and document) hosting partitions P1...Pn]
28. Partitioning (boundary replication)
[Figure: same dictionary and partition layout as in the previous slide]

If a triple is on the edge of two partitions, it will be replicated in both partitions. The size of this boundary is defined by the DBA.
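A minimal sketch of the replication rule (illustrative; the in-memory maps stand in for the NoSQL partitions):

```java
import java.util.*;

// Illustrative boundary replication: if the fragment containing a triple
// spans more than one partition, the triple is written to all of them.
public class BoundaryReplication {
    public static void store(String[] triple, Set<String> fragmentPartitions,
                             Map<String, List<String[]>> partitionStore) {
        for (String partition : fragmentPartitions) {
            partitionStore.computeIfAbsent(partition, k -> new ArrayList<>()).add(triple);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String[]>> store = new HashMap<>();
        // (F p10 C) belongs to a fragment on the boundary of P1 and P2,
        // so it is replicated in both, as in the dictionary above.
        store(new String[]{"F", "p10", "C"},
              new HashSet<>(Arrays.asList("P1", "P2")), store);
        System.out.println(store.keySet()); // e.g. [P1, P2]
    }
}
```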
29. Partitioning (Data placement)
[Figure: same dictionary and partition layout as in the previous slides]

The fragments hash helps on data placement: based on the triple and the size of the fragment, Rendezvous finds the best partition to store the triple.
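A minimal sketch of this placement decision (illustrative names; the real dictionary is persisted, and the tie-breaking policy shown here is an assumption):

```java
import java.util.*;

// Illustrative workload-aware data placement using the fragments hash.
public class Placement {
    static class Entry { int size; Set<String> partitions = new HashSet<>(); }
    private final Map<String, Entry> dictionary = new HashMap<>();
    private final Map<String, Integer> partitionLoad = new HashMap<>();

    // Prefer a partition that already holds part of the triple's fragment
    // (avoiding inter-partition joins); otherwise pick the least loaded one.
    public String place(String fragmentKey) {
        Entry e = dictionary.get(fragmentKey);
        if (e != null && !e.partitions.isEmpty()) {
            return e.partitions.iterator().next();
        }
        return partitionLoad.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no partitions"));
    }
}
```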
31. Querying evaluation
Given the graph:

[Figure: example RDF graph partitioned into P1, P2, and P3]

If the following query is issued:

Q: SELECT ?x WHERE {
  ?w p6 G .
  ?w p7 I .
  ?w p8 H .
  ?x p1 ?y .
  ?y p2 ?z .
  ?z p3 ?w
}

Rendezvous:
1. Searches for:
  1.1. Simple queries
  1.2. Star queries
  1.3. Chain queries
2. Updates the Dataset Characterizer

Star:
Qs: SELECT ?x WHERE {
  ?w p6 G .
  ?w p7 I .
  ?w p8 H
}

Chain:
Qc: SELECT ?x WHERE {
  ?x p1 ?y .
  ?y p2 ?z .
  ?z p3 ?w
}
32. Querying decomposition
Given the graph:

[Figure: example RDF graph partitioned into P1, P2, and P3]

Rendezvous finds the right partition using the dictionary and translates each SPARQL subquery into the final query to be processed by the corresponding NoSQL database.

Star:
Qs: SELECT ?x WHERE {
  ?w p6 G .
  ?w p7 I .
  ?w p8 H
}
translated for the document database:
D: db.partition2.find({
  {p6: {$exists: true}, object: G},
  {p7: {$exists: true}, object: I},
  {p8: {$exists: true}, object: H}
})

Chain:
Qc: SELECT ?x WHERE {
  ?x p1 ?y .
  ?y p2 ?z .
  ?z p3 ?w
}
translated for the columnar database:
Partition 1:
  Cp1: SELECT S1, O1 FROM p1
  Cp2: SELECT S2, O2 FROM p2 WHERE O = S1
Partition 3:
  Cp3: SELECT S3, O3 FROM p3 WHERE O = S2
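A minimal sketch of the star-detection part of the decomposition (illustrative; patterns sharing a subject variable form a star subquery, while the remaining patterns chained object-to-subject would form the chain subquery):

```java
import java.util.*;

// Illustrative decomposition of a basic graph pattern into star groups.
public class QueryDecomposer {
    public static Map<String, List<String[]>> stars(List<String[]> bgp) {
        Map<String, List<String[]>> bySubject = new HashMap<>();
        for (String[] tp : bgp) {
            bySubject.computeIfAbsent(tp[0], k -> new ArrayList<>()).add(tp);
        }
        // Keep only groups with more than one pattern (true stars).
        bySubject.values().removeIf(group -> group.size() < 2);
        return bySubject;
    }

    public static void main(String[] args) {
        List<String[]> bgp = Arrays.asList(
            new String[]{"?w", "p6", "G"}, new String[]{"?w", "p7", "I"},
            new String[]{"?w", "p8", "H"}, new String[]{"?x", "p1", "?y"},
            new String[]{"?y", "p2", "?z"}, new String[]{"?z", "p3", "?w"});
        System.out.println(stars(bgp).keySet()); // prints [?w]
    }
}
```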
34. Caching (two level cache)
Given the graph:

[Figure: example RDF graph partitioned into P1, P2, and P3]

After the last query was issued:

Q: SELECT ?x WHERE {
  ?w p6 G .
  ?w p7 I .
  ?w p8 H .
  ?x p1 ?y .
  ?y p2 ?z .
  ?z p3 ?w .
  ?y p5 ?w
}

Near cache (in-memory tree map):
  A:p1:B → {A:p1:B, B:p2:C}
  B:p2:C → {B:p2:C, C:p3:D}

Remote cache (key/value NoSQL database):
  ...
  A:p1:B → {A:p1:B, B:p2:C}
  B:p2:C → {B:p2:C, C:p3:D}
  ...
  B:p5:F → {B:p5:F, F:p9:D}

Normally, the near cache is smaller than the remote cache.
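A minimal sketch of the two-level lookup, assuming Jedis as the Redis client (host, port, and the triple-key format are illustrative):

```java
import java.util.TreeMap;
import redis.clients.jedis.Jedis;

// Illustrative two-level cache: a small in-process tree map backed by
// a remote key/value store.
public class TwoLevelCache {
    private final TreeMap<String, String> nearCache = new TreeMap<>();
    private final Jedis remoteCache = new Jedis("localhost", 6379);

    // Looks up a triple key like "B:p2:C": check the near cache first,
    // then fall back to the remote cache and promote hits.
    public String get(String tripleKey) {
        String hit = nearCache.get(tripleKey);
        if (hit != null) return hit;
        String remote = remoteCache.get(tripleKey);
        if (remote != null) nearCache.put(tripleKey, remote);
        return remote;
    }
}
```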
35. Caching (querying)
Given the graph:

[Figure: example RDF graph partitioned into P1, P2, and P3]

If the following query is issued:

Q: SELECT ?x WHERE {
  ?x p1 ?y .
  ?y p2 ?z .
  ?z p3 ?w .
  ?y p5 F
}

Near cache (in-memory tree map):
  A:p1:B → {A:p1:B, B:p2:C}
  B:p2:C → {B:p2:C, C:p3:D}

Remote cache (key/value NoSQL database):
  ...
  A:p1:B → {A:p1:B, B:p2:C}
  B:p2:C → {B:p2:C, C:p3:D}
  ...
  B:p5:F → {B:p5:F, F:p9:D}

This query can be solved using only triples from the cache.
36. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
37. Evaluation
● LUBM: an ontology for the university domain, synthetic RDF data scalable to any size, and 14 extensional queries representing a variety of properties
● Generated dataset with 4000 universities (around 100 GB, containing around 500 million triples)
● 12 queries with joins; all of them have at least one subject-subject join, and six of them also have at least one subject-object join
● Apache Jena version 3.2.0 with Java 1.8, Redis 3.2, MongoDB 3.4.3, and Apache Cassandra 3.10
● Amazon m3.xlarge spot instances with 7.5 GB of memory and 1 x 32 GB SSD storage
38. Evaluation: Rendezvous performance
The larger the number of hops (i.e., the replication), the larger (exponentially) the dataset size and the loading time. However, since joins are avoided, the response time decreases.
39. Evaluation: Rendezvous different settings
Performance is better when the partitioning is managed by Rendezvous.
The larger the boundary replication, the faster the response time, without a big impact on the dataset size.
41. Conclusions
● Rendezvous contributes with:
○ A solution to the graph partitioning problem via fragments
○ Better query response time through n-hop and boundary replication
○ Better query response time via two-level caching
○ Scalable RDF storage provided by NoSQL databases
● About the evaluation:
○ Fragments are scalable
○ Bigger boundaries are not necessarily related to bigger storage sizes
○ Graph-aware partitions are better than native NoSQL partitions
○ The near cache is fast, but it makes it harder to keep data consistent
42. Future Work
● Formalize the query mapping
○ There is no standard query language to rely on
● Compression of triples during storage
● Update and delete operations
● Other NoSQL types (e.g., graph)
● More representative datasets
43. Obrigado! (Thank you!)
Simpósio Brasileiro de Banco de Dados (SBBD)
Uberlândia, October 2017
Luiz Henrique Zambom Santana
Prof. Dr. Ronaldo dos Santos Mello
50. State of the Art - SQL Triplestores
WARP, Hexastore, YARS, 4store, SPIDER, RDF-3x, SHARD, SW-Store, SOLID, SPOVC, S2X
51. State of the Art - NoSQL Triplestores
RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF, Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF, H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON, Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk, Bouhali and Laurent, Papailiou et al., and ScalaRDF.
52. State of the Art - Triplestores
Recent survey (September 2017):
Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, Panos Kalnis: A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data. Proceedings of the VLDB Endowment, 10(13): 2049-2060, September 2017.