Workload-Aware RDF Partitioning and
SPARQL Query Caching for Massive RDF
Graphs stored in NoSQL Databases
Simpósio Brasileiro de Banco de Dados (SBBD)
Uberlândia, October 2017
Luiz Henrique Zambom Santana
Prof. Dr. Ronaldo dos Santos Mello
Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
Introduction: Motivation
● Since the Semantic Web proposal in 2001, the W3C has introduced many advances
● RDF and SPARQL are now widespread:
○ Best Buy:
■ http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html
○ Globo.com:
■ https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013
○ US data.gov:
■ https://www.data.gov/developers/semantic-web
Introduction: Motivation (LOD stats)
Introduction: Motivation
● Research problem
○ Storing/querying large RDF graphs
■ No single node can handle the complete graph
■ Native RDF storage cannot scale to current data requirements
■ Inter-partition joins are very costly
● Research hypothesis
○ A workload-aware approach based on distributed polyglot NoSQL persistence could be a good solution
Rendezvous
● Triplestore implemented as middleware for storing massive RDF graphs in multiple NoSQL databases
● Novel data partitioning approach
● Fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
● Caching structure that accelerates query response
Introduction: Contributions
● Mapping of RDF to columnar, document, and key/value NoSQL models;
● A workload-aware partitioner based on the current graph structure and, mainly, on the typical application workload;
● A caching scheme based on key/value databases for speeding up query response time;
● An experimental evaluation that compares the current version of our approach against two baselines, including ScalaRDF (Hu et al.), using Redis, Apache Cassandra, and MongoDB, the most popular key/value, columnar, and document NoSQL databases, respectively.
Background: RDF and SPARQL
Background: NoSQL
● No SQL interface
● No ACID transactions
● Very scalable
● Schemaless
https://db-engines.com/en/ranking
State of the Art - Triplestores
Triplestore | Fragmentation | Replication | Partitioning | Model | In-memory | Workload-aware
Hexastore (2008) | No | No | No | Native | No | No
SW-Store (2009) | No | No | Vertical | SQL | No | No
CumulusRDF (2011) | No | No | Vertical | Columnar (Cassandra) | No | No
SPOVC (2012) | No | No | Horizontal | Columnar (MonetDB) | No | No
WARP (2013) | Yes | N-hop replication on partition boundary | Hash | Native | No | Dynamic
Rainbow (2015) | No | No | Hash | Polyglot | K/V cache | Static
ScalaRDF (2016) | No | Next-hop | Hash | Polyglot | K/V cache | No
Rendezvous | Yes | N-hop replication on fragment and on partition boundary | V and H | Polyglot | K/V and local cache | Dynamic
Key differentials
Rendezvous: Architecture
[Figure: Rendezvous architecture, split into workload awareness and the middleware core]
Workload awareness
Given the graph:
[Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M linked by predicates p1 to p9 and p11]
If the following queries are issued:
SELECT ?x WHERE {
B p2 C .
C p3 ?x
}
SELECT ?x WHERE {
F p6 G .
F p9 L .
F p8 ?x
}
The Dataset Characterizer records:
Star-shaped (indexed by the subject/object)
... ...
F {F p6 G, F p9 L, F p8 ?}
Chain-shaped (indexed by the predicate)
... ...
p3 {B p2 C, C p3 ?}
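The bookkeeping above can be sketched as follows. This is a minimal illustration, not the Rendezvous implementation: it assumes triple patterns are (subject, predicate, object) tuples, that variables start with "?", that star entries are keyed by the shared subject, and that chain entries are keyed by the predicate of each pattern holding a variable. The names `classify` and `DatasetCharacterizer` are hypothetical.

```python
from collections import defaultdict


def is_var(term):
    """A term is treated as a variable when it starts with '?' (assumption)."""
    return term.startswith("?")


def classify(patterns):
    """Classify a list of triple patterns as 'simple', 'star', 'chain', or 'other'."""
    if len(patterns) < 2:
        return "simple"
    if len({s for s, _, _ in patterns}) == 1:
        return "star"  # every pattern shares the same subject
    if all(a[2] == b[0] for a, b in zip(patterns, patterns[1:])):
        return "chain"  # each object is the subject of the next pattern
    return "other"


class DatasetCharacterizer:
    """Remembers which nodes show up in star queries and which predicates in chain queries."""

    def __init__(self):
        self.star = defaultdict(set)   # shared subject -> patterns seen
        self.chain = defaultdict(set)  # predicate -> patterns seen

    def record(self, patterns):
        shape = classify(patterns)
        if shape == "star":
            self.star[patterns[0][0]].update(patterns)
        elif shape == "chain":
            for s, p, o in patterns:
                if is_var(s) or is_var(o):
                    self.chain[p].update(patterns)
        return shape
```

Recording the two queries from this slide would file the first under `p3` in the chain table and the second under `F` in the star table.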
Rendezvous: Storing
● Fragmentation and Mapping
● Indexing
● Partitioning
Star Fragmentation (n-hop expansion)
Given the graph, the new triple F p10 C, and this state of the Dataset Characterizer:
[Figure: example RDF graph with the new edge F p10 C]
Chain-shaped
... ...
p3 {B p2 C, C p3 ?}
Star-shaped
... ...
F {F p6 G, F p9 L, F p8 ?}
F tends to be in star queries with diameter 1, so we expand the triple F p10 C to a 1-hop fragment:
[Figure: 1-hop fragment centered on F, with nodes B, C, G, H, I, L and predicates p5, p6, p7, p8, p9, p10]
F p10 C will be stored.
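A minimal sketch of the n-hop expansion, assuming the graph is available as a list of (subject, predicate, object) tuples and that edges are followed in both directions. The function name `expand_star` and the edge list in the usage note are illustrative readings of the figure, not taken from the Rendezvous code.

```python
def expand_star(triples, center, hops=1):
    """Collect every triple reachable within `hops` edges of `center`."""
    visited, frontier, fragment = {center}, {center}, set()
    for _ in range(hops):
        reached = set()
        for s, p, o in triples:
            if s in frontier or o in frontier:
                fragment.add((s, p, o))
                reached.update((s, o))
        frontier = reached - visited  # only newly discovered nodes are expanded next
        visited |= reached
    return fragment


# Edges as read from the example figure (assumption), plus the new triple F p10 C.
graph = [
    ("A", "p1", "B"), ("B", "p2", "C"), ("C", "p3", "D"),
    ("B", "p5", "F"), ("F", "p6", "G"), ("F", "p7", "I"),
    ("F", "p8", "H"), ("F", "p9", "L"), ("F", "p10", "C"),
]
fragment = expand_star(graph, "F", hops=1)
```

With `hops=1` the fragment holds only the six triples that touch F; raising `hops` pulls in neighbours of neighbours, which is why the fragment size grows quickly with the hop count.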
Star Fragmentation (mapping)
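With the expanded fragment:
[Figure: 1-hop fragment centered on F, with nodes B, C, G, H, I, L and predicates p5, p6, p7, p8, p9, p10]
We translate it to a JSON document, stored in the document database:
{
  "subject": "F",
  "p6": "G",
  "p7": "I",
  "p8": "H",
  "p10": "C",
  "p9": "L",
  "p5": {
    "object": "B"
  }
}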
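The mapping from a star fragment to a document can be sketched like this. It is an illustration under assumptions: outgoing edges become `predicate: object` fields, and the single incoming edge is nested under its predicate as on the slide. The function name `star_to_document` is hypothetical, and a predicate that occurs more than once would need a list value, which this sketch does not handle.

```python
def star_to_document(center, fragment):
    """Build a JSON-like dict for the star fragment centered on `center`."""
    doc = {"subject": center}
    for s, p, o in sorted(fragment):
        if s == center:
            doc[p] = o                # outgoing edge: predicate -> object
        elif o == center:
            doc[p] = {"object": s}    # incoming edge, nested as on the slide
    return doc


star_fragment = {
    ("B", "p5", "F"), ("F", "p6", "G"), ("F", "p7", "I"),
    ("F", "p8", "H"), ("F", "p9", "L"), ("F", "p10", "C"),
}
document = star_to_document("F", star_fragment)
```

Keeping the whole star in one document is what lets a star query be answered by a single lookup, without joins.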
Chain Fragmentation (n-hop expansion)
Given the graph, the new triple C p3 G, and this state of the Dataset Characterizer:
[Figure: example RDF graph with the new edge C p3 G]
Chain-shaped
... ...
p3 {B p2 C, C p3 ?}
Star-shaped
... ...
F {F p6 G, F p9 L, F p8 ?}
p3 tends to be in chain queries with max-diameter 1, so we expand the triple C p3 G to a 1-hop fragment:
[Figure: 1-hop fragment with nodes B, C, D, F, G and predicates p2, p3, p6]
C p3 G will be stored.
Chain Fragmentation (mapping)
With the expanded fragment:
[Figure: 1-hop fragment with nodes B, C, D, F, G and predicates p2, p3, p6]
We translate it to a set of tables in the columnar database, one per predicate:
p2
Subj Obj
B C
p3
Subj Obj
C D
C G
p6
Subj Obj
F G
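A short sketch of this mapping, assuming one table per predicate with a subject column and an object column. The function name `chain_to_tables` is hypothetical, and plain Python lists stand in for the columnar tables.

```python
from collections import defaultdict


def chain_to_tables(fragment):
    """Group the triples of a chain fragment into one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in sorted(fragment):
        tables[p].append((s, o))
    return dict(tables)


chain_fragment = {("B", "p2", "C"), ("C", "p3", "D"), ("C", "p3", "G"), ("F", "p6", "G")}
tables = chain_to_tables(chain_fragment)
```

A chain query then walks from table to table, matching the object column of one predicate against the subject column of the next.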
Indexing
The Indexer indexes each new triple by its subject and by its object:
S_PO
...
F {p10 C}
C {p3 G}
O_SP
...
C {F p10}
G {C p3}
The indexes help with triple expansion and with solving simple queries such as:
SELECT ?x WHERE { F p10 ?x }
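The two indexes can be sketched with plain dictionaries. This is an in-memory illustration only; the class name `Indexer` follows the slide label, while the method names `add` and `objects` are hypothetical.

```python
from collections import defaultdict


class Indexer:
    """Keeps S_PO (subject -> predicate/object pairs) and O_SP (object -> subject/predicate pairs)."""

    def __init__(self):
        self.s_po = defaultdict(set)
        self.o_sp = defaultdict(set)

    def add(self, s, p, o):
        self.s_po[s].add((p, o))
        self.o_sp[o].add((s, p))

    def objects(self, s, p):
        """Answer a simple query of the form: SELECT ?x WHERE { s p ?x }."""
        return {o for pred, o in self.s_po.get(s, ()) if pred == p}


index = Indexer()
index.add("F", "p10", "C")
index.add("C", "p3", "G")
```

Here `index.objects("F", "p10")` answers the simple query from the slide directly from S_PO, without touching the NoSQL databases.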
Partitioning
[Figure: example RDF graph divided into partitions P1, P2, and P3, which are distributed over a columnar database and a document database]
If a graph exceeds the capacity of a single server, the Rendezvous DBA can create multiple partitions.
Each NoSQL server can hold one or more partitions, and each partition resides on exactly one server.
Partitioning
Fragments hash
(F p10 C) Size: 2 {P1, P2}
(C p3 D) Size: 2 {P3}
(L p12 H) Size: 1 {P2}
Dictionary
P1 Elements
S P O
A p1 B
F p10 C
...
P2 Elements
S P O
F p10 C
L p12 H
...
P3 Elements
S P O
C p3 D
...
[Figure: example RDF graph divided into partitions P1, P2, P3, ..., Pn, stored in columnar and document databases]
Rendezvous manages the partitions by recording them in the dictionary.
Partitioning (boundary replication)
[Figure: the fragments hash, the dictionary, and the partitioned example graph; the triple F p10 C appears in both P1 and P2]
If a triple is on the edge between two partitions, it is replicated in both partitions. The size of this boundary is defined by the DBA.
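Boundary replication for the smallest boundary (a single triple) can be sketched as follows. The sketch assumes every node has one home partition, given by the hypothetical mapping `node_partition`; larger boundaries configured by the DBA are not modeled.

```python
from collections import defaultdict


def place_with_boundary_replication(triples, node_partition):
    """Return a dictionary partition -> triples; a triple spanning two partitions is stored in both."""
    dictionary = defaultdict(set)
    for s, p, o in triples:
        for partition in {node_partition[s], node_partition[o]}:
            dictionary[partition].add((s, p, o))
    return dict(dictionary)


# Home partitions chosen for illustration only.
node_partition = {"A": "P1", "B": "P1", "F": "P1", "C": "P2", "L": "P2", "H": "P2"}
dictionary = place_with_boundary_replication(
    [("A", "p1", "B"), ("F", "p10", "C"), ("L", "p12", "H")],
    node_partition,
)
```

The triple F p10 C crosses from P1 to P2, so it ends up in both partitions, and a query that starts in either one can follow that edge without a remote join.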
Partitioning (Data placement)
[Figure: the fragments hash, the dictionary, and the partitioned example graph]
The fragments hash helps with data placement. Based on the triple and the size of the fragment, Rendezvous finds the best partition in which to store a triple.
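The slide does not state how the best partition is chosen, so the following is only one possible policy, written as a sketch under loud assumptions: a fragment that is already in the fragments hash stays where it is, and a new fragment goes to the partition with the lowest load. The function name `place` and the `load` counter are hypothetical.

```python
def place(triple, size, fragments_hash, load):
    """Return the partitions for `triple`, registering it in the fragments hash if it is new.

    Assumed policy (not from the source): reuse known placements,
    otherwise pick the least-loaded partition.
    """
    entry = fragments_hash.get(triple)
    if entry is not None:
        return sorted(entry["partitions"])
    target = min(sorted(load), key=lambda partition: load[partition])
    fragments_hash[triple] = {"size": size, "partitions": {target}}
    load[target] += size
    return [target]


fragments_hash = {("F", "p10", "C"): {"size": 2, "partitions": {"P1", "P2"}}}
load = {"P1": 3, "P2": 4, "P3": 1}
known = place(("F", "p10", "C"), 2, fragments_hash, load)
new = place(("C", "p3", "D"), 2, fragments_hash, load)
```

A workload-aware policy would also weigh which partitions hold the neighbouring fragments; that part is left out here because the source gives no detail on it.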
Rendezvous: Querying
● Query evaluation
● Query decomposition
● Caching
Querying evaluation
Given the graph:
[Figure: example RDF graph divided into partitions P1, P2, and P3]
If the following query is issued:
Q: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H .
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}
Rendezvous:
1. Searches the query for:
1.1. Simple queries
1.2. Star queries
1.3. Chain queries
2. Updates the Dataset Characterizer
Chain:
Qc: SELECT ?x WHERE {
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}
Star:
Qs: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H
}
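The split of a query into star and chain parts can be sketched as below. It is a simplified, order-dependent illustration: patterns that share a subject form a star, and the remaining patterns are linked greedily into chains where one pattern's object is the next pattern's subject. The function name `decompose` is hypothetical, and a leftover single pattern is returned as a chain of length one (a simple query).

```python
from collections import defaultdict


def decompose(patterns):
    """Split triple patterns into star groups and chain groups."""
    by_subject = defaultdict(list)
    for pattern in patterns:
        by_subject[pattern[0]].append(pattern)

    stars = [group for group in by_subject.values() if len(group) > 1]
    rest = [group[0] for group in by_subject.values() if len(group) == 1]

    chains = []
    for pattern in rest:
        for chain in chains:
            if chain[-1][2] == pattern[0]:  # object of the chain tail == subject of this pattern
                chain.append(pattern)
                break
        else:
            chains.append([pattern])
    return stars, chains


query = [
    ("?w", "p6", "G"), ("?w", "p7", "I"), ("?w", "p8", "H"),
    ("?x", "p1", "?y"), ("?y", "p2", "?z"), ("?z", "p3", "?w"),
]
stars, chains = decompose(query)
```

For the query on this slide, the three `?w` patterns come out as the star part and the `p1`, `p2`, `p3` patterns as the chain part.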
Querying decomposition
Given the graph:
[Figure: example RDF graph divided into partitions P1, P2, and P3]
Chain:
Qc: SELECT ?x WHERE {
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}
Star:
Qs: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H
}
Rendezvous finds the right partition using the dictionary and translates the SPARQL query into the final query to be processed by the NoSQL database.
Star, partition 2 (document database):
D: db.partition2.find({ p6: "G", p7: "I", p8: "H" })
Chain, partition 1:
Cp1: SELECT S1, O1 FROM p1
Cp2: SELECT S2, O2 FROM p2 WHERE S2 = O1
Chain, partition 3:
Cp3: SELECT S3, O3 FROM p3 WHERE S3 = O2
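One possible shape for this translation step, written as a sketch that only builds the query objects and does not call any database driver. It assumes the document layout `predicate: object` for stars and one table per predicate for chains; `star_to_mongo_filter` and `chain_to_selects` are hypothetical names, and the SELECT strings are pseudo-SQL in the style of the slide, since the columnar store itself does not evaluate the cross-table condition.

```python
def star_to_mongo_filter(patterns):
    """Build a MongoDB filter document for a star subquery.

    Constant objects become equality conditions; variable objects only
    require the field to exist.
    """
    query = {}
    for _, predicate, obj in patterns:
        query[predicate] = {"$exists": True} if obj.startswith("?") else obj
    return query


def chain_to_selects(patterns):
    """Build one pseudo-SQL SELECT per predicate table, linking each step to the previous one."""
    statements = []
    for i, (_, predicate, _) in enumerate(patterns, start=1):
        statement = f"SELECT S{i}, O{i} FROM {predicate}"
        if i > 1:
            statement += f" WHERE S{i} = O{i - 1}"
        statements.append(statement)
    return statements


star_filter = star_to_mongo_filter([("?w", "p6", "G"), ("?w", "p7", "I"), ("?w", "p8", "H")])
selects = chain_to_selects([("?x", "p1", "?y"), ("?y", "p2", "?z"), ("?z", "p3", "?w")])
```

The filter dictionary could be passed to a document-store `find` call, while the middleware would have to apply the link condition between consecutive SELECT results itself.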
Caching (two level cache)
Given the graph:
[Figure: example RDF graph divided into partitions P1, P2, and P3]
After the last query was issued:
Q: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H .
?x p1 ?y .
?y p2 ?z .
?z p3 ?w .
?y p5 ?w
}
Near cache (in-memory tree map)
A:p1:B {A:p1:B, B:p2:C}
B:p2:C {B:p2:C, C:p3:D}
Remote cache (key/value NoSQL database)
...
A:p1:B {A:p1:B, B:p2:C}
B:p2:C {B:p2:C, C:p3:D}
...
B:p5:F {B:p5:F, F:p9:D}
Normally, the near cache is smaller than the remote cache.
Caching (querying)
Given the graph:
[Figure: example RDF graph divided into partitions P1, P2, and P3]
If the following query is issued:
Q: SELECT ?x WHERE {
?x p1 ?y .
?y p2 ?z .
?z p3 ?w .
?y p5 F
}
Near cache (in-memory tree map)
A:p1:B {A:p1:B, B:p2:C}
B:p2:C {B:p2:C, C:p3:D}
Remote cache (key/value NoSQL database)
...
A:p1:B {A:p1:B, B:p2:C}
B:p2:C {B:p2:C, C:p3:D}
...
B:p5:F {B:p5:F, F:p9:D}
This query will be solved using only triples from the cache.
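A two-level cache of this kind can be sketched as follows. The sketch makes several assumptions that are not in the slides: the near cache is an `OrderedDict` with least-recently-used eviction (the slides mention an in-memory tree map and do not name an eviction policy), and a plain dict stands in for the remote key/value database. The class name `TwoLevelCache` is hypothetical.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Small in-process near cache in front of a larger remote key/value store."""

    def __init__(self, near_capacity, remote):
        self.near = OrderedDict()
        self.near_capacity = near_capacity
        self.remote = remote  # stand-in for a key/value NoSQL client (assumption)

    def get(self, key):
        if key in self.near:
            self.near.move_to_end(key)       # mark as recently used
            return self.near[key]
        value = self.remote.get(key)         # near miss: fall back to the remote cache
        if value is not None:
            self._put_near(key, value)
        return value

    def put(self, key, value):
        self.remote[key] = value
        self._put_near(key, value)

    def _put_near(self, key, value):
        self.near[key] = value
        self.near.move_to_end(key)
        if len(self.near) > self.near_capacity:
            self.near.popitem(last=False)    # evict the least recently used entry


cache = TwoLevelCache(near_capacity=2, remote={})
cache.put("A:p1:B", ["A:p1:B", "B:p2:C"])
cache.put("B:p2:C", ["B:p2:C", "C:p3:D"])
cache.put("B:p5:F", ["B:p5:F", "F:p9:D"])
```

After the third `put`, the oldest key is only in the remote cache, which mirrors the slide where the near cache holds fewer entries than the remote one. Keeping two copies is also what makes consistency harder once updates are supported.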
Evaluation
● LUBM: an ontology for the university domain, with synthetic RDF data scalable to any size and 14 extensional queries representing a variety of properties
● Generated dataset with 4000 universities (around 100 GB, containing around 500 million triples)
● 12 queries with joins: all of them have at least one subject-subject join, and six of them also have at least one subject-object join
● Apache Jena 3.2.0 with Java 1.8, Redis 3.2, MongoDB 3.4.3, and Apache Cassandra 3.10
● Amazon m3.xlarge (spot) with 7.5 GB of memory and 1 x 32 GB SSD capacity
Evaluation: Rendezvous performance
The larger the number of hops (that is, the replication), the larger the dataset size, which grows exponentially, and the longer the loading time. However, because joins are avoided, the response time decreases.
Evaluation: Rendezvous different settings
Performance is better when the partitions are managed by Rendezvous.
The larger the boundary replication, the faster the response time, without a big impact on the dataset size.
Evaluation: Rendezvous vs. ScalaRDF
Conclusions
● Rendezvous contributes to:
○ The graph partitioning problem, via fragments
○ Better query response time through n-hop and boundary replication
○ Better query response time via two-level caching
○ Scalable RDF storage provided by NoSQL databases
● About the evaluation:
○ Fragments are scalable
○ Bigger boundaries do not necessarily mean bigger storage size
○ Graph-aware partitions are better than NoSQL partitions
○ The near cache is fast, but it makes data consistency harder to maintain
Future Work
● Formalize the query mapping
○ No standard query language to rely on
● Compression of triples during the storage
● Update and delete operations
● Other NoSQL types (e.g., graph)
● Better datasets
Thank you!
LUBM model
Storing: Fragmentation
Storing: Partitioning
Evaluation: Rendezvous vs. Rainbow
State of the Art - SQL Triplestores
WARP, Hexastore, YARS, 4store, SPIDER, RDF-3X, SHARD,
SW-Store, SOLID, SPOVC, S2X
State of the Art - NoSQL Triplestores
RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF,
Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF,
H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON,
Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk,
Bouhali and Laurent, Papailiou et al., and ScalaRDF.
State of the Art - Triplestores
Recent survey (September 2017):
Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat,
Panos Kalnis: A Survey and Experimental
Comparison of Distributed SPARQL Engines for
Very Large RDF Data. Proceedings of the VLDB
Endowment, Volume 10, No. 13, September 2017,
2049 - 2060.
Designing IA for AI - Information Architecture Conference 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases

  • 1. Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases. Simpósio Brasileiro de Banco de Dados (SBBD), Uberlândia, October 2017. Luiz Henrique Zambom Santana and Prof. Dr. Ronaldo dos Santos Mello
  • 2. Agenda ● Introduction: Motivation, objectives, and contributions ● Background ○ RDF ○ NoSQL ● State of the Art ● Rendezvous ○ Storing: Fragmentation, Indexing, Partitioning, and Mapping ○ Querying: Query decomposition and Caching ● Evaluation
  • 3. Introduction: Motivation ● Since the Semantic Web proposal in 2001, the W3C has introduced many advances ● RDF and SPARQL are now in widespread use: ○ Best Buy: ■ http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html ○ Globo.com: ■ https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013 ○ US data.gov: ■ https://www.data.gov/developers/semantic-web
  • 5. Introduction: Motivation ● Research problem ○ Storing and querying large RDF graphs ■ No single node can handle the complete graph ■ Native RDF storage can’t scale to current data requirements ■ Inter-partition joins are very costly ● Research hypothesis ○ A workload-aware approach based on distributed polyglot NoSQL persistence could be a good solution
  • 6. Rendezvous ● Triplestore implemented as middleware for storing massive RDF graphs in multiple NoSQL databases ● Novel data partitioning approach ● Fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models ● Caching structure that accelerates query responses
  • 7. Introduction: Contributions ● Mapping of RDF to columnar, document, and key/value NoSQL models; ● A workload-aware partitioner based on the current graph structure and, mainly, on the typical application workload; ● A caching scheme based on key/value databases for speeding up query response time; ● An experimental evaluation that compares the current version of our approach against two baselines, including ScalaRDF (Hu et al.), using Redis, Apache Cassandra, and MongoDB, the most popular key/value, columnar, and document NoSQL databases, respectively.
  • 10. Background: NoSQL ● No SQL interface ● No ACID transactions ● Very scalable ● Schemaless. Source: https://db-engines.com/en/ranking
  • 12. State of the Art - Triplestores
      Triplestore | Frag. | Replication | Partitioning | Model | In-memory | Workload-aware
      Hexastore (2008) | No | No | No | Native | No | No
      SW-Store (2009) | No | No | Vertical | SQL | No | No
      CumulusRDF (2011) | No | No | Vertical | Columnar (Cassandra) | No | No
      SPOVC (2012) | No | No | Horizontal | Columnar (MonetDB) | No | No
      WARP (2013) | Yes | N-hop replication on partition boundary | Hash | Native | No | Dynamic
      Rainbow (2015) | No | No | Hash | Polyglot | K/V cache | Static
      ScalaRDF (2016) | No | Next-hop | Hash | Polyglot | K/V cache | No
      Rendezvous | Yes | N-hop replication on fragment and on partition boundary | V and H | Polyglot | K/V and local cache | Dynamic
    Slide callout: Key differentials
  • 17. Workload awareness. Given the graph [Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and predicates p1 to p9 and p11], suppose the following queries are issued:
      SELECT ?x WHERE { B p2 C . C p3 ?x }
      SELECT ?x WHERE { F p6 G . F p9 L . F p8 ?x }
    The Dataset Characterizer then holds:
      Star-shaped (indexed by the subject/object): F: {Fp6G, Fp9L, Fp8?}
      Chain-shaped (indexed by the predicate): p3: {Bp2C, Cp3?}
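The bookkeeping on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `classify` and `DatasetCharacterizer` are hypothetical, a star is taken to be a set of patterns sharing one subject, a chain is taken to be a sequence where each object is the next subject, and chains are keyed by the predicate of their last pattern (as p3 is in the slide's example).

```python
from collections import defaultdict

def classify(patterns):
    """Return 'star', 'chain', or 'other' for a list of (s, p, o) patterns."""
    if len(patterns) < 2:
        return "other"
    # Star: every pattern shares the same subject.
    if len({s for s, _, _ in patterns}) == 1:
        return "star"
    # Chain: the object of each pattern is the subject of the next one.
    if all(a[2] == b[0] for a, b in zip(patterns, patterns[1:])):
        return "chain"
    return "other"

class DatasetCharacterizer:
    """Hypothetical in-memory stand-in for the Dataset Characterizer."""

    def __init__(self):
        self.star = defaultdict(list)   # keyed by the shared subject
        self.chain = defaultdict(list)  # keyed by the last predicate (assumption)

    def record(self, patterns):
        shape = classify(patterns)
        if shape == "star":
            self.star[patterns[0][0]].append(patterns)
        elif shape == "chain":
            self.chain[patterns[-1][1]].append(patterns)
        return shape

dc = DatasetCharacterizer()
dc.record([("B", "p2", "C"), ("C", "p3", "?x")])
dc.record([("F", "p6", "G"), ("F", "p9", "L"), ("F", "p8", "?x")])
print(dict(dc.star), dict(dc.chain))
```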
  • 18. Rendezvous: Storing ● Fragmentation and Mapping ● Indexing ● Partitioning
  • 19. Star Fragmentation (n-hop expansion). Given the graph and this state of the Dataset Characterizer (Chain-shaped: p3: {Bp2C, Cp3?}; Star-shaped: F: {Fp6G, Fp9L, Fp8?}), the triple F p10 C is to be stored. F tends to appear in star queries with diameter 1, so we expand the triple F p10 C to a 1-hop fragment. [Figure: expanded fragment with nodes B, C, F, G, H, I, L and predicates p5 to p10]
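A minimal sketch of the 1-hop expansion, assuming the already stored triples are available as an in-memory list. The function name `star_fragment` is hypothetical, and the edge directions in `STORED` are read off the examples elsewhere in the deck (for instance B:p5:F from the cache slides), so they are an assumption about the figure.

```python
# Triples assumed to be stored already (directions inferred from the slides).
STORED = [
    ("A", "p1", "B"), ("B", "p2", "C"), ("C", "p3", "D"),
    ("B", "p5", "F"), ("F", "p6", "G"), ("F", "p7", "I"),
    ("F", "p8", "H"), ("F", "p9", "L"),
]

def star_fragment(new_triple, stored_triples, center, hops=1):
    """Collect the new triple plus every stored triple within `hops` of `center`."""
    fragment = {new_triple}
    frontier = {center}
    for _ in range(hops):
        reached = set()
        for s, p, o in stored_triples:
            # Keep any triple that touches the current frontier, in either direction.
            if s in frontier or o in frontier:
                fragment.add((s, p, o))
                reached.update((s, o))
        frontier = reached
    return fragment

fragment = star_fragment(("F", "p10", "C"), STORED, center="F", hops=1)
print(sorted(fragment))
```

With `hops=1` the result contains only the edges incident to F, which matches the node and predicate set shown for the expanded fragment.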
  • 20. Star Fragmentation (mapping). With the expanded fragment [Figure: nodes B, C, F, G, H, I, L and predicates p5 to p10], we translate it to a JSON document for the document database: { subject: F, p6: G, p7: I, p8: H, p10: C, p9: L, p5: { object: B }}
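The mapping to a document can be sketched as follows. The function name `fragment_to_document` is hypothetical, and treating `p5: { object: B }` as the encoding of the incoming edge B p5 F is my reading of the slide rather than something it states. The sketch also overwrites repeated predicates, which a real mapping would need to handle.

```python
import json

def fragment_to_document(center, fragment):
    """Map a 1-hop star fragment around `center` to a JSON-like dict."""
    doc = {"subject": center}
    for s, p, o in fragment:
        if s == center:
            doc[p] = o                # outgoing edge: predicate -> object
        elif o == center:
            doc[p] = {"object": s}    # incoming edge, nested as on the slide (assumption)
    return doc

fragment = [
    ("F", "p6", "G"), ("F", "p7", "I"), ("F", "p8", "H"),
    ("F", "p10", "C"), ("F", "p9", "L"), ("B", "p5", "F"),
]
print(json.dumps(fragment_to_document("F", fragment)))
```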
  • 21. Chain Fragmentation (n-hop expansion). Given the graph and this state of the Dataset Characterizer (Chain-shaped: p3: {Bp2C, Cp3?}; Star-shaped: F: {Fp6G, Fp9L, Fp8?}), the triple C p3 G is to be stored. p3 tends to appear in chain queries with max-diameter 1, so we expand the triple C p3 G to a 1-hop fragment. [Figure: expanded fragment with nodes B, C, D, F, G and predicates p2, p3, p6]
  • 22. Chain Fragmentation (mapping). With the expanded fragment [Figure: nodes B, C, D, F, G and predicates p2, p3, p6], we translate it to a set of columnar tables, one per predicate, for the columnar database:
      p2 (Subj, Obj): (B, C)
      p3 (Subj, Obj): (C, D), (C, G)
      p6 (Subj, Obj): (F, G)
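Grouping a chain fragment into one table per predicate can be sketched like this. The function name `fragment_to_tables` is hypothetical and plain Python lists stand in for the columnar tables; no Cassandra API is used.

```python
from collections import defaultdict

def fragment_to_tables(fragment):
    """Group (s, p, o) triples into one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in sorted(fragment):
        tables[p].append((s, o))
    return dict(tables)

fragment = [("B", "p2", "C"), ("C", "p3", "G"), ("C", "p3", "D"), ("F", "p6", "G")]
for predicate, rows in fragment_to_tables(fragment).items():
    print(predicate, rows)
```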
  • 24. Indexing. The Indexer indexes each new triple by its subject and by its object: S_PO: F: {p10C}, C: {p3G}; O_SP: C: {Fp10}, G: {Cp3}. The indexes help with triple expansion and with solving simple queries such as: SELECT ?x WHERE { F p10 ?x }
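A minimal in-memory sketch of the two indexes, assuming Python dictionaries in place of whatever store Rendezvous actually uses. The class name `Indexer` follows the slide label; the method names are hypothetical.

```python
from collections import defaultdict

class Indexer:
    """Keeps S_PO (subject -> predicate/object) and O_SP (object -> subject/predicate)."""

    def __init__(self):
        self.s_po = defaultdict(set)
        self.o_sp = defaultdict(set)

    def add(self, s, p, o):
        self.s_po[s].add((p, o))
        self.o_sp[o].add((s, p))

    def objects(self, s, p):
        """Answer a simple query of the form SELECT ?x WHERE { s p ?x }."""
        return sorted(o for pred, o in self.s_po.get(s, ()) if pred == p)

idx = Indexer()
idx.add("F", "p10", "C")
idx.add("C", "p3", "G")
print(idx.objects("F", "p10"))
```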
  • 26. Partitioning. If a graph is bigger than a single server's capacity, the Rendezvous DBA can create multiple partitions. Each NoSQL server can hold one or more partitions, and each partition is on only one server. [Figure: example graph split into partitions P1, P2, and P3, hosted on a columnar database and a document database]
  • 27. Partitioning. Rendezvous manages the partitions by recording them in the dictionary:
      Fragments hash: (F p10 C), size 2, {P1, P2}; (C p3 D), size 2, {P3}; (L p12 H), size 1, {P2}
      P1 elements (S P O): A p1 B; F p10 C; ...
      P2 elements (S P O): F p10 C; L p12 J; ... ... H
      P3 elements (S P O): C p3 D; ...
    [Figure: example graph split into partitions P1, P2, and P3; partitions P1 to Pn hosted on columnar and document databases]
  • 28. Partitioning (boundary replication). If a triple is on the edge of two partitions, it is replicated in both partitions. The size of this boundary is defined by the DBA. In the dictionary of slide 27, for example, the fragment (F p10 C) is listed under {P1, P2}.
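Boundary replication with a boundary of one hop can be sketched as below. Everything here is an assumption made for illustration: the slides do not say how nodes are assigned to partitions, so `node_partition` is a hypothetical lookup table, and `place_with_boundary_replication` is a hypothetical name.

```python
def place_with_boundary_replication(triple, node_partition, partitions):
    """Store a triple in the partition of its subject and of its object.

    A triple whose endpoints live in different partitions sits on the
    boundary and therefore ends up in both of them.
    """
    s, _, o = triple
    targets = {node_partition[s], node_partition[o]}
    for name in targets:
        partitions[name].add(triple)
    return targets

# Hypothetical node assignment; only the F/C split mirrors the slide's example.
node_partition = {"A": "P1", "B": "P1", "F": "P1", "C": "P2"}
partitions = {"P1": set(), "P2": set()}

print(sorted(place_with_boundary_replication(("F", "p10", "C"), node_partition, partitions)))
print(sorted(place_with_boundary_replication(("A", "p1", "B"), node_partition, partitions)))
```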
  • 29. Partitioning (data placement). The fragment hash helps with data placement: based on the triple and the size of its fragment, Rendezvous finds the best partition in which to store the triple.
  • 30. Rendezvous: Querying ● Query evaluation ● Query decomposition ● Caching
  • 31. Query evaluation. Given the graph [Figure: example RDF graph split into partitions P1, P2, and P3], suppose the following query is issued:
      Q: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H . ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }
    Rendezvous then: 1. Searches the query for: 1.1. simple queries; 1.2. star queries; 1.3. chain queries. 2. Updates the Dataset Characterizer.
      Chain: Qc: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }
      Star: Qs: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H }
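One simple way to split Q into Qs and Qc is to group triple patterns by subject. This is a sketch under assumptions, not the authors' algorithm: groups with two or more patterns are treated as stars, the remaining single patterns are treated as chain candidates in query order, and simple queries are not handled separately. The names `decompose` and `is_chain` are hypothetical.

```python
from collections import defaultdict

def decompose(patterns):
    """Split (s, p, o) patterns into star groups and a chain candidate."""
    by_subject = defaultdict(list)
    for pattern in patterns:
        by_subject[pattern[0]].append(pattern)
    stars = [group for group in by_subject.values() if len(group) > 1]
    chain = [group[0] for group in by_subject.values() if len(group) == 1]
    return stars, chain

def is_chain(patterns):
    """True when each pattern's object is the next pattern's subject."""
    return all(a[2] == b[0] for a, b in zip(patterns, patterns[1:]))

Q = [
    ("?w", "p6", "G"), ("?w", "p7", "I"), ("?w", "p8", "H"),
    ("?x", "p1", "?y"), ("?y", "p2", "?z"), ("?z", "p3", "?w"),
]
stars, chain = decompose(Q)
print(stars)
print(chain, is_chain(chain))
```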
  • 32. Query decomposition. Given the graph [Figure: example RDF graph split into partitions P1, P2, and P3], Rendezvous finds the right partition using the dictionary and translates the SPARQL query into the final query to be processed by the NoSQL database.
      Chain: Qc: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }
        Partition 1: Cp1: SELECT S1, O1 FROM p1
        Partition 1: Cp1: SELECT S2, O2 FROM p2 WHERE O=S1
        Partition 3: Cp3: SELECT S3, O3 FROM p3 WHERE O=S2
      Star: Qs: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H }
        D: db.partition2.find({ {p6:{$exists:true}, object:G}, {p7:{$exists:true}, object:I}, {p8:{$exists:true}, object:H} })
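As a rough illustration of the translation step for star queries, the sketch below builds a MongoDB-style filter as a plain Python dict. It is not the query shown on the slide: it assumes documents shaped like the JSON example on slide 20 (one field per predicate), and the function name `star_to_filter` is hypothetical. No database driver is called.

```python
def star_to_filter(star_patterns):
    """Build a filter dict for a star query over per-predicate document fields."""
    query_filter = {}
    for _, predicate, obj in star_patterns:
        if obj.startswith("?"):
            # Unbound object: only require that the predicate field is present.
            query_filter[predicate] = {"$exists": True}
        else:
            # Bound object: require an exact match on the predicate field.
            query_filter[predicate] = obj
    return query_filter

Qs = [("?w", "p6", "G"), ("?w", "p7", "I"), ("?w", "p8", "H")]
print(star_to_filter(Qs))
```

A dict like this could be passed to a document database's find operation on the collection that the dictionary maps to the matching partition.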
  • 34. Caching (two-level cache). Given the graph [Figure: example RDF graph split into partitions P1, P2, and P3], after the last query was issued:
      Q: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H . ?x p1 ?y . ?y p2 ?z . ?z p3 ?w . ?y p5 ?w }
    the caches hold:
      Near cache (in-memory tree map): A:p1:B: {A:p1:B, B:p2:C}; B:p2:C: {B:p2:C, C:p3:D}
      Remote cache (key/value NoSQL database): ...; A:p1:B: {A:p1:B, B:p2:C}; B:p2:C: {B:p2:C, C:p3:D}; ...; B:p5:F: {B:p5:F, F:p9:D}
    Normally, the near cache is smaller than the remote cache.
  • 35. Caching (querying). Given the graph [Figure: example RDF graph split into partitions P1, P2, and P3], suppose the following query is issued:
      Q: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w . ?y p5 F }
    with the caches in this state:
      Near cache (in-memory tree map): A:p1:B: {A:p1:B, B:p2:C}; B:p2:C: {B:p2:C, C:p3:D}
      Remote cache (key/value NoSQL database): ...; A:p1:B: {A:p1:B, B:p2:C}; B:p2:C: {B:p2:C, C:p3:D}; ...; B:p5:F: {B:p5:F, F:p9:D}
    This query is answered using only triples from the cache.
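The two-level lookup can be sketched as follows. The assumptions are mine: the near cache is modelled as a small `OrderedDict` with least-recently-used eviction (the slides mention an in-memory tree map and do not state an eviction policy), and a plain dict stands in for the remote key/value database. The class name `TwoLevelCache` is hypothetical.

```python
from collections import OrderedDict

class TwoLevelCache:
    """Small near cache in front of a larger remote key/value store."""

    def __init__(self, near_capacity, remote):
        self.near = OrderedDict()
        self.near_capacity = near_capacity
        self.remote = remote  # any dict-like object; stands in for the key/value DB

    def _put_near(self, key, triples):
        self.near[key] = triples
        self.near.move_to_end(key)
        while len(self.near) > self.near_capacity:
            self.near.popitem(last=False)  # evict the least recently used entry

    def put(self, key, triples):
        self.remote[key] = triples
        self._put_near(key, triples)

    def get(self, key):
        """Return (triples, level) where level is 'near', 'remote', or 'miss'."""
        if key in self.near:
            self.near.move_to_end(key)
            return self.near[key], "near"
        if key in self.remote:
            triples = self.remote[key]
            self._put_near(key, triples)  # promote to the near cache
            return triples, "remote"
        return None, "miss"

cache = TwoLevelCache(near_capacity=2, remote={})
cache.put("B:p5:F", ["B:p5:F", "F:p9:D"])
cache.put("A:p1:B", ["A:p1:B", "B:p2:C"])
cache.put("B:p2:C", ["B:p2:C", "C:p3:D"])
print(list(cache.near), cache.get("B:p5:F")[1])
```

After the three `put` calls the near cache holds the two most recent keys while the remote store holds all three, which mirrors the state drawn on the slide.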
  • 37. Evaluation ● LUBM: ontology for the university domain, synthetic RDF data scalable to any size, and 14 extensional queries representing a variety of properties ● Generated dataset with 4000 universities (around 100 GB, containing around 500 million triples) ● 12 queries with joins; all of them have at least one subject-subject join, and six of them also have at least one subject-object join ● Apache Jena 3.2.0 with Java 1.8; Redis 3.2, MongoDB 3.4.3, and Apache Cassandra 3.10 as the NoSQL databases ● Amazon m3.xlarge spot instances with 7.5 GB of memory and 1 x 32 GB SSD capacity
  • 38. Evaluation: Rendezvous performance. The larger the number of hops (that is, the replication), the larger the dataset size and the loading time, both of which grow exponentially. However, because joins are avoided, the response time decreases.
  • 39. Evaluation: Rendezvous under different settings. Performance is better when the partitions are managed by Rendezvous. The larger the boundary replication, the faster the response time, without a big impact on the dataset size.
  • 41. Conclusions ● Rendezvous contributes to: ○ The graph partitioning problem, via fragments ○ Better query response time through n-hop and boundary replication ○ Better query response time via two-level caching ○ Scalable RDF storage provided by NoSQL databases ● About the evaluation: ○ Fragments are scalable ○ Bigger boundaries do not necessarily mean bigger storage size ○ Graph-aware partitions are better than NoSQL partitions ○ The near cache is fast, but it makes data consistency harder to maintain
  • 42. Future Work ● Formalize the query mapping ○ No standard query language to rely on ● Compression of triples during storage ● Update and delete operations ● Other NoSQL types (e.g., graph) ● Better datasets
  • 43. Thank you! Simpósio Brasileiro de Banco de Dados (SBBD), Uberlândia, October 2017. Luiz Henrique Zambom Santana and Prof. Dr. Ronaldo dos Santos Mello
  • 50. State of the Art - SQL Triplestores: WARP, Hexastore, YARS, 4store, SPIDER, RDF-3X, SHARD, SW-Store, SOLID, SPOVC, S2X
  • 51. State of the Art - NoSQL Triplestores: RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF, Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF, H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON, Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk, Bouhali and Laurent, Papailiou et al., and ScalaRDF.
  • 52. State of the Art - Triplestores. Recent survey (September 2017): Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, Panos Kalnis: A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data. Proceedings of the VLDB Endowment, Volume 10, No. 13, September 2017, 2049-2060.