Workload-Aware RDF Partitioning and
SPARQL Query Caching for Massive RDF
Graphs stored in NoSQL Databases
Simpósio Brasileiro de Banco de Dados (SBBD)
Uberlândia, October 2017
Luiz Henrique Zambom Santana
Prof. Dr. Ronaldo dos Santos Mello
Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
Introduction: Motivation
● Since the Semantic Web proposal in 2001, the W3C has introduced many advances
● RDF and SPARQL are now widespread:
○ Best Buy:
■ http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html
○ Globo.com:
■ https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013
○ US data.gov:
■ https://www.data.gov/developers/semantic-web
Introduction: Motivation (LOD stats)
Introduction: Motivation
● Research problem
○ Storing/querying large RDF graphs
■ No single node can handle the complete graph
■ Native RDF storage cannot scale to current data requirements
■ Inter-partition joins are very costly
● Research hypothesis
○ A workload-aware approach based on distributed polyglot NoSQL persistence could be a good solution
Rendezvous
● Triplestore implemented as middleware for storing massive RDF graphs in multiple NoSQL databases
● Novel data partitioning approach
● Fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
● Caching structure that accelerates query response
Introduction: Contributions
● Mapping of RDF to columnar, document, and key/value NoSQL models;
● A workload-aware partitioner based on the current graph structure and, mainly, on the typical application workload;
● A caching scheme based on key/value databases that speeds up query response time;
● An experimental evaluation that compares the current version of our approach against two baselines, including ScalaRDF (Hu et al.), using Redis, Apache Cassandra, and MongoDB, the most popular key/value, columnar, and document NoSQL databases, respectively.
Background: RDF and SPARQL
Background: NoSQL
● Typically no SQL interface
● Typically no full ACID transactions
● Horizontally scalable
● Schemaless
https://db-engines.com/en/ranking
State of the Art - Triplestores
Triplestore | Fragmentation | Replication | Partitioning | Model | In-memory | Workload-aware
Hexastore (2008) | No | No | No | Native | No | No
SW-Store (2009) | No | No | Vertical | SQL | No | No
CumulusRDF (2011) | No | No | Vertical | Columnar (Cassandra) | No | No
SPOVC (2012) | No | No | Horizontal | Columnar (MonetDB) | No | No
WARP (2013) | Yes | N-hop replication on partition boundary | Hash | Native | No | Dynamic
Rainbow (2015) | No | No | Hash | Polyglot | K/V cache | Static
ScalaRDF (2016) | No | Next-hop | Hash | Polyglot | K/V cache | No
Rendezvous | Yes | N-hop replication on fragments and on partition boundary | Vertical and horizontal | Polyglot | K/V and local cache | Dynamic
Key differentials
Rendezvous: Architecture
[Figure: Rendezvous architecture, with two labeled parts: Workload awareness and Middleware core]
Workload awareness
Given the graph:
[Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and edges labeled p1 to p9 and p11]
If the following queries are issued:
SELECT ?x WHERE {
B p2 C .
C p3 ?x
}
SELECT ?x WHERE {
F p6 G .
F p9 L .
F p8 ?x
}
the Dataset Characterizer records them by shape:
Star-shaped (indexed by the subject/object)
... ...
F {Fp6G, Fp9L, Fp8?}
Chain-shaped (indexed by the predicate)
... ...
p3 {Bp2C, Cp3?}
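The bookkeeping described on this slide can be sketched as follows. This is only an illustration under stated assumptions: triple patterns are plain (subject, predicate, object) tuples, a star is keyed by its shared subject, and a chain is keyed by the predicate of its last pattern, which reproduces the p3 entry above. The class and method names are hypothetical and are not taken from the Rendezvous code base.

```python
from collections import defaultdict


class DatasetCharacterizer:
    """Hypothetical sketch: remembers which query shapes the workload contains."""

    def __init__(self):
        self.star_shaped = defaultdict(list)   # subject -> star patterns seen
        self.chain_shaped = defaultdict(list)  # predicate -> chain patterns seen

    def record(self, patterns):
        """Classify one basic graph pattern, store it, and return its shape."""
        if len(patterns) == 1:
            return "simple"
        subjects = {s for s, _, _ in patterns}
        if len(subjects) == 1:
            # Every pattern shares the same subject: a star around that subject.
            self.star_shaped[patterns[0][0]].append(list(patterns))
            return "star"
        # A chain: the object of each pattern is the subject of the next one.
        linked = all(
            patterns[i][2] == patterns[i + 1][0] for i in range(len(patterns) - 1)
        )
        if linked:
            # Assumption: key the chain by the predicate of its last pattern.
            self.chain_shaped[patterns[-1][1]].append(list(patterns))
            return "chain"
        return "other"


characterizer = DatasetCharacterizer()
print(characterizer.record([("B", "p2", "C"), ("C", "p3", "?x")]))
print(characterizer.record([("F", "p6", "G"), ("F", "p9", "L"), ("F", "p8", "?x")]))
print(dict(characterizer.star_shaped))
print(dict(characterizer.chain_shaped))
```

With the two queries from the slide, the first is recorded under the chain key p3 and the second under the star key F.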
Rendezvous: Storing
● Fragmentation and Mapping
● Indexing
● Partitioning
Star Fragmentation (n-hop expansion)
Given the graph, the new triple F p10 C, and this state of the Dataset Characterizer:
[Figure: the example RDF graph, plus the new triple F p10 C]
Chain-shaped
... ...
p3 {Bp2C, Cp3?}
Star-shaped
... ...
F {Fp6G, Fp9L, Fp8?}
F tends to appear in star queries with diameter 1, so we expand the triple F p10 C to a 1-hop fragment:
[Figure: 1-hop fragment around F, with edges p5 (from B), p6 (to G), p7 (to I), p8 (to H), p9 (to L), and p10 (to C)]
F p10 C will be stored.
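A minimal sketch of the n-hop expansion step, assuming the graph is held as a list of (subject, predicate, object) tuples. The edge list below is an assumption reconstructed from entries that appear elsewhere in the deck (the JSON document and the cache keys); the function name `expand_fragment` is hypothetical.

```python
def expand_fragment(center, graph, hops=1):
    """Collect every triple within `hops` edges of `center`, in either direction."""
    frontier = {center}
    fragment = set()
    for _ in range(hops):
        reached = set()
        for s, p, o in graph:
            if s in frontier or o in frontier:
                fragment.add((s, p, o))
                reached.update((s, o))
        frontier = reached
    return fragment


# Assumed edges of the example graph, including the new triple F p10 C.
graph = [
    ("A", "p1", "B"), ("B", "p2", "C"), ("C", "p3", "D"),
    ("B", "p5", "F"), ("F", "p6", "G"), ("F", "p7", "I"),
    ("F", "p8", "H"), ("F", "p9", "L"), ("F", "p10", "C"),
]

one_hop = expand_fragment("F", graph, hops=1)
for triple in sorted(one_hop):
    print(triple)
```

Raising `hops` pulls in more neighbouring triples, which is the replication cost that the evaluation slides discuss.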
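Star Fragmentation (mapping)
With the expanded fragment:
[Figure: 1-hop fragment around F, with edges p5 (from B), p6 (to G), p7 (to I), p8 (to H), p9 (to L), and p10 (to C)]
We translate it to a JSON document, stored in the document database:
{
  "subject": "F",
  "p6": "G",
  "p7": "I",
  "p8": "H",
  "p10": "C",
  "p9": "L",
  "p5": { "object": "B" }
}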
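The star mapping above could be sketched like this. It is an illustration only: the function name `star_to_document` is hypothetical, incoming edges are nested under an "object" field simply to mirror the p5 entry on the slide, and a predicate that occurs more than once would overwrite the earlier value in this simplified version.

```python
import json


def star_to_document(center, fragment):
    """Turn a 1-hop star fragment into one JSON-like document per center node."""
    document = {"subject": center}
    for s, p, o in fragment:
        if s == center:
            document[p] = o                # outgoing edge: predicate -> object
        elif o == center:
            document[p] = {"object": s}    # incoming edge, nested as on the slide
    return document


fragment = [
    ("F", "p6", "G"), ("F", "p7", "I"), ("F", "p8", "H"),
    ("F", "p10", "C"), ("F", "p9", "L"), ("B", "p5", "F"),
]
document = star_to_document("F", fragment)
print(json.dumps(document, indent=2))
```

Keeping the whole star in one document means a star query around F can be answered by reading a single document, without joins.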
Chain Fragmentation (n-hop expansion)
Given the graph, the new triple C p3 G, and this state of the Dataset Characterizer:
[Figure: the example RDF graph, plus the new triple C p3 G]
Chain-shaped
... ...
p3 {Bp2C, Cp3?}
Star-shaped
... ...
F {Fp6G, Fp9L, Fp8?}
p3 tends to appear in chain queries with max-diameter 1, so we expand the triple C p3 G to a 1-hop fragment:
[Figure: 1-hop fragment with edges B p2 C, C p3 D, C p3 G, and F p6 G]
C p3 G will be stored.
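Chain Fragmentation (mapping)
With the expanded fragment:
[Figure: 1-hop fragment with edges B p2 C, C p3 D, C p3 G, and F p6 G]
We translate it to a set of columnar tables, one per predicate, stored in the columnar database:
Table p2
Subj | Obj
B | C
Table p3
Subj | Obj
C | D
C | G
Table p6
Subj | Obj
F | G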
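A minimal sketch of the chain mapping, assuming one table per predicate with a subject column and an object column. Plain Python lists stand in for the columnar tables, and the function name `chain_to_tables` is hypothetical.

```python
from collections import defaultdict


def chain_to_tables(fragment):
    """Group a chain fragment into per-predicate tables of (subject, object) rows."""
    tables = defaultdict(list)
    for s, p, o in fragment:
        tables[p].append((s, o))
    return dict(tables)


fragment = [("B", "p2", "C"), ("C", "p3", "D"), ("C", "p3", "G"), ("F", "p6", "G")]
tables = chain_to_tables(fragment)
for predicate, rows in tables.items():
    print(predicate, rows)
```

With this layout, following a chain means looking up one predicate table per step and matching the object of one row against the subject of the next.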
Indexing
The Indexer indexes each new triple by its subject (S_PO) and by its object (O_SP).
S_PO
...
F {p10C}
C {p3G}
O_SP
...
C {Fp10}
G {Cp3}
These indexes help with triple expansion and answer simple queries such as:
SELECT ?x WHERE { F p10 ?x }
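The two indexes can be sketched with in-memory dictionaries. This is an assumption-laden illustration: the class name `Indexer` follows the slide label, but the method names and the use of Python sets are hypothetical.

```python
from collections import defaultdict


class Indexer:
    """Hypothetical sketch of the S_PO and O_SP indexes."""

    def __init__(self):
        self.s_po = defaultdict(set)  # subject -> {(predicate, object)}
        self.o_sp = defaultdict(set)  # object  -> {(subject, predicate)}

    def add(self, s, p, o):
        """Index a new triple by its subject and by its object."""
        self.s_po[s].add((p, o))
        self.o_sp[o].add((s, p))

    def objects(self, s, p):
        """Answer a simple query of the form SELECT ?x WHERE { s p ?x }."""
        return {o for (pred, o) in self.s_po.get(s, ()) if pred == p}


index = Indexer()
index.add("F", "p10", "C")
index.add("C", "p3", "G")
print(index.objects("F", "p10"))
print(dict(index.o_sp))
```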
Partitioning
[Figure: the example RDF graph divided into partitions P1, P2, and P3, placed on a columnar database and a document database]
If a graph exceeds the capacity of a single server, the Rendezvous DBA can create multiple partitions.
Each NoSQL server can hold one or more partitions, and each partition resides on exactly one server.
Partitioning
Rendezvous manages the partitions by recording them in the dictionary.
[Figure: the example RDF graph divided into partitions P1, P2, P3, ..., Pn across columnar and document databases]
Dictionary
Fragments hash
Fragment | Size | Partitions
(F p10 C) | 2 | {P1, P2}
(C p3 D) | 2 | {P3}
(L p12 H) | 1 | {P2}
P1 Elements
S | P | O
A | p1 | B
F | p10 | C
...
P2 Elements
S | P | O
F | p10 | C
L | p12 | J
... | ... | H
P3 Elements
S | P | O
C | p3 | D
... | ... | ...
Partitioning (boundary replication)
[Figure: same dictionary and partition layout as on the previous Partitioning slide]
If a triple lies on the boundary between two partitions, it is replicated in both partitions. The size of this boundary is defined by the DBA.
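Boundary replication can be illustrated with a small sketch. It assumes a boundary of size 1 (only triples whose subject and object live in different partitions are replicated) and a given node-to-partition assignment; both the assignment below and the function name are hypothetical and chosen only to reproduce the (F p10 C) entry of the fragments hash.

```python
def place_with_boundary_replication(triples, node_partition):
    """Store each triple in the partition of its subject and of its object."""
    partitions = {}
    for s, p, o in triples:
        # A set collapses to one partition when both ends live in the same place.
        for partition in {node_partition[s], node_partition[o]}:
            partitions.setdefault(partition, set()).add((s, p, o))
    return partitions


# Hypothetical assignment of nodes to partitions.
node_partition = {"A": "P1", "B": "P1", "F": "P1", "C": "P2", "L": "P2", "H": "P2"}
triples = [("A", "p1", "B"), ("F", "p10", "C"), ("L", "p12", "H")]

placement = place_with_boundary_replication(triples, node_partition)
for name in sorted(placement):
    print(name, sorted(placement[name]))
```

Here F p10 C crosses the boundary between P1 and P2, so it ends up in both, and a query touching it can be answered from either partition without an inter-partition join.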
Partitioning (data placement)
[Figure: same dictionary and partition layout as on the previous Partitioning slides]
The fragments hash helps with data placement. Based on the triple and the size of the fragment, Rendezvous finds the best partition in which to store a triple.
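The slides do not spell out the placement rule, so the following is purely an assumed heuristic for illustration, not the Rendezvous algorithm: a new triple goes to a partition that already holds the largest known fragment sharing a node with it, and otherwise to the least-loaded partition. The function name, the dictionary layout, and the partition load figures are all hypothetical.

```python
def choose_partition(triple, fragment_hash, partition_sizes):
    """Pick a partition for `triple` (assumed heuristic, see lead-in)."""
    s, _, o = triple
    # Known fragments that share at least one node with the new triple.
    related = [
        entry
        for (fs, _, fo), entry in fragment_hash.items()
        if {fs, fo} & {s, o}
    ]
    if related:
        largest = max(related, key=lambda entry: entry["size"])
        candidates = sorted(largest["partitions"])
    else:
        candidates = sorted(partition_sizes)
    # Among the candidates, prefer the partition with the fewest elements.
    return min(candidates, key=lambda name: partition_sizes[name])


fragment_hash = {
    ("F", "p10", "C"): {"size": 2, "partitions": {"P1", "P2"}},
    ("C", "p3", "D"): {"size": 2, "partitions": {"P3"}},
    ("L", "p12", "H"): {"size": 1, "partitions": {"P2"}},
}
partition_sizes = {"P1": 5, "P2": 3, "P3": 4}  # hypothetical element counts

print(choose_partition(("F", "p6", "G"), fragment_hash, partition_sizes))
print(choose_partition(("D", "p13", "K"), fragment_hash, partition_sizes))
print(choose_partition(("X", "p1", "Y"), fragment_hash, partition_sizes))
```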
Rendezvous: Querying
● Query evaluation
● Query decomposition
● Caching
Query evaluation
Given the graph:
[Figure: the example RDF graph divided into partitions P1, P2, and P3]
If the following query is issued:
Q: SELECT ?x WHERE
{
?w p6 G .
?w p7 I .
?w p8 H .
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}
Rendezvous:
1. Searches the query for:
1.1. Simple queries
1.2. Star queries
1.3. Chain queries
2. Updates the Dataset Characterizer
Chain:
Qc: SELECT ?x ?w
WHERE {
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}
Star:
Qs: SELECT ?w
WHERE {
?w p6 G .
?w p7 I .
?w p8 H
}
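The split into star and chain subqueries can be sketched as follows. This is a simplified illustration, assuming triple patterns are (subject, predicate, object) tuples: patterns that share a subject form a star, and the remaining patterns are linked into chains by following each object to the next subject. The function name `decompose` is hypothetical.

```python
from collections import defaultdict


def decompose(patterns):
    """Split a basic graph pattern into star, chain, and simple subqueries."""
    by_subject = defaultdict(list)
    for pattern in patterns:
        by_subject[pattern[0]].append(pattern)

    # Two or more patterns with the same subject form a star.
    stars = [group for group in by_subject.values() if len(group) > 1]
    singles = {group[0][0]: group[0] for group in by_subject.values() if len(group) == 1}
    objects = {o for (_, _, o) in singles.values()}

    chains, simple = [], []
    for subject, pattern in singles.items():
        if subject in objects:
            continue  # reached from another pattern, so not the head of a chain
        chain = [pattern]
        nxt = pattern[2]
        while nxt in singles and singles[nxt] not in chain:
            chain.append(singles[nxt])
            nxt = singles[nxt][2]
        (chains if len(chain) > 1 else simple).append(chain)
    return stars, chains, simple


query = [
    ("?w", "p6", "G"), ("?w", "p7", "I"), ("?w", "p8", "H"),
    ("?x", "p1", "?y"), ("?y", "p2", "?z"), ("?z", "p3", "?w"),
]
stars, chains, simple = decompose(query)
print("star:", stars)
print("chain:", chains)
print("simple:", simple)
```

For the query Q on the slide this yields one star around ?w and one chain from ?x to ?w, matching Qs and Qc.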
Query decomposition
Given the graph:
[Figure: the example RDF graph divided into partitions P1, P2, and P3]
Rendezvous finds the right partition using the dictionary and translates each SPARQL subquery into the final query to be processed by the NoSQL database.
Chain:
Qc: SELECT ?x ?w
WHERE {
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}
Partition 1:
Cp1: SELECT S1, O1 FROM p1
Cp2: SELECT S2, O2 FROM p2
WHERE S2 = O1
Partition 3:
Cp3: SELECT S3, O3 FROM p3
WHERE S3 = O2
Star:
Qs: SELECT ?w
WHERE {
?w p6 G .
?w p7 I .
?w p8 H
}
D: db.partition2.find({ p6: "G", p7: "I", p8: "H" })
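A sketch of the two translations, under stated assumptions: the star subquery becomes a filter with one field per bound predicate (the shape a document-store lookup would take), and the chain subquery is evaluated by joining per-predicate (subject, object) tables inside the middleware. In-memory lists stand in for the columnar tables, the table contents are taken from entries shown elsewhere in the deck, and all function names are hypothetical.

```python
def star_to_filter(star):
    """Build a document filter: one field per predicate with a constant object."""
    return {p: o for (_, p, o) in star if not o.startswith("?")}


def bind(binding, term, value):
    """Bind a variable to a value, or check a constant; False on conflict."""
    if not term.startswith("?"):
        return term == value
    return binding.setdefault(term, value) == value


def evaluate_chain(chain, tables):
    """Join per-predicate tables along a chain of triple patterns."""
    rows = [{}]
    for s, p, o in chain:
        joined = []
        for binding in rows:
            for subj, obj in tables.get(p, []):
                candidate = dict(binding)
                if bind(candidate, s, subj) and bind(candidate, o, obj):
                    joined.append(candidate)
        rows = joined
    return rows


star = [("?w", "p6", "G"), ("?w", "p7", "I"), ("?w", "p8", "H")]
chain = [("?x", "p1", "?y"), ("?y", "p2", "?z"), ("?z", "p3", "?w")]
tables = {"p1": [("A", "B")], "p2": [("B", "C")], "p3": [("C", "D"), ("C", "G")]}

print(star_to_filter(star))
for row in evaluate_chain(chain, tables):
    print(row)
```

The final answer to Q would come from intersecting the ?w values of the chain result with the subjects of the documents matched by the star filter.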
Caching (two-level cache)
Given the graph:
[Figure: the example RDF graph divided into partitions P1, P2, and P3]
After the last query was issued:
Q: SELECT ?x WHERE
{
?w p6 G .
?w p7 I .
?w p8 H .
?x p1 ?y .
?y p2 ?z .
?z p3 ?w .
?y p5 ?w
}
Near cache (in-memory tree map)
A:p1:B {A:p1:B, B:p2:C}
B:p2:C {B:p2:C, C:p3:D}
Remote cache (key/value NoSQL database)
...
A:p1:B {A:p1:B, B:p2:C}
B:p2:C {B:p2:C, C:p3:D}
...
B:p5:F {B:p5:F, F:p9:D}
Normally, the near cache is smaller than the remote cache.
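A minimal sketch of a two-level cache lookup. Several things here are assumptions rather than details from the slides: the near cache is modelled as a small least-recently-used map instead of a tree map, a plain dictionary stands in for the remote key/value database, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Hypothetical sketch: small near cache in front of a larger remote cache."""

    def __init__(self, near_capacity, remote):
        self.near = OrderedDict()       # in-process cache, bounded
        self.near_capacity = near_capacity
        self.remote = remote            # stand-in for a key/value NoSQL database

    def get(self, key):
        if key in self.near:
            self.near.move_to_end(key)  # mark as most recently used
            return self.near[key]
        value = self.remote.get(key)
        if value is not None:
            self._put_near(key, value)  # promote a remote hit to the near cache
        return value

    def put(self, key, value):
        self.remote[key] = value
        self._put_near(key, value)

    def _put_near(self, key, value):
        self.near[key] = value
        self.near.move_to_end(key)
        if len(self.near) > self.near_capacity:
            self.near.popitem(last=False)  # evict the least recently used entry


cache = TwoLevelCache(near_capacity=2, remote={})
cache.put("A:p1:B", ["A:p1:B", "B:p2:C"])
cache.put("B:p2:C", ["B:p2:C", "C:p3:D"])
cache.put("B:p5:F", ["B:p5:F", "F:p9:D"])
print(list(cache.near))     # keys currently held in the near cache
print(cache.get("A:p1:B"))  # served from the remote cache and promoted
print(list(cache.near))
```

Because the near cache lives inside the middleware process, entries in it can go stale when the remote copy changes, which is the consistency trade-off noted in the conclusions.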
Caching (querying)
Given the graph:
[Figure: the example RDF graph divided into partitions P1, P2, and P3]
If the following query is issued:
Q: SELECT ?x WHERE
{
?x p1 ?y .
?y p2 ?z .
?z p3 ?w .
?y p5 F
}
Near cache (in-memory tree map)
A:p1:B {A:p1:B, B:p2:C}
B:p2:C {B:p2:C, C:p3:D}
Remote cache (key/value NoSQL database)
...
A:p1:B {A:p1:B, B:p2:C}
B:p2:C {B:p2:C, C:p3:D}
...
B:p5:F {B:p5:F, F:p9:D}
This query is answered using only triples from the cache.
Evaluation
● LUBM: an ontology for the university domain, synthetic RDF data scalable to an arbitrary size, and 14 extensional queries representing a variety of properties
● Generated dataset with 4,000 universities (around 100 GB, containing around 500 million triples)
● 12 queries with joins; all of them have at least one subject-subject join, and six of them also have at least one subject-object join
● Apache Jena 3.2.0 with Java 1.8, together with Redis 3.2, MongoDB 3.4.3, and Apache Cassandra 3.10
● Amazon m3.xlarge spot instances with 7.5 GB of memory and 1 x 32 GB of SSD storage
Evaluation: Rendezvous performance
The larger the number of hops (that is, the replication), the larger the dataset size and the loading time, both of which grow exponentially. However, because joins are avoided, the response time decreases.
Evaluation: Rendezvous different settings
Performance is better when the partitions are managed by Rendezvous.
The larger the boundary replication, the faster the response time, without a big impact on the dataset size.
Evaluation: Rendezvous vs. ScalaRDF
Conclusions
● Rendezvous contributes to:
○ The graph partitioning problem, via fragments
○ Better query response time through n-hop and boundary replication
○ Better query response time via two-level caching
○ Scalable RDF storage provided by NoSQL databases
● About the evaluation:
○ Fragments are scalable
○ Larger boundaries do not necessarily imply a larger storage size
○ Graph-aware partitions are better than NoSQL partitions
○ The near cache is fast, but it makes data consistency harder to maintain
Future Work
● Formalize the query mapping
○ No standard query language to rely on
● Compression of triples during storage
● Update and delete operations
● Other NoSQL types (e.g., graph)
● Better datasets
Thank you!
LUBM model [figure]
Storing: Fragmentation [three figure slides]
Storing: Partitioning [figure]
Evaluation: Rendezvous vs. Rainbow [figure]
State of the Art - SQL Triplestores
WARP, Hexastore, YARS, 4store, SPIDER, RDF-3X, SHARD, SW-Store, SOLID, SPOVC, S2X
State of the Art - NoSQL Triplestores
RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF, Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF, H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON, Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk, Bouhali and Laurent, Papailiou et al., and ScalaRDF.
State of the Art - Triplestores
Recent survey (September 2017):
Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, Panos Kalnis: A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data. Proceedings of the VLDB Endowment, Volume 10, No. 13, September 2017, 2049-2060.

Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 

Recently uploaded (20)

Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 

Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases

  • 1. Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases. Simpósio Brasileiro de Banco de Dados (SBBD), Uberlândia, October 2017. Luiz Henrique Zambom Santana; Prof. Dr. Ronaldo dos Santos Mello
  • 2. Agenda
    ● Introduction: Motivation, objectives, and contributions
    ● Background
      ○ RDF
      ○ NoSQL
    ● State of the Art
    ● Rendezvous
      ○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
      ○ Querying: Query decomposition and Caching
    ● Evaluation
  • 3. Introduction: Motivation
    ● Since the Semantic Web proposal in 2001, the W3C has introduced many advances
    ● RDF and SPARQL are currently widespread:
      ○ Best Buy:
        ■ http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html
      ○ Globo.com:
        ■ https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013
      ○ US data.gov:
        ■ https://www.data.gov/developers/semantic-web
  • 5. Introduction: Motivation
    ● Research problem
      ○ Storing/querying large RDF graphs
        ■ No single node can handle the complete graph
        ■ Native RDF storage can’t scale to the current data requirements
        ■ Inter-partition joins are very costly
    ● Research hypothesis
      ○ A workload-aware approach based on distributed polyglot NoSQL persistence could be a good solution
  • 6. Rendezvous
    ● Triplestore implemented as middleware for storing massive RDF graphs in multiple NoSQL databases
    ● Novel data partitioning approach
    ● Fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
    ● Caching structure that accelerates query response
  • 7. Introduction: Contributions
    ● Mapping of RDF to columnar, document, and key/value NoSQL models;
    ● A workload-aware partitioner based on the current graph structure and, mainly, on the typical application workload;
    ● A caching schema based on key/value databases for speeding up the query response time;
    ● An experimental evaluation that compares the current version of our approach against two baselines, including ScalaRDF (Hu et al.), using Redis, Apache Cassandra, and MongoDB, the most popular key/value, columnar, and document NoSQL databases, respectively.
  • 10. Background: NoSQL
    ● No SQL interface
    ● No ACID transactions
    ● Very scalable
    ● Schemaless
    https://db-engines.com/en/ranking
  • 12. State of the Art - Triplestores (key differentials)
    Triplestore       | Frag. | Replication                                            | Partitioning | Model                | In-memory           | Workload-aware
    Hexastore (2008)  | No    | No                                                     | No           | Native               | No                  | No
    SW-Store (2009)   | No    | No                                                     | Vertical     | SQL                  | No                  | No
    CumulusRDF (2011) | No    | No                                                     | Vertical     | Columnar (Cassandra) | No                  | No
    SPOVC (2012)      | No    | No                                                     | Horizontal   | Columnar (MonetDB)   | No                  | No
    WARP (2013)       | Yes   | N-hop replication on partition boundary                | Hash         | Native               | No                  | Dynamic
    Rainbow (2015)    | No    | No                                                     | Hash         | Polyglot             | K/V cache           | Static
    ScalaRDF (2016)   | No    | Next-hop                                               | Hash         | Polyglot             | K/V cache           | No
    Rendezvous        | Yes   | N-hop replication on fragment and on partition boundary | V and H      | Polyglot             | K/V and local cache | Dynamic
  • 14. Rendezvous
    ● Triplestore implemented as middleware for storing massive RDF graphs in multiple NoSQL databases
    ● Novel data partitioning approach
    ● Fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
    ● Caching structure that accelerates query response
  • 17. Workload awareness
    Given the graph:
    [Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and edges p1 to p9 and p11]
    If the following queries are issued:
      SELECT ?x WHERE { B p2 C . C p3 ?x }
      SELECT ?x WHERE { F p6 G . F p9 L . F p8 ?x }
    the Dataset Characterizer holds:
      Star-shaped:  ... F → {Fp6G, Fp9L, Fp8?}
      Chain-shaped: ... p3 → {Bp2C, Cp3?}
    Slide annotations: "Indexed by the predicate"; "Indexed by the subject/object"
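The bookkeeping on this slide can be illustrated with a small Python sketch. It is not the authors' implementation: the `DatasetCharacterizer` class, its `record` method, and the rule used to tell star from chain patterns (a shared subject versus object-to-subject links) are assumptions made for illustration. Star entries are keyed by the shared subject and chain entries by the last predicate, which is how the two example entries on the slide appear to be keyed.

```python
from collections import defaultdict

# A triple pattern is (subject, predicate, object); variables start with "?".
Pattern = tuple[str, str, str]


class DatasetCharacterizer:
    """Hypothetical workload tracker (names and rules are assumptions)."""

    def __init__(self) -> None:
        self.star: dict[str, set[Pattern]] = defaultdict(set)
        self.chain: dict[str, set[Pattern]] = defaultdict(set)

    def record(self, patterns: list[Pattern]) -> str:
        """Classify one query and remember its patterns."""
        subjects = {s for s, _, _ in patterns}
        if len(patterns) > 1 and len(subjects) == 1:
            # Star: every pattern shares the same subject.
            self.star[patterns[0][0]].update(patterns)
            return "star"
        is_chain = all(
            prev[2] == nxt[0] for prev, nxt in zip(patterns, patterns[1:])
        )
        if len(patterns) > 1 and is_chain:
            # Chain: the object of each pattern is the subject of the next.
            self.chain[patterns[-1][1]].update(patterns)
            return "chain"
        return "simple"


dc = DatasetCharacterizer()
dc.record([("B", "p2", "C"), ("C", "p3", "?x")])
dc.record([("F", "p6", "G"), ("F", "p9", "L"), ("F", "p8", "?x")])
```

After the two calls, `dc.star` has an entry for `F` and `dc.chain` has an entry for `p3`, matching the two rows shown on the slide.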
  • 18. Rendezvous: Storing
    ● Fragmentation and Mapping
    ● Indexing
    ● Partitioning
  • 19. Star Fragmentation (n-hop expansion)
    Given the graph and this state of the Dataset Characterizer:
    [Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and edges p1 to p9 and p11, plus the new triple F p10 C, which will be stored]
      Chain-shaped: ... p3 → {Bp2C, Cp3?}
      Star-shaped:  ... F → {Fp6G, Fp9L, Fp8?}
    F tends to appear in star queries with diameter 1, so we expand the triple Fp10C to a 1-hop fragment.
    [Figure: expanded fragment with nodes B, C, F, G, H, I, L and edges p5 to p10]
  • 20. Star Fragmentation (mapping)
    With the expanded fragment:
    [Figure: expanded fragment with nodes B, C, F, G, H, I, L and edges p5 to p10]
    we translate it to a JSON document (document database):
      { subject: F, p6: G, p7: I, p8: H, p10: C, p9: L, p5: { object: B } }
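The mapping from a star fragment to a document can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper `star_fragment_to_document`. Outgoing edges of the central node become `predicate: object` fields, and incoming edges are nested under their predicate to mirror the `p5` entry on the slide. How the real middleware encodes incoming edges or repeated predicates is not stated on the slide, and this sketch simply overwrites a repeated predicate.

```python
import json

Triple = tuple[str, str, str]


def star_fragment_to_document(center: str, triples: list[Triple]) -> dict:
    """Fold the triples around `center` into one document (hypothetical helper)."""
    doc: dict = {"subject": center}
    for s, p, o in triples:
        if s == center:
            # Outgoing edge: store the object under the predicate name.
            doc[p] = o
        elif o == center:
            # Incoming edge: nested form, mirroring the slide's p5 entry.
            doc[p] = {"object": s}
    return doc


fragment = [
    ("F", "p6", "G"), ("F", "p7", "I"), ("F", "p8", "H"),
    ("F", "p10", "C"), ("F", "p9", "L"), ("B", "p5", "F"),
]
document = star_fragment_to_document("F", fragment)
print(json.dumps(document))
```

The resulting dictionary can be handed to any document store client, since it contains only strings and nested dictionaries.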
  • 21. Chain Fragmentation (n-hop expansion)
    Given the graph and this state of the Dataset Characterizer:
    [Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and edges p1 to p9 and p11, plus the new triple C p3 G, which will be stored]
      Chain-shaped: ... p3 → {Bp2C, Cp3?}
      Star-shaped:  ... F → {Fp6G, Fp9L, Fp8?}
    p3 tends to appear in chain queries with max-diameter 1, so we expand the triple Cp3G to a 1-hop fragment.
    [Figure: expanded fragment with nodes B, C, D, F, G and edges p2, p3 (twice), and p6]
  • 22. Chain Fragmentation (mapping)
    With the expanded fragment:
    [Figure: expanded fragment with nodes B, C, D, F, G and edges p2, p3 (twice), and p6]
    we translate it to a set of columnar tables (columnar database):
      Table p2: Subj | Obj
                B    | C
      Table p3: Subj | Obj
                C    | D
                C    | G
      Table p6: Subj | Obj
                F    | G
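The chain mapping amounts to grouping triples by predicate, with one two-column table per predicate. A minimal sketch, assuming a hypothetical helper `chain_fragment_to_tables` and plain Python lists standing in for the columnar tables:

```python
from collections import defaultdict

Triple = tuple[str, str, str]


def chain_fragment_to_tables(
    triples: list[Triple],
) -> dict[str, list[tuple[str, str]]]:
    """Group triples by predicate: one (subject, object) table per predicate."""
    tables: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


fragment = [
    ("B", "p2", "C"), ("C", "p3", "D"), ("C", "p3", "G"), ("F", "p6", "G"),
]
tables = chain_fragment_to_tables(fragment)
```

Each key of `tables` would correspond to one table in the columnar database, and each tuple to one row.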
  • 24. Indexing
    Indexer:
      S_PO: ... F → {p10C}; C → {p3G}
      O_SP: ... C → {Fp10}; G → {Cp3}
    Each new triple is indexed by its subject and by its object. The indexes help during triple expansion and answer simple queries such as:
      SELECT ?x WHERE { F p10 ?x }
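The two indexes can be pictured as a pair of dictionaries. The sketch below is an in-memory stand-in, assuming a hypothetical `Indexer` class. The slide does not say where the real indexes are stored.

```python
from collections import defaultdict


class Indexer:
    """Hypothetical in-memory version of the S_PO and O_SP indexes."""

    def __init__(self) -> None:
        self.s_po = defaultdict(set)  # subject -> {(predicate, object)}
        self.o_sp = defaultdict(set)  # object  -> {(subject, predicate)}

    def add(self, s: str, p: str, o: str) -> None:
        """Index a new triple by its subject and by its object."""
        self.s_po[s].add((p, o))
        self.o_sp[o].add((s, p))

    def objects(self, s: str, p: str) -> set:
        """Answer a simple query such as SELECT ?x WHERE { s p ?x }."""
        return {o for pred, o in self.s_po.get(s, ()) if pred == p}


index = Indexer()
index.add("F", "p10", "C")
index.add("C", "p3", "G")
```

With these two triples indexed, `index.objects("F", "p10")` resolves the slide's example query from the S_PO index alone.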
  • 26. Partitioning
    [Figure: example RDF graph with nodes A to J and L and edges p1 to p11, split into partitions P1, P2, and P3, which are placed on a columnar database and a document database]
    If a graph exceeds the capacity of a single server, the Rendezvous DBA can create multiple partitions. Each NoSQL server can hold one or more partitions, and each partition resides on exactly one server.
  • 27. Partitioning
    [Figure: example RDF graph with nodes A to J and L and edges p1 to p11, split into partitions P1, P2, and P3; partitions P1 to Pn are placed on columnar and document databases]
    Dictionary:
      Fragments hash:
        (F p10 C)  Size: 2  {P1, P2}
        (C p3 D)   Size: 2  {P3}
        (L p12 H)  Size: 1  {P2}
      P1 elements (S P O): A p1 B; F p10 C; ...
      P2 elements (S P O): F p10 C; L p12 H; ...
      P3 elements (S P O): C p3 D; ...
    Rendezvous manages the partitions by recording them in the dictionary.
  • 28. Partitioning (boundary replication)
    [Figure: same dictionary and partition layout as on slide 27]
    If a triple is on the edge of two partitions, it will be replicated in both partitions. The size of this boundary is defined by the DBA.
  • 29. Partitioning (data placement)
    [Figure: same dictionary and partition layout as on slide 27]
    The fragment hash helps with data placement. Based on the triple and the size of its fragment, Rendezvous finds the best partition in which to store the triple.
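The dictionary and the boundary replication rule from the partitioning slides can be sketched as below. This is only an illustration under strong assumptions: the `PartitionDictionary` class is hypothetical, every node is assumed to already have a home partition (`node_home`), and the boundary is fixed at one hop. The DBA-defined boundary size and the fragment-size-based choice of the best partition are not modelled.

```python
from collections import defaultdict

Triple = tuple[str, str, str]


class PartitionDictionary:
    """Hypothetical dictionary: which partitions hold which triple."""

    def __init__(self, node_home: dict[str, str]) -> None:
        self.node_home = node_home            # node -> partition (assumed given)
        self.placement: dict[Triple, set[str]] = {}
        self.partitions: dict[str, list[Triple]] = defaultdict(list)

    def place(self, triple: Triple) -> set[str]:
        """Store a triple; a boundary triple is replicated in both partitions."""
        s, _, o = triple
        targets = {self.node_home[s], self.node_home[o]}
        for part in sorted(targets):
            self.partitions[part].append(triple)
        self.placement[triple] = targets
        return targets

    def lookup(self, triple: Triple) -> set[str]:
        """Return the partitions that hold a triple (empty set if unknown)."""
        return self.placement.get(triple, set())


d = PartitionDictionary({"A": "P1", "B": "P1", "F": "P1", "C": "P2"})
inner = d.place(("A", "p1", "B"))      # both ends in P1
boundary = d.place(("F", "p10", "C"))  # F lives in P1, C in P2
```

Here the triple `F p10 C` crosses from P1 to P2, so it ends up in both partitions, as in the dictionary entry `(F p10 C) {P1, P2}` on the slide.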
  • 30. Rendezvous: Querying
    ● Query evaluation
    ● Query decomposition
    ● Caching
  • 31. Query evaluation
    Given the graph:
    [Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and edges p1 to p9 and p11, split into partitions P1, P2, and P3]
    If the following query is issued:
      Q: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H . ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }
    Rendezvous:
      1. Searches for:
         1.1. Simple queries
         1.2. Star queries
         1.3. Chain queries
      2. Updates the Dataset Characterizer
    Chain: Qc: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }
    Star:  Qs: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H }
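Splitting a query into star and chain parts can be sketched in a few lines of Python. The `decompose` function and its rules are assumptions for illustration, not the authors' algorithm: patterns that share a subject form a star, and the remaining patterns are linked into a chain whenever the object of one is the subject of the next. Variables are written in standard SPARQL form (`?w`).

```python
from collections import defaultdict

Pattern = tuple[str, str, str]


def decompose(patterns: list[Pattern]) -> dict[str, list[list[Pattern]]]:
    """Split triple patterns into star, chain, and simple groups (a sketch)."""
    by_subject: dict[str, list[Pattern]] = defaultdict(list)
    for pat in patterns:
        by_subject[pat[0]].append(pat)

    # Star: two or more patterns with the same subject.
    stars = [group for group in by_subject.values() if len(group) > 1]
    rest = [group[0] for group in by_subject.values() if len(group) == 1]

    next_by_subject = {pat[0]: pat for pat in rest}
    objects = {pat[2] for pat in rest}
    chains, simple, seen = [], [], set()
    for pat in rest:
        if pat[0] in objects:
            continue  # not the head of a chain
        chain = [pat]
        # Follow object -> subject links; stop on a dead end or a cycle.
        while (nxt := next_by_subject.get(chain[-1][2])) and nxt not in chain:
            chain.append(nxt)
        seen.update(chain)
        (chains if len(chain) > 1 else simple).append(chain)
    # Patterns never reached from a head (pure cycles) are kept as simple.
    simple.extend([pat] for pat in rest if pat not in seen)
    return {"star": stars, "chain": chains, "simple": simple}


query = [
    ("?w", "p6", "G"), ("?w", "p7", "I"), ("?w", "p8", "H"),
    ("?x", "p1", "?y"), ("?y", "p2", "?z"), ("?z", "p3", "?w"),
]
parts = decompose(query)
```

For the query on the slide, `parts["star"]` holds the three `?w` patterns and `parts["chain"]` holds the `p1`, `p2`, `p3` sequence.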
  • 32. Query decomposition
    Given the graph:
    [Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and edges p1 to p9 and p11, split into partitions P1, P2, and P3]
    Chain: Qc: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }
      Partition 1:
        Cp1: SELECT S1, O1 FROM p1
        Cp2: SELECT S2, O2 FROM p2 WHERE S2 = O1
      Partition 3:
        Cp3: SELECT S3, O3 FROM p3 WHERE S3 = O2
    Star: Qs: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H }
      Partition 2 (document database):
        D: db.partition2.find({ p6: "G", p7: "I", p8: "H" })
    Rendezvous finds the right partition using the dictionary and translates the SPARQL query into the final query to be processed by the NoSQL database.
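The translation step can be sketched with two small functions. Both are hypothetical helpers, and two assumptions apply: documents are shaped as in the star fragment mapping (one field per predicate), and the per-predicate tables expose columns named `S<i>` and `O<i>` as on the slide. The generated SELECT statements are pseudo-queries that write the join condition explicitly. Cassandra's CQL has no JOIN, so a middleware would need to substitute the values returned by the previous step.

```python
Pattern = tuple[str, str, str]


def star_to_mongo_filter(patterns: list[Pattern]) -> dict:
    """Build a MongoDB filter document for a star query with a variable subject."""
    query: dict = {}
    for _, predicate, obj in patterns:
        if obj.startswith("?"):
            # Unbound object: only require that the field is present.
            query[predicate] = {"$exists": True}
        else:
            query[predicate] = obj
    return query


def chain_to_selects(patterns: list[Pattern]) -> list[str]:
    """One SELECT per predicate table, each step linked to the previous one."""
    statements = []
    for i, (_, predicate, _) in enumerate(patterns, start=1):
        sql = f"SELECT S{i}, O{i} FROM {predicate}"
        if i > 1:
            # Subject of this step must equal the object of the previous step.
            sql += f" WHERE S{i} = O{i - 1}"
        statements.append(sql)
    return statements


star = [("?w", "p6", "G"), ("?w", "p7", "I"), ("?w", "p8", "H")]
chain = [("?x", "p1", "?y"), ("?y", "p2", "?z"), ("?z", "p3", "?w")]
mongo_filter = star_to_mongo_filter(star)
selects = chain_to_selects(chain)
```

The filter dictionary is in the form accepted by the `find` method of a MongoDB collection, and the statements in `selects` would be issued in order against the partitions named by the dictionary.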
  • 34. Caching (two-level cache)
    Given the graph:
    [Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and edges p1 to p9 and p11, split into partitions P1, P2, and P3]
    After the last query was issued:
      Q: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H . ?x p1 ?y . ?y p2 ?z . ?z p3 ?w . ?y p5 ?w }
    Near cache (in-memory tree map):
      A:p1:B → {A:p1:B, B:p2:C}
      B:p2:C → {B:p2:C, C:p3:D}
    Remote cache (key/value NoSQL database):
      ...
      A:p1:B → {A:p1:B, B:p2:C}
      B:p2:C → {B:p2:C, C:p3:D}
      ...
      B:p5:F → {B:p5:F, F:p9:D}
    Normally, the near cache is smaller than the remote cache.
  • 35. Caching (querying)
    Given the graph:
    [Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and edges p1 to p9 and p11, split into partitions P1, P2, and P3]
    If the following query is issued:
      Q: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w . ?y p5 F }
    Near cache (in-memory tree map):
      A:p1:B → {A:p1:B, B:p2:C}
      B:p2:C → {B:p2:C, C:p3:D}
    Remote cache (key/value NoSQL database):
      ...
      A:p1:B → {A:p1:B, B:p2:C}
      B:p2:C → {B:p2:C, C:p3:D}
      ...
      B:p5:F → {B:p5:F, F:p9:D}
    This query is answered using only triples from the cache.
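The two-level lookup can be sketched as follows. This is a simplified stand-in, not the authors' cache: the `TwoLevelCache` class is hypothetical, both levels are plain dictionaries (the slides describe an in-memory tree map and a key/value NoSQL database), and the eviction rule for the near cache (drop the oldest entry) is an assumption.

```python
Triple = tuple[str, str, str]


class TwoLevelCache:
    """Near cache: small in-process dict. Remote cache: a dict standing in
    for a key/value store (no network access in this sketch)."""

    def __init__(self, near_capacity: int = 2) -> None:
        self.near: dict[str, list[str]] = {}
        self.remote: dict[str, list[str]] = {}
        self.near_capacity = near_capacity

    @staticmethod
    def key(triple: Triple) -> str:
        return ":".join(triple)  # e.g. "A:p1:B", as on the slides

    def _promote(self, k: str, fragment: list[str]) -> None:
        self.near.pop(k, None)
        self.near[k] = fragment
        while len(self.near) > self.near_capacity:
            # Evict the oldest entry (dicts keep insertion order).
            self.near.pop(next(iter(self.near)))

    def put(self, triple: Triple, fragment: list[str]) -> None:
        k = self.key(triple)
        self.remote[k] = fragment
        self._promote(k, fragment)

    def get(self, triple: Triple):
        """Return (fragment, level) where level is near, remote, or miss."""
        k = self.key(triple)
        if k in self.near:
            return self.near[k], "near"
        if k in self.remote:
            self._promote(k, self.remote[k])
            return self.remote[k], "remote"
        return None, "miss"


cache = TwoLevelCache(near_capacity=2)
cache.put(("B", "p5", "F"), ["B:p5:F", "F:p9:D"])
cache.put(("A", "p1", "B"), ["A:p1:B", "B:p2:C"])
cache.put(("B", "p2", "C"), ["B:p2:C", "C:p3:D"])
```

After these three insertions the near cache holds the two most recent keys and the remote cache holds all three, which reproduces the state shown on the slides. A lookup for `B:p5:F` falls through to the remote level.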
  • 37. Evaluation
    ● LUBM: an ontology for the university domain, synthetic RDF data scalable to any size, and 14 extensional queries representing a variety of properties
    ● Generated dataset with 4000 universities (around 100 GB, around 500 million triples)
    ● 12 queries with joins: all of them have at least one subject-subject join, and six of them also have at least one subject-object join
    ● Apache Jena 3.2.0 with Java 1.8, together with Redis 3.2, MongoDB 3.4.3, and Apache Cassandra 3.10
    ● Amazon m3.xlarge spot instances with 7.5 GB of memory and 1 x 32 GB SSD capacity
  • 38. Evaluation: Rendezvous performance
    The larger the number of hops (that is, the replication), the larger the dataset size and the loading time, both of which grow exponentially. However, because joins are avoided, the response time decreases.
  • 39. Evaluation: Rendezvous with different settings
    Performance is better when the partitions are managed by Rendezvous. The larger the boundary replication, the faster the response time, without a big impact on the dataset size.
  • 41. Conclusions
    ● Rendezvous contributes to:
      ○ The graph partitioning problem, via fragments
      ○ Better query response time through n-hop and boundary replication
      ○ Better query response time via two-level caching
      ○ Scalable RDF storage provided by NoSQL databases
    ● About the evaluation:
      ○ Fragments are scalable
      ○ Bigger boundaries are not necessarily related to bigger storage size
      ○ Graph-aware partitions are better than NoSQL partitions
      ○ Near cache is fast, but it makes data consistency harder to maintain
  • 42. Future Work
    ● Formalize the query mapping
      ○ No standard query language to rely on
    ● Compression of triples during storage
    ● Update and delete operations
    ● Other NoSQL types (e.g., graph)
    ● Better datasets
  • 43. Thank you! Simpósio Brasileiro de Banco de Dados (SBBD), Uberlândia, October 2017. Luiz Henrique Zambom Santana; Prof. Dr. Ronaldo dos Santos Mello
  • 50. State of the Art - SQL Triplestores
    WARP, Hexastore, YARS, 4store, SPIDER, RDF-3x, SHARd, SW-Store, SOLID, SPOVC, S2X
  • 51. State of the Art - NoSQL Triplestores
    RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF, Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF, H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON, Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk, Bouhali and Laurent, Papailiou et al., and ScalaRDF.
  • 52. State of the Art - Triplestores
    Recent survey (September 2017): Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, Panos Kalnis: A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data. Proceedings of the VLDB Endowment, Volume 10, No. 13, September 2017, 2049-2060.