NoSQL Databases for RDF:
An Empirical Evaluation

Processing large volumes of RDF data requires sophisticated tools. In recent years, much effort was spent on optimizing native RDF stores and on repurposing relational query engines for large-scale RDF processing. Concurrently, a number of new data management systems, grouped under the NoSQL (for "not only SQL") umbrella, rapidly rose to prominence and today represent a popular alternative to classical databases. Though NoSQL systems are increasingly used to manage RDF data, it is still difficult to grasp their key advantages and drawbacks in this context. This work is, to the best of our knowledge, the first systematic attempt at characterizing and comparing NoSQL stores for RDF processing. In the following, we describe four different NoSQL stores and compare their key characteristics when running standard RDF benchmarks on a popular cloud infrastructure using both single-machine and distributed deployments.

http://ribs.csres.utexas.edu/nosqlrdf/

  1. NoSQL Databases for RDF: An Empirical Evaluation
     Karlsruhe Institute of Technology: Andreas Harth, Felix Leif Keppmann
     VU University Amsterdam: Sever Fundatureanu, Paul Groth
     University of Texas at Austin: Albert Haque, Daniel Miranker, Juan Sequeda
     University of Fribourg: Philippe Cudre-Mauroux, Iliya Enchev, Marcin Wylot
     International Semantic Web Conference, 24th October 2013, Sydney, Australia
  2. Outline
     ➢ Motivation
     ➢ Systems; Experimental Scenario
     ➢ Results; Discussion; Analysis
  3. Motivation
     ➢ RDF merits the use of big data infrastructure
     ➢ commonly used for handling big data outside the RDF space
     ➢ existing NoSQL infrastructures
     ➢ scalability
     ➢ storing RDF data in NoSQL databases has not been thoroughly explored
  4. What It Is and What It Is NOT About
     it is about
     ➢ find differences between NoSQL and native systems
     ➢ find bottlenecks
     ➢ provide environment for replicable tests
     ➢ find commonalities across the performance profiles
     it is NOT about
     ➢ select an "overall winner"
     ➢ create a new benchmark
     ➢ create a new system
  5. Systems
     ➢ four systems in development plus a native one
     ➢ variety of NoSQL system types
        Column Store: Jena + HBase
        Query Translation: Hive + HBase
        Document Store: Couchbase
        Key-Value Store: CumulusRDF (Cassandra + Sesame)
     ➢ 4store: baseline, native distributed database
  6. Benchmarks
     Berlin SPARQL Benchmark
     ➢ 10,225,034 triples (scale factor: 28,850)
     ➢ 100,000,748 triples (scale factor: 284,826)
     ➢ 1,008,396,956 triples (scale factor: 2,878,260)
     DBpedia SPARQL Benchmark (ISWC 2011)
     ➢ 153,737,783 triples (scale factor: 100%)
  7. Experimental Settings
     Amazon EC2 (Elastic Compute Cloud) infrastructure
     ➢ instance type m1.large
     ➢ 64-bit platforms
     ➢ 2 virtual cores with 2 EC2 Compute Units each
     ➢ 7.5 GB of memory
     ➢ 850 GB of local storage
     Hadoop's TeraSort
     ➢ 16 workers plus master node
     ➢ 1 TB of data generated in 3,933 seconds (1.09 hours)
     ➢ 10 billion 100-byte records
     ➢ benchmark completed in 11,234 seconds (3.12 hours)
  8. Results - 4store
     ➢ more nodes reduce query time
     ➢ queries touching a lot of data perform slower (Q5, Q7)
     ➢ complex joins are challenging (Q7, Q8)
  9. Results - Jena + HBase
     ➢ highly selective queries: Q2, Q8, Q9, Q11, and Q12
     ➢ low-selectivity queries: Q1, Q3, and Q10
     ➢ queries touching a lot of data: Q5 and Q7
  10. Results - Hive + HBase
     ➢ more nodes reduce query time
     ➢ MapReduce shuffle stage dominates the running time
     ➢ performs faster on sparser datasets
  11. Results - Couchbase
     ➢ encounters problems while loading data on small clusters
     ➢ query execution is relatively fast
     ➢ with bigger clusters, the execution time stays constant
  12. Results - CumulusRDF
     ➢ complex queries (Q1, Q3, Q4, Q5) are very challenging
     ➢ performance tends to decrease as cluster size increases
     ➢ joins require string comparisons
  13. Analysis
     ➢ distributed NoSQL can be competitive against distributed native RDF stores
     ➢ simple workloads perform really well
     ➢ more complex queries perform poorly
     ➢ classical query optimization techniques work well
     ➢ loading time varies depending on the system and indexing approach
  14. Conclusions
     http://ribs.csres.utexas.edu/nosqlrdf/
     NoSQL systems represent a compelling alternative to native RDF stores for simple workloads.
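
The TeraSort figures on the Experimental Settings slide can be cross-checked with a little arithmetic. The sketch below is not from the slides: it only reuses the reported record count, record size, worker count and timings. The helper names (`hours`, `mb_per_second`) are illustrative, and the throughput values are derived here rather than reported by the authors.

```python
# Figures copied from the Experimental Settings slide (Hadoop TeraSort run).
RECORDS = 10_000_000_000   # 10 billion records
RECORD_BYTES = 100         # 100 bytes per record
GENERATE_SECONDS = 3_933   # reported time to generate the data
SORT_SECONDS = 11_234      # reported time to complete the benchmark
WORKERS = 16               # worker nodes (the master node is excluded)

# 10 billion records of 100 bytes each give the 1 TB quoted on the slide.
total_bytes = RECORDS * RECORD_BYTES


def hours(seconds):
    """Convert seconds to hours."""
    return seconds / 3600


def mb_per_second(num_bytes, seconds):
    """Aggregate throughput in decimal megabytes per second (derived, not reported)."""
    return num_bytes / seconds / 1e6


if __name__ == "__main__":
    print(f"data volume : {total_bytes / 1e12:.2f} TB")
    print(f"generation  : {hours(GENERATE_SECONDS):.2f} h")
    print(f"sort        : {hours(SORT_SECONDS):.2f} h")
    sort_rate = mb_per_second(total_bytes, SORT_SECONDS)
    print(f"sort rate   : {sort_rate:.1f} MB/s overall, "
          f"{sort_rate / WORKERS:.2f} MB/s per worker")
```

The conversions agree with the values in parentheses on the slide (1.09 and 3.12 hours). The per-worker rate assumes the load is spread evenly over the 16 workers, which the slides do not state.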