NoSQL Databases for RDF: An Empirical Evaluation

Processing large volumes of RDF data requires sophisticated tools. In recent years, much effort has been spent on optimizing native RDF stores and on repurposing relational query engines for large-scale RDF processing. Concurrently, a number of new data management systems, grouped under the NoSQL (for "not only SQL") umbrella, rapidly rose to prominence and today represent a popular alternative to classical databases. Though NoSQL systems are increasingly used to manage RDF data, it is still difficult to grasp their key advantages and drawbacks in this context. This work is, to the best of our knowledge, the first systematic attempt at characterizing and comparing NoSQL stores for RDF processing. In the following, we describe four different NoSQL stores and compare their key characteristics when running standard RDF benchmarks on a popular cloud infrastructure using both single-machine and distributed deployments.

http://ribs.csres.utexas.edu/nosqlrdf/


Transcript

  • 1. NoSQL Databases for RDF: An Empirical Evaluation
    ➢ Karlsruhe Institute of Technology: Andreas Harth, Felix Leif Keppmann
    ➢ VU University Amsterdam: Sever Fundatureanu, Paul Groth
    ➢ University of Texas at Austin: Albert Haque, Daniel Miranker, Juan Sequeda
    ➢ University of Fribourg: Philippe Cudre-Mauroux, Iliya Enchev, Marcin Wylot
    International Semantic Web Conference, 24th October 2013, Sydney, Australia
  • 2. Outline
    ➢ Motivation
    ➢ Systems; Experimental Scenario
    ➢ Results; Discussion; Analysis
  • 3. Motivation
    ➢ RDF merits the use of big data infrastructure
    ➢ NoSQL systems are commonly used for handling big data outside the RDF space
    ➢ existing NoSQL infrastructures
    ➢ scalability
    ➢ storing RDF data in NoSQL databases has not been thoroughly explored
  • 4. What It Is and Is NOT About
    It is about:
    ➢ finding differences between NoSQL and native systems
    ➢ finding bottlenecks
    ➢ providing an environment for replicable tests
    ➢ finding commonalities across the performance profiles
    It is NOT about:
    ➢ selecting an “overall winner”
    ➢ creating a new benchmark
    ➢ creating a new system
  • 5. Systems
    ➢ four systems in development plus a native one
    ➢ variety of NoSQL system types
        Column Store: Jena + HBase
        Query Translation: Hive + HBase
        Document Store: Couchbase
        Key-Value Store: CumulusRDF (Cassandra + Sesame)
    ➢ 4store as the baseline: a native distributed RDF database
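
The system types listed above all have to map RDF triples onto keys and values in some way. As a rough illustration only, the following Python sketch keeps three permuted indexes (subject-predicate-object, predicate-object-subject, object-subject-predicate) in nested dictionaries that stand in for a key-value store. The class name `TripleIndex`, the in-memory dictionaries, and the example terms are assumptions made for this sketch; it is not the storage layout of any of the evaluated systems.

```python
from collections import defaultdict


class TripleIndex:
    """Toy in-memory stand-in for a key-value store holding RDF triples (hypothetical)."""

    def __init__(self):
        # One nested map per key ordering: first term -> second term -> set of third terms.
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        """Store one triple under all three orderings."""
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Return the sorted triples matching a pattern; None acts as a wildcard."""
        if s is not None:
            # Subject is bound: look up the SPO index by subject.
            candidates = ((s, p2, o2)
                          for p2, objs in self.spo.get(s, {}).items()
                          for o2 in objs)
        elif p is not None:
            # Only predicate (and maybe object) bound: use the POS index.
            candidates = ((s2, p, o2)
                          for o2, subs in self.pos.get(p, {}).items()
                          for s2 in subs)
        elif o is not None:
            # Only object bound: use the OSP index.
            candidates = ((s2, p2, o)
                          for s2, preds in self.osp.get(o, {}).items()
                          for p2 in preds)
        else:
            # Nothing bound: full scan.
            candidates = ((s2, p2, o2)
                          for s2, preds in self.spo.items()
                          for p2, objs in preds.items()
                          for o2 in objs)
        # Filter on any remaining bound terms.
        return sorted(t for t in candidates
                      if (s is None or t[0] == s)
                      and (p is None or t[1] == p)
                      and (o is None or t[2] == o))


idx = TripleIndex()
idx.add("ex:alice", "foaf:knows", "ex:bob")
idx.add("ex:bob", "foaf:knows", "ex:carol")
idx.add("ex:alice", "foaf:name", '"Alice"')
print(idx.match(p="foaf:knows"))
```

Keeping several orderings trades extra storage and load time for the ability to answer any single triple pattern with a key lookup instead of a scan.
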
  • 6. Benchmarks
    Berlin SPARQL Benchmark
    ➢ 10,225,034 triples (scale factor: 28,850)
    ➢ 100,000,748 triples (scale factor: 284,826)
    ➢ 1,008,396,956 triples (scale factor: 2,878,260)
    DBpedia SPARQL Benchmark (ISWC 2011)
    ➢ 153,737,783 triples (scale factor: 100%)
  • 7. Experimental Settings
    Amazon EC2 (Elastic Compute Cloud) infrastructure
    ➢ instance type m1.large
    ➢ 64-bit platform
    ➢ 2 virtual cores with 2 EC2 Compute Units each
    ➢ 7.5 GB of memory
    ➢ 850 GB of local storage
    Hadoop’s TeraSort
    ➢ 16 workers plus master node
    ➢ 1 TB of data generated in 3,933 seconds (1.09 hours)
    ➢ 10 billion 100-byte records
    ➢ benchmark completed in 11,234 seconds (3.12 hours)
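
The TeraSort figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated on the slide; the variable and function names are chosen for this example, and dividing evenly across the 16 workers is an assumption (it ignores the master node and any skew).

```python
# Figures taken from the slide.
RECORDS = 10_000_000_000   # 10 billion records
RECORD_BYTES = 100         # 100 bytes per record
GENERATE_SECONDS = 3_933   # time to generate the data
SORT_SECONDS = 11_234      # time to complete the benchmark
WORKERS = 16               # worker nodes (master excluded)

# 10^12 bytes, i.e. 1 TB in decimal units.
total_bytes = RECORDS * RECORD_BYTES


def hours(seconds):
    """Convert seconds to hours."""
    return seconds / 3600


def throughput_mb_per_s(num_bytes, seconds):
    """Aggregate throughput in decimal megabytes per second."""
    return num_bytes / seconds / 1e6


generate_rate = throughput_mb_per_s(total_bytes, GENERATE_SECONDS)
sort_rate = throughput_mb_per_s(total_bytes, SORT_SECONDS)

print(f"data size:   {total_bytes / 1e12:.2f} TB")
print(f"generation:  {hours(GENERATE_SECONDS):.2f} h, {generate_rate:.1f} MB/s cluster-wide")
print(f"sort:        {hours(SORT_SECONDS):.2f} h, {sort_rate:.1f} MB/s cluster-wide")
# Assumes an even split of work across workers.
print(f"sort/worker: {sort_rate / WORKERS:.2f} MB/s")
```
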
  • 8. Results - 4store
    ➢ more nodes reduce query time
    ➢ queries touching a lot of data perform slower (Q5, Q7)
    ➢ complex joins are challenging (Q7, Q8)
  • 9. Results - Jena + HBase
    ➢ highly selective queries: Q2, Q8, Q9, Q11, and Q12
    ➢ low-selectivity queries: Q1, Q3, and Q10
    ➢ queries touching a lot of data: Q5 and Q7
  • 10. Results - Hive + HBase
    ➢ more nodes reduce query time
    ➢ the MapReduce shuffle stage dominates the running time
    ➢ performs faster on the sparser dataset
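
To see why a shuffle stage can dominate, it helps to look at a reduce-side join, where every tuple that might take part in the join is regrouped by its join key before any result is produced. The following single-process Python sketch imitates the three phases for a two-pattern join (`?x left_pred ?y` joined with `?y right_pred ?z`). The function names, the "L"/"R" tags, and the sample triples are assumptions made for illustration; this is not the query plan Hive actually generated in the evaluation.

```python
from collections import defaultdict


def map_phase(triples, left_pred, right_pred):
    """Emit (join_key, tagged_value) pairs for the patterns
    ?x left_pred ?y and ?y right_pred ?z, joined on ?y."""
    for s, p, o in triples:
        if p == left_pred:
            yield o, ("L", s)   # key on the object of the left pattern
        if p == right_pred:
            yield s, ("R", o)   # key on the subject of the right pattern


def shuffle(pairs):
    """Group all emitted values by key; on a cluster this is the network-heavy step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine left and right values that share a join key."""
    for key, values in groups.items():
        lefts = [v for tag, v in values if tag == "L"]
        rights = [v for tag, v in values if tag == "R"]
        for x in lefts:
            for z in rights:
                yield (x, key, z)


triples = [
    ("a", "knows", "b"),
    ("b", "name", "Bob"),
    ("c", "knows", "b"),
    ("d", "name", "Dee"),
]
pairs = list(map_phase(triples, "knows", "name"))
results = sorted(reduce_phase(shuffle(pairs)))
print(len(pairs), "pairs shuffled,", len(results), "join results")
```

Every emitted pair is moved during the shuffle, including those that never find a join partner, so the data moved can be much larger than the final result.
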
  • 11. Results - Couchbase
    ➢ encounters problems while loading data on small clusters
    ➢ query execution time is relatively fast
    ➢ with bigger clusters, the execution time is constant
  • 12. Results - CumulusRDF
    ➢ complex queries (Q1, Q3, Q4, Q5) are very challenging
    ➢ performance tends to decrease as cluster size increases
    ➢ joins require comparisons on strings
  • 13. Analysis
    ➢ distributed NoSQL systems can be competitive against distributed native RDF stores
    ➢ simple workloads perform really well
    ➢ more complex queries perform poorly
    ➢ classical query optimization techniques work well
    ➢ loading time varies depending on the system and indexing approach
  • 14. Conclusions
    ➢ NoSQL systems represent a compelling alternative to native RDF stores for simple workloads.
    http://ribs.csres.utexas.edu/nosqlrdf/