1. Building a Graph based RDF Store for Apache Cassandra
Name: Ravindra Ranwala
ID: 138227T
Supervisor: Dr. Amal Shehan Perera
2. Agenda
● Introduction
● Basic Concepts
● The Problem
● Literature Review
● Methodology
● Demo
● Evaluation and Result
● Conclusion
3. Introduction
● RDF is used to represent data and support queries on the
Semantic Web.
● Large RDF stores can contain billions, and potentially
trillions, of triples.
● RDF data is now widespread: commercial search engines
(e.g. Google, Yahoo, Bing) have driven its proliferation.
● SPARQL is the standard query language for RDF.
● Different approaches exist to build a triple store.
● Main challenges are system scalability and generality.
4. Basic Concepts - RDF Triple
● An RDF dataset consists of statements of the form
(subject, predicate, object).
● Each statement says that the subject has a property,
named by the predicate, whose value is the object.
● Example: <Titanic, has award, Best Picture>
● The core of the Semantic Web is built on the RDF data
model.
● These triples can be stored in different ways.
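As a minimal sketch of the triple model above, the following Python snippet stores statements as (subject, predicate, object) tuples and looks them up by pattern. The `Triple` type and the `match` helper are hypothetical names introduced for illustration, and the second triple is made-up sample data; none of this is taken from the presented implementation.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


# The first statement is the slide's example; the second is illustrative only.
TRIPLES = [
    Triple("Titanic", "has award", "Best Picture"),
    Triple("Titanic", "directed by", "James Cameron"),
]


def match(triples, s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.object == o)
    ]


# "What awards does Titanic have?" corresponds to the pattern (Titanic, has award, ?)
awards = [t.object for t in match(TRIPLES, s="Titanic", p="has award")]
```

A flat list like this is only one of the possible storage layouts; it makes every lookup a full scan, which is why real stores add indexes or a graph structure.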
5. The Problem
● Apache Cassandra is a NoSQL database that supports
multi-tenancy and multi-data-center deployments.
● Our objective is to build a scalable RDF store for
Apache Cassandra.
● Cassandra is used by eBay, Twitter, Cisco, etc.
● An RDF store would add considerable value to Cassandra.
● The largest known Cassandra cluster has 300 TB of
data over 400 machines.
● This motivates us to build a distributed, scalable RDF
store that answers user queries over such data efficiently.
6. Literature Review - Concepts
● A triple store can be built on top of any DBMS or file system.
● An RDF dataset consists of statements of the form <subject, predicate,
object>.
● The subject has a property, named by the predicate, whose value is the object.
● Example: <person1, name, Mike>
● A typical triple store holds millions or billions of such triples.
● Efficient and scalable management of RDF data is a fundamental
challenge.
● SPARQL queries are submitted to the RDF store.
Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, Zhongyuan Wang, "A Distributed Graph
Engine for Web Scale RDF Data"
7. Apache Cassandra
● Distributed, fault-tolerant (i.e. no single point of failure),
post-relational NoSQL database system.
● Peer-to-peer distributed architecture. Supports both strict
and eventual consistency.
● All nodes are equal; there are no master or slave
nodes.
● Uses a read/write-anywhere architecture.
DataStax Corporation. (2011, October) “Welcome to Apache Cassandra 1.0”
8. Triple Store - Approaches
● Different approaches exist for managing RDF data.
● Each approach has its own advantages and
disadvantages.
9. Relational Approach
● Triples are stored using the relational model.
Justin J. Levandoski, Mohamed F. Mokbel, "RDF Data-Centric Storage"
10. Relational Approach (contd.)
● Triple store: one three-column table; queries need costly
self-joins over a huge table (potentially trillions of triples).
● N-ary table: eliminates the need for joins, but leads to a
higher number of nulls.
● Binary tables: reduce null storage, but introduce costly joins.
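To illustrate why the triple-store layout leads to self-joins, the sketch below uses Python's built-in `sqlite3` module with a single three-column table. The table name, the `livesIn` predicate, and the sample rows are assumptions made for illustration. Answering "which subjects are named Mike and live in Berlin?" needs one copy of the table per triple pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("person1", "name", "Mike"),
        ("person1", "livesIn", "Berlin"),
        ("person2", "name", "Mike"),
        ("person2", "livesIn", "Paris"),
    ],
)

# Two triple patterns share the same subject, so the table is joined with itself.
rows = conn.execute(
    """
    SELECT t1.s
    FROM triples AS t1
    JOIN triples AS t2 ON t1.s = t2.s
    WHERE t1.p = 'name' AND t1.o = 'Mike'
      AND t2.p = 'livesIn' AND t2.o = 'Berlin'
    """
).fetchall()
subjects = [row[0] for row in rows]
conn.close()
```

Each additional triple pattern in a query adds another self-join over the same table, which is the cost that grows with the size of the store.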
11. Graph based approaches
● A newer approach that greatly improves the performance of
SPARQL query processing.
● Uses graph exploration instead of joins.
● Unnecessary intermediate results can be pruned early.
● Models RDF data in its native graph form.
● Examples: Trinity, TripleRush etc.
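A rough sketch of graph exploration in place of joins, assuming an in-memory adjacency-list graph. The names `build_graph` and `explore` and the sample data are hypothetical; this is not the Trinity or TripleRush algorithm, only an illustration of following edges from already-bound nodes so that non-matching candidates are dropped early.

```python
from collections import defaultdict


def build_graph(triples):
    """Index triples as subject -> predicate -> set of objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        graph[s][p].add(o)
    return graph


def explore(graph, start_nodes, path):
    """Follow a chain of predicates from start_nodes, one hop per predicate.

    Nodes with no outgoing edge for the current predicate are pruned,
    so they contribute no intermediate results to later hops.
    """
    frontier = set(start_nodes)
    for predicate in path:
        next_frontier = set()
        for node in frontier:
            # .get() avoids inserting empty entries into the defaultdict
            next_frontier |= graph.get(node, {}).get(predicate, set())
        frontier = next_frontier
    return frontier


triples = [
    ("person1", "worksFor", "companyA"),
    ("person2", "worksFor", "companyB"),
    ("companyA", "locatedIn", "Berlin"),
]
graph = build_graph(triples)

# person2 is pruned at the second hop because companyB has no locatedIn edge.
cities = explore(graph, ["person1", "person2"], ["worksFor", "locatedIn"])
```

The work done at each hop is proportional to the edges actually reachable from the current frontier, rather than to the size of the whole triple set.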
12. Trinity RDF
● Graph-based implementation. Models RDF data as a directed, labelled graph.
● Subjects and objects are represented as nodes.
● Each predicate is represented as a directed, labelled edge.
● The graph is stored in memory for fast access.
B. Shao, H. Wang, and Y. Li, "The Trinity graph engine," Technical Report 161291, Microsoft Research
13. Trinity Architecture
● Distributed in-memory key-value store.
● Partitions the RDF graph across multiple machines by hashing on the nodes.
● Each machine holds a disjoint part of the graph.
● Final result is assembled at the proxy.
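The partitioning idea can be sketched as follows, under simplifying assumptions. The function names, the use of MD5 as a stable hash, and the machine count are illustrative choices and are not taken from Trinity: each node is assigned to one machine by hashing its identifier, so the partitions are disjoint, and in this simplified version each triple is stored only with its subject node.

```python
import hashlib
from collections import defaultdict


def machine_for(node: str, num_machines: int) -> int:
    """Map a node ID to a machine index using a hash that is stable across runs."""
    digest = hashlib.md5(node.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def partition(triples, num_machines: int):
    """Place each triple on the machine that owns its subject node."""
    partitions = defaultdict(list)
    for s, p, o in triples:
        partitions[machine_for(s, num_machines)].append((s, p, o))
    return partitions


triples = [
    ("person1", "name", "Mike"),
    ("person1", "livesIn", "Berlin"),
    ("person2", "name", "Anna"),
]
parts = partition(triples, num_machines=4)
```

A digest-based hash is used here instead of Python's built-in `hash()`, whose value for strings changes between processes and so could not be used to agree on node placement across machines.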
Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, Zhongyuan Wang, "A Distributed Graph Engine for Web
Scale RDF Data"
14. Methodology
● Use case Scenarios
○ Populating data into Cassandra Cluster
○ Building the RDF Graph
○ Querying the RDF Graph
○ Dropping the RDF Store
● Technologies used.
○ Apache Jena RDF API
○ Struts 2
○ Java/JSP/XSLT/XML/XPath
17. Evaluation and Result
● The DBpedia benchmark was used for comparison.
● The DBpedia geo-coordinates and homepages datasets were
used, totalling about 0.7 million triples.
● The 4Store and Bigdata RDF stores were compared with our
implementation.
● Queries used:
○ Query One: Finds the homepage of the Metropolitan Museum of Art.
○ Query Two: Finds the homepage of Kevin_Bacon.
○ Query Three: Finds all resources located near Berlin, together
with their homepages.
○ Query Four: Finds all resources located near New York, together
with their homepages.
18. Benchmark Results
● Query complexity increases from Q1 through Q4.
● The table shows the time each RDF store took to execute the four queries above.
● Query execution time is measured in ms.
                     Q1       Q2      Q3        Q4
Our implementation   216 ms   7 ms    336 ms    279 ms
4Store               16 ms    18 ms   455 ms    416 ms
Bigdata              41 ms    30 ms   2355 ms   1600 ms
DBpedia. (2008, Jan 10.) RDF Store Benchmarks with DBpedia [Online]. Available:
http://wifo5-03.informatik.uni-mannheim.de/benchmarks-200801/
21. Benchmarking Analysis
● The graph-based approach yields larger performance gains
as queries become more complex.
● Query complexity increases gradually from Query 1 to Query 4.
● This implementation outperforms 4Store and Bigdata on
Q2 through Q4, with the largest gains over Bigdata on Q3 and Q4.
● The first query (Q1) is slower because its time includes
building the index structure.
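The comparison can be made concrete with a little arithmetic on the reported timings. The snippet below is only a convenience calculation over the numbers in the benchmark results (with Bigdata's times converted to milliseconds); a ratio above 1 means the other store took longer than our implementation on that query.

```python
# Execution times in milliseconds, copied from the benchmark results.
times_ms = {
    "Our implementation": {"Q1": 216, "Q2": 7, "Q3": 336, "Q4": 279},
    "4Store": {"Q1": 16, "Q2": 18, "Q3": 455, "Q4": 416},
    "Bigdata": {"Q1": 41, "Q2": 30, "Q3": 2355, "Q4": 1600},
}

ours = times_ms["Our implementation"]

# Ratio of each store's time to ours, per query, rounded to two decimals.
ratios = {
    store: {q: round(t / ours[q], 2) for q, t in results.items()}
    for store, results in times_ms.items()
    if store != "Our implementation"
}

for store, per_query in ratios.items():
    print(store, per_query)
```

By this calculation the other stores are faster only on Q1, where the index is built; on Q3 and Q4 Bigdata takes roughly six to seven times as long as our implementation, and 4Store roughly 1.4 to 1.5 times as long.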
22. Future Work
● The main limitation of the approach is scalability.
● Larger datasets lead to an OutOfMemory error while building the in-memory graph model.
● Proposed solution: a distributed implementation.
23. Conclusion
● Approaches used to model and retrieve RDF data.
● New approaches to manage RDF data efficiently.
● Graph based approach.
● New Implementation
○ Use case scenarios
○ Evaluation and result using DBPedia dataset
○ Benchmark Analysis