RDF Processing for Java: A Comparative Study


Published in: Education, Technology

Ioanid Tibuleac, Cristian Turlica
Facultatea de Informatica, Universitatea „Al. I. Cuza“, Iasi, Romania
{ioanid.tibuleac, cristian.turlica}@info.uaic.ro

Abstract. This paper introduces two RDF processing APIs for Java developers, Jena and Sesame. We consider their RDF storage capabilities, RDF access through SPARQL queries, and overall programmer support. We ran tests to estimate which of the two executes SPARQL queries faster on an in-memory graph read from a file. We conclude that Jena generally runs slower than Sesame when executing a single query, but its optimizations allow it to perform better when executing a sequence of queries.

Keywords: API, RDF, SPARQL, Java, Jena, Sesame.

1 Introduction

This paper discusses certain aspects of RDF processing APIs for Java developers. We chose two APIs that are, in our opinion, among the most widely used. Jena and Sesame both offer RDF data access, storage in files, SQL databases or native RDF databases, querying, and inferencing. These features led us to select them as our test cases.

2 Jena RDF API

Jena is an open source Semantic Web framework for Java developed by researchers from the HP Labs Semantic Web Programme [1]. It provides support for RDF manipulation, from creation and storage of statements to SPARQL queries and RDF graph operations. Besides the RDF API, the Jena framework also contains the OWL API, a component for processing ontologies, and a rule-based inference engine.

In Jena, RDF elements are modeled as Java types. The RDF graph concept is called a model and is handled through a Model instance. Other concepts such as resource, property and literal are represented by the Resource, Property and Literal interfaces.
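
The mapping from RDF concepts to Java types can be pictured with a small, library-free sketch. This is not Jena's code: the names `ToyRdfModel`, `Node`, `Resource`, `Property`, `Literal` and `Statement` are hypothetical and only mirror the concept names, the example.org URIs are placeholders, and Java 16 or later is assumed for records. The sketch stores statements in a set, so duplicates disappear and graph union and intersection reduce to ordinary set operations.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy stand-in for an RDF model. All names are hypothetical; this is not Jena's API. */
final class ToyRdfModel {
    /** Anything allowed in the object position of a statement. */
    interface Node {}
    record Resource(String uri) implements Node {}
    record Property(String uri) {}
    record Literal(String value) implements Node {}
    record Statement(Resource subject, Property predicate, Node object) {}

    // Records compare by value, so the set silently drops duplicate statements.
    private final Set<Statement> statements = new LinkedHashSet<>();

    ToyRdfModel add(Resource s, Property p, Node o) {
        statements.add(new Statement(s, p, o));
        return this;
    }

    int size() {
        return statements.size();
    }

    /** Returns a new model holding the statements of both models. */
    ToyRdfModel union(ToyRdfModel other) {
        ToyRdfModel result = new ToyRdfModel();
        result.statements.addAll(statements);
        result.statements.addAll(other.statements);
        return result;
    }

    /** Returns a new model holding only the statements present in both models. */
    ToyRdfModel intersection(ToyRdfModel other) {
        ToyRdfModel result = new ToyRdfModel();
        result.statements.addAll(statements);
        result.statements.retainAll(other.statements);
        return result;
    }

    public static void main(String[] args) {
        Resource paper = new Resource("http://example.org/paper");
        Property title = new Property("http://example.org/title");
        Property topic = new Property("http://example.org/topic");

        ToyRdfModel a = new ToyRdfModel()
                .add(paper, title, new Literal("RDF Processing for Java"))
                .add(paper, title, new Literal("RDF Processing for Java")); // duplicate
        ToyRdfModel b = new ToyRdfModel()
                .add(paper, title, new Literal("RDF Processing for Java"))
                .add(paper, topic, new Literal("SPARQL"));

        System.out.println(a.size());                 // 1
        System.out.println(a.union(b).size());        // 2
        System.out.println(a.intersection(b).size()); // 1
    }
}
```

A real model in either framework carries much more (namespaces, blank nodes, typed literals), so the sketch is only meant to show why a set-based design removes duplicates for free.
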
The Resource, Property and Literal interfaces are contained in the jena.rdf.model package, together with a ModelFactory that allows the creation of models with various storage methods. The model is built as a set of statements, which eliminates duplicates, and supports the union, intersection and difference graph operations. The jena.rdf.impl package provides the implementations of the RDF element interfaces that the model uses.

The Jena framework offers various representation modes for RDF triples. Besides memory and file storage, Jena comes with two systems designed to persist RDF information, TDB and SDB.

File-level storage is obtained using Java InputStreams and OutputStreams. Although the API also contains methods to read or write triples with a Java Reader or Writer, the documentation strongly warns against using these methods when writing files, because problems may appear due to the encoding of the output file. Supported RDF formats are RDF/XML, RDF/XML-ABBREV, N3, N-TRIPLE and TURTLE.

SDB offers RDF persistence using conventional SQL databases. As a result, database-specific tools can be used to improve and secure data access, while SPARQL queries remain supported. Several database management systems can be used, including Microsoft SQL Server 2005, Oracle 10g and PostgreSQL.

TDB offers native support for triples and SPARQL queries, allowing custom indexing and storage. This Java engine uses both static and dynamic optimization of SPARQL queries, taking into account partially retrieved data. According to the developers, these features make the TDB engine faster than SDB.

The Jena framework contains an implementation of the W3C SPARQL specification, the ARQ query engine. The access given by the model interface is limited to iterating over statements that satisfy certain conditions, but this is extended by the jena.rdf.query package. The supported SPARQL query forms are SELECT, CONSTRUCT, DESCRIBE and ASK.

The Jena framework comes with documentation and tutorials that allow programmers to easily test its capabilities. In-depth information is also available for more experienced users. Community information is available on various sites, such as [5], which suggests that the Jena framework is in use and that its development will continue.
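
The warning about Reader and Writer in this section comes down to character encoding: a Writer turns characters into bytes using some charset, and if that charset is not the one the file is later read with, non-ASCII text is damaged. The sketch below shows only this general Java mechanism using the standard library; it does not call Jena, and the class name `EncodingPitfall` and the sample string are illustrative assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: shows how a Writer's charset changes the bytes that reach a file. */
final class EncodingPitfall {
    /** Pushes text through a Writer bound to the given charset and returns the raw bytes. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String literal = "Ia\u0219i"; // "Iasi" spelled with s-comma, one non-ASCII character

        byte[] utf8 = encode(literal, StandardCharsets.UTF_8);
        byte[] latin1 = encode(literal, StandardCharsets.ISO_8859_1);

        // UTF-8 needs two bytes for U+0219; ISO-8859-1 cannot represent it at all.
        System.out.println(utf8.length);   // 5
        System.out.println(latin1.length); // 4

        // Decoding shows the loss: the unmappable character was replaced by '?'.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // Ia?i
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(literal)); // true
    }
}
```

Writing through an OutputStream leaves the choice of encoding to the RDF serializer instead of to whatever charset a caller-supplied Writer happens to use, which is one plausible reading of why the stream-based methods are recommended.
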

3 Sesame

Sesame is an open source framework for storing, inferencing over and querying RDF data [2]. The RDF API may be used to manipulate statements in a regular Java application, or as part of a client-server application. The Sesame framework also contains an HTTP server that can be addressed using the SPARQL protocol.

The Sesame framework has a more complex architecture. At the base of the architecture is the RDF Model, where the basic RDF concepts, such as literal or statement, are defined as interfaces. There are other specialized components, such as Rio (RDF I/O), which manages reading and writing RDF in various file formats, and the Sail API (Storage And Inference Layer), which gives uniform access to an RDF store regardless of its underlying implementation. The API used to manipulate RDF data at a higher level is the Repository API, which offers access via the Sail API or via HTTP to a remote repository. Sesame offers in-memory, native and remote access to RDF data.

The Sesame framework uses SeRQL (Sesame RDF Query Language). Apparently this language is very similar to SPARQL, and features have been adopted back and forth between the two. Though we did not take the time to look for significant differences between them, a language that differs in part from the standard may require additional time to get used to.

The Sesame framework comes with extensive documentation, but unfortunately it may prove too difficult for less experienced users. Running a simple program proved a little difficult for us at first because of the additional libraries used by the Repository API (for example, Simple Logging Facade for Java). As a result, we turned to online help such as [4]. Overall, the documentation is perhaps more detailed than Jena's, but simple examples are scarce.

4 SPARQL Tests

We ran several tests using the two APIs and two RDF files of different sizes. The development environment was Eclipse. For Sesame, we used a code sample available at [4]. Our main focus was testing SPARQL query execution speed, using files as storage for the RDF statements.

The first RDF file is the larger one and contains information about sessions and speakers at a conference [3]. The SPARQL query selects information about distinct presentations:

    SELECT DISTINCT ?title ?presenter ?description
    WHERE {
      ?session rdf:type svcc:Session .
      ?session dc:title ?title .
      ?session svcc:presenter ?presenter .
      ?session dc:description ?description .
    }

Execution times clearly favor Sesame over Jena, as shown in the table below. The Jena documentation explains that a search is conducted for reuse of the rdf:ID element, which may cause a slower response when reading large files.

    Query 1 execution    Jena    Sesame
    1                    2172    656
    2                    2094    625
    3                    2125    687
    4                    2062    625
    5                    2031    641

The same situation occurs for the second query, which searches a file containing information about semantic web tools [6], though the timing difference is smaller.

    SELECT ?nume ?url ?limbaj
    WHERE {
      [ g:label ?nume ;
        g:URL ?url ;
        g:FOSS ?foss ;
        g:Category ?categ ;
        g:Language ?limbaj ] .
      FILTER( ?foss = 'Yes' && ?categ = 'Database/Datastore' && (?limbaj = 'PHP' || regex(?limbaj, '^C')) ) .
    }
    ORDER BY ?limbaj

    Query 2 execution    Jena    Sesame
    1                    1860    656
    2                    1875    672
    3                    1891    672
    4                    1875    688
    5                    1813    688

Running both queries in sequence shows that Jena's execution speed improves as more queries are made, approaching Sesame's performance.

    Combined execution   Jena Q1   Jena Q2   Sesame Q1   Sesame Q2
    1                    2110      234       765         188
    2                    2782      250       985         265
    3                    3063      406       1156        250
    4                    2156      187       719         187
    5                    2251      265       735         203

Our initial tests were somewhat different because we used a Sesame repository object with inferencing, although there was no need for it. In that case, Sesame's performance decreased, but it still outran Jena on single query execution. However, multiple query execution confirmed that Jena can perform better in such cases.

In conclusion, we see the Jena RDF API as an easier starting point for most programmers, though it might not be as complex as the Sesame framework.

References

1. Jena website, http://jena.sourceforge.net/documentation.html
2. Sesame website, http://www.openrdf.org/documentation.jsp
3. Hewett Research, http://www.hewettresearch.com/svcc2009/
4. "How to use the Sesame Java API to power a Web or Client-Server Application", http://answers.oreilly.com/topic/447-how-to-use-the-sesame-java-api-to-power-a-web-or-client-server-application/
5. "Jena, A Java API for RDF", http://www.docstoc.com/docs/13042314/Jena-----A-Java-API-for-RDF
6. Sweet RDF file, http://profs.info.uaic.ro/~busaco/teach/courses/wade/demos/sparql/sparql.zip
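
As a supplementary check on the timing tables in Section 4, the small program below averages the five Query 2 runs for each API, once from the stand-alone table and once from the combined table, and prints the Jena-to-Sesame ratio. The numbers are copied from the tables as reported (the paper does not state the unit); the class name `TimingSummary` and the choice of the mean as summary statistic are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.Locale;

/** Illustrative summary of the Query 2 timings reported in Section 4. */
final class TimingSummary {
    /** Arithmetic mean of the given samples. */
    static double mean(int... samples) {
        return Arrays.stream(samples).average().orElseThrow();
    }

    static String describe(String label, double jena, double sesame) {
        return String.format(Locale.ROOT, "%s: Jena %.1f, Sesame %.1f, ratio %.2f",
                label, jena, sesame, jena / sesame);
    }

    public static void main(String[] args) {
        // Query 2 executed on its own.
        double jenaSingle = mean(1860, 1875, 1891, 1875, 1813);
        double sesameSingle = mean(656, 672, 672, 688, 688);

        // Query 2 executed right after Query 1 (combined table).
        double jenaCombined = mean(234, 250, 406, 187, 265);
        double sesameCombined = mean(188, 265, 250, 187, 203);

        System.out.println(describe("Q2 alone", jenaSingle, sesameSingle));
        // Q2 alone: Jena 1862.8, Sesame 675.2, ratio 2.76
        System.out.println(describe("Q2 after Q1", jenaCombined, sesameCombined));
        // Q2 after Q1: Jena 268.4, Sesame 218.6, ratio 1.23
    }
}
```

On these figures the gap for Query 2 narrows from a factor of about 2.76 to about 1.23 when it follows Query 1, which is consistent with the observation that Jena approaches Sesame's performance over a sequence of queries. Five runs per configuration is a small sample, so the ratios are indicative only.
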