RDF and Java


The Web is a universal medium for information, data and knowledge exchange. The Semantic Web is an extension of the World Wide Web, “in which information is given well-defined meaning, better enabling computers and people to work in cooperation” [1]. RDF, together with SPARQL, provides a powerful mechanism for describing and interchanging metadata on the Web. This paper briefly presents the two concepts, RDF and SPARQL, and three of the most popular frameworks (written in Java) that offer support for RDF: Jena, Sesame and JRDF.

Monica Macoveiciuc and Constantin Stan
Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi

RDF and SPARQL

1 What is RDF?

RDF (Resource Description Framework) is the W3C standard for encoding knowledge. It is a structure for describing and interchanging metadata on the Web, in numerous forms and for numerous purposes. RDF provides a consistent framework and syntax for describing and querying data, and it makes sharing website descriptions easy. RDF’s family of specifications is quite complex and difficult to manage, which is why using its full capabilities is not always easy.

RDF offers a model for describing resources, which have properties (attributes or characteristics). Any object that is uniquely identifiable by a URI (Uniform Resource Identifier) is considered a resource. Resources have properties associated with them; these properties are identified by property-types, which in turn have associated values. Property-types define the relations between values and resources. The values may be atomic or may be other resources (which can themselves have properties). A group of properties that belong to the same resource is called a description.

The core of RDF is the triple described above: only three pieces of information are needed to fully define a bit of knowledge. These are the resource (or subject), which is the thing being described and is identified by a URI; the property-type (or predicate), such as a relationship, an attribute or a characteristic; and the value of that property-type for the resource (the object). An RDF triple records these three pieces of information in a consistent manner, so that, ideally, the same data can be consumed by both humans and machines. This allows human meaning and understanding to be interpreted consistently and mechanically.
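
As a rough illustration of the model described above, the following Java sketch stores triples as plain values and groups them by subject to form descriptions. The Triple record, the descriptions helper and the example.org URIs are assumptions made for this sketch; they are not part of any RDF library. Java 16 or later is assumed (for records).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: a plain-Java model of an RDF triple, not a library API.
final class TripleSketch {

    // One statement: subject (a URI), predicate (a property-type URI),
    // object (an atomic value or the URI of another resource).
    record Triple(String subject, String predicate, String object) {}

    // Groups triples by subject; each group is the "description" of that resource.
    static Map<String, List<Triple>> descriptions(List<Triple> triples) {
        return triples.stream().collect(
                Collectors.groupingBy(Triple::subject, LinkedHashMap::new, Collectors.toList()));
    }

    // Hypothetical sample data; the example.org URIs are placeholders.
    static List<Triple> sample() {
        String doc = "http://example.org/docs/report";
        String author = "http://example.org/people/author";
        return List.of(
                new Triple(doc, "http://example.org/terms/title", "Annual Report"),
                new Triple(doc, "http://example.org/terms/creator", author),
                new Triple(author, "http://example.org/terms/name", "A. Author"));
    }

    public static void main(String[] args) {
        descriptions(sample()).forEach((subject, properties) -> {
            System.out.println(subject);
            properties.forEach(t -> System.out.println("  " + t.predicate() + " -> " + t.object()));
        });
    }
}
```

Note that the second triple uses another resource as its value, while the other two use atomic values, which matches the two kinds of property values mentioned in the text.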

For example, let’s consider these three sentences:

    I have a name, which is Monica Macoveiciuc.
    I have a gender, which is female.
    I have a job, which is programmer.

We can quickly identify the triple discussed earlier within these sentences:

    I (subject) have a name (property), which is Monica Macoveiciuc (property value).
    I (subject) have a gender (property), which is female (property value).
    I (subject) have a job (property), which is programmer (property value).

There are many ways to represent a triple. For example, we can use the 3-tuple representation (subject, predicate, object). Applied to the examples above, we get:

    {I, name, Monica Macoveiciuc}
    {I, gender, female}
    {I, job, programmer}

The above is just one way of serializing RDF data. The formal way to represent this data is the directed labeled graph. Two main reasons were considered when this method was chosen as the default representation: the graphs are extremely easy to read (there can be no confusion about the elements of the 3-tuple or about the statements being made), and there are some RDF data models that can be represented using RDF graphs but not in RDF/XML. The graph is a set of nodes connected by arcs, which form a node-arc-node pattern. There are three types of nodes: blank nodes, literals and URI references (urirefs).

RDF requires a syntax that represents this model, in order to store instances of the model in machine-accessible and machine-readable files and to communicate these instances among applications. The answer to this requirement is XML. In order for XML to support the consistent representation of semantics, RDF imposes a formal structure on it. To keep identifiers unique, RDF uses the namespace mechanism (which is part of the XML technology). The RDF Schema acts as a bootstrapping mechanism for the declaration of the vocabulary needed to express the data model. Elements such as RDF:RDF or RDF:Description have specific meanings. Both belong to the same namespace: RDF.
For example, the RDF:RDF tag marks the boundaries within an XML document where the content is intended to fit an RDF data model instance, and the RDF:Description tag is designed to reflect the corresponding data model. The constraints imposed by RDF are there to support the consistent encoding and exchange of standardized metadata defined by different communities.

2 What is SPARQL?

SPARQL (pronounced “sparkle”; a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language. It is a recent W3C Recommendation, about which Sir Tim Berners-Lee said that it “will make a huge difference”. RDF is foundational to the Semantic Web. Until SPARQL’s launch, RDF had a data model, a formal semantics and a concrete serialization (in XML), but it did not have a standard query language.
SPARQL now offers the Semantic Web and Web 2.0 a common data manipulation language, in the form of expressive queries against the RDF data model. Using WSDL 2.0, the SPARQL Protocol for RDF describes a very simple web service with one operation, query, which is available with both HTTP and SOAP bindings. This operation is how SPARQL queries are sent to other sites and how the results are returned. The HTTP bindings are REST-friendly, and a simple SPARQL protocol client takes little code to implement.

SPARQL consists of three separate specifications. The first is the query language specification (which makes up the core). The second is the query results XML format (which describes an XML format for serializing the results of SPARQL SELECT and ASK queries). The third is the data access protocol (which uses WSDL 2.0 to define simple SOAP and HTTP protocols for remotely querying RDF databases, or any data repository that can be mapped to the RDF model). Altogether, it consists of a query language, a means of conveying a query to a query processor service, and the XML format in which the results are returned.

Some issues are not yet addressed by SPARQL. The most notable is that it cannot modify an RDF dataset (it is read-only). As mentioned previously, RDF is built on the triple pattern (a 3-tuple consisting of subject, predicate and object), and SPARQL is built on the same pattern. SPARQL matches patterns in an RDF graph using triple patterns, which are like triples except that they may contain variables in place of concrete values (the variables act as “wildcards” that match RDF terms in the dataset).

The SELECT query can be used to extract data from an RDF graph, returning it as a tabular result set. For more complex graph patterns, one can combine required and OPTIONAL data.
UNION queries offer a way of selecting alternatives from the dataset. It is possible to apply ordering to the results, skip forward through the results using OFFSET, and LIMIT the amount of data returned. The SPARQL Query Results XML Format specification includes several relevant examples. Given its simplicity and regular structure, manipulating this format with XSLT or XQuery is fairly straightforward. The syntax shortcuts make writing queries much simpler; they are especially useful with repetitive graph patterns and long URIs. SPARQL presents itself as the missing and long-awaited part of the Semantic Web and Web 2.0.
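
To make the idea of a triple pattern more concrete, here is a minimal Java sketch of matching a single pattern against a list of triples, with OFFSET and LIMIT applied to the resulting rows. It is a toy written for this text, not a SPARQL engine: the convention that variables start with “?”, and the names select, bind and sample, are assumptions of the sketch. Java 16 or later is assumed (for records).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy matcher for one triple pattern, not a SPARQL engine.
final class PatternSketch {

    record Triple(String subject, String predicate, String object) {}

    // Terms starting with '?' are variables; anything else must match exactly.
    static boolean isVariable(String term) {
        return term.startsWith("?");
    }

    // Binds one pattern term to one data term; returns false on a mismatch.
    // A variable that is already bound must be bound to the same value again.
    static boolean bind(String patternTerm, String dataTerm, Map<String, String> bindings) {
        if (!isVariable(patternTerm)) {
            return patternTerm.equals(dataTerm);
        }
        String previous = bindings.putIfAbsent(patternTerm, dataTerm);
        return previous == null || previous.equals(dataTerm);
    }

    // Returns one variable-binding map per matching triple, like the rows of a SELECT result.
    static List<Map<String, String>> select(List<Triple> data, Triple pattern, int offset, int limit) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (Triple t : data) {
            Map<String, String> row = new LinkedHashMap<>();
            if (bind(pattern.subject(), t.subject(), row)
                    && bind(pattern.predicate(), t.predicate(), row)
                    && bind(pattern.object(), t.object(), row)) {
                rows.add(row);
            }
        }
        // OFFSET skips leading rows; LIMIT caps how many rows are returned.
        int from = Math.max(0, Math.min(offset, rows.size()));
        int to = (int) Math.min((long) from + Math.max(0, limit), rows.size());
        return new ArrayList<>(rows.subList(from, to));
    }

    // The three example triples from the RDF section.
    static List<Triple> sample() {
        return List.of(
                new Triple("I", "name", "Monica Macoveiciuc"),
                new Triple("I", "gender", "female"),
                new Triple("I", "job", "programmer"));
    }

    public static void main(String[] args) {
        // Schematically: select ?p and ?o where { I ?p ?o }, skipping 1 row and returning at most 2.
        System.out.println(select(sample(), new Triple("I", "?p", "?o"), 1, 2));
    }
}
```

A real query processor combines many such patterns and joins their bindings; the sketch stops at a single pattern to keep the wildcard behaviour visible.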

Java APIs for RDF

There are many frameworks for processing RDF available to Java programmers. Some of them also offer support for SPARQL querying. This paper presents three of the most popular frameworks: Jena, Sesame and JRDF.

3 Jena

3.1 The Model

Jena uses the concept of a graph for dealing with the data: the nodes correspond to URIs, while the edges are the triples. The graphs are represented through the Model interface, which has different implementations: a memory-based one, one that uses a relational database, and so on. The memory-based model is the simplest and easiest to use.

A triple is represented through an interface called Statement. A statement corresponds to an edge in the graph and consists of three parts:
  – the subject - the resource from which the arc leaves - implements the Resource interface;
  – the predicate - the property (the label of the arc) - implements the Property interface;
  – the object - the resource that the arc points to - implements the Resource or the Literal interface.
The components of the statement have a common base - the RDFNode interface.
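
The relationships between these roles can be sketched in a few lines of Java. The interfaces below are simplified stand-ins written for this text; they reuse the names from the description above for readability but are not Jena’s interfaces, which define many more methods. Treating Property as a kind of Resource is an assumption of the sketch, as are the Uri, Prop and Text records. Java 16 or later is assumed (for records).

```java
// Illustrative only: simplified stand-ins for the roles described above, NOT Jena's API.
final class StatementSketch {

    // Common base of everything that can appear in a statement.
    interface RDFNode { String label(); }
    interface Resource extends RDFNode {}
    interface Property extends Resource {}   // assumption: a property is itself a resource
    interface Literal extends RDFNode {}

    // Hypothetical minimal implementations.
    record Uri(String label) implements Resource {}
    record Prop(String label) implements Property {}
    record Text(String label) implements Literal {}

    // An edge of the graph: subject --predicate--> object.
    // The object is typed as RDFNode because it may be a Resource or a Literal.
    record Statement(Resource subject, Property predicate, RDFNode object) {
        boolean hasLiteralObject() {
            return object instanceof Literal;
        }
    }

    static Statement sample() {
        return new Statement(
                new Uri("http://localhost:8080/George"),
                new Prop("http://example.org/terms/name"),   // placeholder property URI
                new Text("George"));
    }

    public static void main(String[] args) {
        Statement s = sample();
        System.out.println(s.subject().label() + " " + s.predicate().label() + " " + s.object().label());
        System.out.println("literal object: " + s.hasLiteralObject());
    }
}
```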
The object component is more complex. A statement can be used as the object component of a triple, since RDF allows nested statements. Objects implementing the Container, Alt, Bag or Seq interfaces can also be used as objects. A resource is declared as follows:

    // Create an in-memory model, then a resource identified by its URI.
    Model model = ModelFactory.createDefaultModel();
    String resourceURL = "http://localhost:8080/George";
    Resource person = model.createResource(resourceURL);

The ModelFactory method createDefaultModel() creates a memory-based model, which is then used for creating a resource. This is done by calling the createResource method, to which we provide the URI of the resource. The Jena API contains constant classes for some well-known schemas, such as RDF and RDF Schema, Dublin Core and DAML. Adding the Formatted Name property of the vCard file format can be done easily:

    person.addProperty(VCARD.FN, "George");

An RDF Model is represented as a set of statements. The components of a statement can be accessed through the getSubject, getPredicate and getObject methods of the Statement interface. The API provides methods for the most common operations:
  – addProperty - adds a new statement (triple) to the model;
  – listSubjects - lists the subject component of each triple in the model;
  – listObjects - lists the object component of each triple in the model;
  – write - writes the model in RDF/XML format to the output stream given as parameter;
  – read - reads the statements in RDF/XML format into a model.

The Jena2 persistent storage subsystem implements an extension of the Model interface that provides transparent persistence for models through the use of a database engine. Implementations for MySQL, HSQLDB, PostgreSQL, Oracle and Microsoft SQL Server are provided, and other databases have been added by third parties.

TDB and SDB are two components of Jena that provide large-scale storage and querying of RDF datasets. SDB is a system that uses relational databases for storage of RDF and OWL.
It supports many open-source and commercial databases, including MySQL, PostgreSQL, Oracle 11g, Microsoft SQL Server and IBM DB2. It scales to graphs of 100 million triples. TDB is a non-transactional, faster database solution for use by a single system. It scales well beyond SDB and is simpler to set up.

3.2 Inferences

SPARQL is implemented in Jena through the ARQ package, and queries may be made from Java code or via a SPARQL client distributed with Jena.
The package that offers SPARQL support is com.hp.hpl.jena.query. There are four types of queries supported by the Jena classes: SELECT, ASK, DESCRIBE and CONSTRUCT. An ASK query returns “yes” if the query’s graph pattern has any match in the dataset and “no” otherwise. A DESCRIBE query returns a graph containing information related to the nodes matched by the graph pattern. A CONSTRUCT query builds an RDF graph by instantiating a template for each solution of the query.

For running a query, one needs:
  – a Query object, obtained through the create method of QueryFactory;
  – a QueryExecution object, obtained through QueryExecutionFactory;
  – an execute method, depending on the type of the query.
Each result is provided as a QuerySolution object, and a ResultSet can be used to iterate over the solutions. The results can be refined through the SPARQL options DISTINCT, LIMIT, OFFSET and ORDER BY, through optional and alternative matches, and through filters.

Jena offers support for working with multiple graphs. The DatasetFactory class can be used to specify programmatically the named graphs to be queried.

4 Sesame

4.1 The Model

Like Jena, Sesame uses a graph model for the resources. URIs are nodes, and each triple is a pair of edges (an edge from subject to predicate, and an edge from predicate to object). A central concept in Sesame is the Repository. A repository is an abstraction of a storage container for RDF data. This can mean Java objects in memory, or it can mean a relational database. Virtually all operations in Sesame happen with respect to a repository: the repository is the provider of persistence and querying capability.

The Graph API provides a representation of an RDF graph in the form of a Java object. The Graph object is used to store the triples. In order to add statements to the graph, one must obtain a ValueFactory object from the Graph:

    // Create an in-memory graph and obtain the factory used to build its values.
    Graph graph = new org.openrdf.model.impl.GraphImpl();
    ValueFactory factory = graph.getValueFactory();
Adding a statement is done similarly to Jena:

    // Subject and predicate share one namespace; the object is a plain literal.
    String namespace = "http://localhost:8080/human#";
    URI subject = factory.createURI(namespace, "person");
    URI predicate = factory.createURI(namespace, "hasName");
    Literal object = factory.createLiteral("George");
    graph.add(subject, predicate, object);

Sesame offers the possibility of running SeRQL CONSTRUCT queries in order to create and update graphs. Another capability of the framework is adding graphs to and removing graphs from a repository.

SAIL (Storage And Inference Layer) is Sesame’s abstraction from the storage format used, and it also provides reasoning support. In the persistence layer, there are SAIL implementations for PostgreSQL, MySQL, SQL Server and Oracle databases. SAIL can be used to implement concurrent access handling and caching. Each Sesame repository has its own SAIL object to represent it. The SAIL abstraction defines a few operations, such as adding and removing triples, starting and committing transactions, and clearing the repository.

4.2 Inferences

Sesame does not offer support for SPARQL, but it does include a new RDF/RDFS query language, SeRQL. SeRQL stands for “Sesame RDF Query Language”. It combines the best features of other query languages (RQL, RDQL, N-Triples, N3) and adds some of its own. Its most important features include RDF Schema support, XML Schema datatype support, graph transformation and optional path matching.

SPARQL and SeRQL are quite similar: they both support advanced path expressions such as branching and chaining, optional paths and partial matching of the target graph. SeRQL allows SELECT, CONSTRUCT and DESCRIBE query types, and their functionality is similar to that provided by SPARQL. As far as set operations are concerned, SPARQL is limited, UNION being the only operation allowed. SeRQL offers support for more operations:
  – union - UNION;
  – intersection - INTERSECT;
  – difference - MINUS.
The operators IN, ANY, ALL and EXISTS, as well as nested queries, are other features supported by SeRQL. Limitations of SeRQL include the lack of an ORDER BY clause and of support for regular expressions.

5 JRDF

5.1 The Model

JRDF (Java RDF Binding) is an attempt to create a standard set of APIs and base implementations for RDF using Java. It is based on existing libraries, such as Jena, Sesame, Aquamarine and Sergey Melnik’s RDF API. Unlike the other frameworks, JRDF tries to deal with most of the aspects that are useful for Java programmers and to ensure a high degree of modularity. It includes a default memory implementation that can be used in conjunction with Mulgara to provide a scalable RDF solution.

Like Jena and Sesame, JRDF offers a graph-based view of the RDF data. The Graph interface is used for the representation of the graph. A graph consists of RDF structures such as triples, literals and URI references. A graph is created as follows:

    // Obtain an in-memory graph and the factory that creates its nodes.
    JRDFFactory factory = SortedMemoryJRDFFactory.getFactory();
    Graph graph = factory.getGraph();
    GraphElementFactory elementFactory = graph.getElementFactory();
    // A URI reference may be used as subject, predicate and object alike.
    Node node = elementFactory.createURIReference(URI.create("urn:node"));
    graph.add(node, node, node);

The methods provided by the API allow adding, removing and finding triples. The components of the triple - the subject, the predicate and the object - have a common base: the Node interface. This is the top of the class hierarchy of the JRDF model. Node is subclassed by the positional nodes: Subject, Predicate and Object. These are in turn subclassed by other types of node, such as URI, Literal and bnode (the blank node).
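
The positional-node idea can be sketched in plain Java as follows. This is a minimal illustration written for this text, not JRDF’s API: every type and method name here (SubjectNode, PredicateNode, ObjectNode, UriRef, Blank, Lit, and the toy Graph with add, remove and find) is an assumption of the sketch, as is the use of null as a wildcard in find. It relies on the RDF rule that literals may appear only as objects and blank nodes only as subjects or objects. Java 16 or later is assumed (for records).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: positional node types checked by the compiler, NOT JRDF's API.
final class PositionalNodeSketch {

    interface Node {}
    interface SubjectNode extends Node {}
    interface PredicateNode extends Node {}
    interface ObjectNode extends Node {}

    // A URI reference may occupy any position; a blank node only subject or object;
    // a literal only the object position.
    record UriRef(URI uri) implements SubjectNode, PredicateNode, ObjectNode {}
    record Blank(String id) implements SubjectNode, ObjectNode {}
    record Lit(String value) implements ObjectNode {}

    record Triple(SubjectNode subject, PredicateNode predicate, ObjectNode object) {}

    // A toy in-memory graph supporting the three operations named in the text.
    static final class Graph {
        private final Set<Triple> triples = new LinkedHashSet<>();

        boolean add(SubjectNode s, PredicateNode p, ObjectNode o) {
            return triples.add(new Triple(s, p, o));
        }

        boolean remove(SubjectNode s, PredicateNode p, ObjectNode o) {
            return triples.remove(new Triple(s, p, o));
        }

        // A null argument means "match anything" in that position.
        List<Triple> find(SubjectNode s, PredicateNode p, ObjectNode o) {
            List<Triple> found = new ArrayList<>();
            for (Triple t : triples) {
                if ((s == null || s.equals(t.subject()))
                        && (p == null || p.equals(t.predicate()))
                        && (o == null || o.equals(t.object()))) {
                    found.add(t);
                }
            }
            return found;
        }
    }

    public static void main(String[] args) {
        UriRef person = new UriRef(URI.create("urn:person"));
        UriRef name = new UriRef(URI.create("urn:name"));
        Graph graph = new Graph();
        graph.add(person, name, new Lit("George"));
        graph.add(new Blank("b1"), name, new Lit("Anonymous"));
        // graph.add(new Lit("George"), name, person);  // rejected by the compiler: Lit is not a SubjectNode
        System.out.println(graph.find(null, name, null).size());
    }
}
```

The benefit of typing nodes by position is that an ill-formed triple, such as one with a literal subject, is caught at compile time rather than at run time.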

There are four JRDF Graph implementations:
  1. The memory graph - it is included in the JRDF jar and is useful for small graphs.
  2. The server-side JRDF Graph - it is a server-side interface provided by Mulgara. The graph is created in the JVM and can be used for direct access to the database using a graph API.
  3. The client JRDF Graph - Mulgara provides a client-side JRDF graph interface for accessing a model, which is a scalable solution for remote client applications.
  4. The iTQL graph - this is a read-only graph that can be created from the results of an iTQL query (iTQL is used for retrieving data from and updating Mulgara databases). This offers the possibility of displaying the results as a subgraph.

5.2 Inferences

JRDF contains an implementation of SPARQL, although it is not complete. However, the API does offer support for developing a powerful query engine. Such an implementation (based on JRDF) requires a mapping between RDF and the relational model. One approach is to use a modified relational algebra to represent the JOIN, UNION and OPTIONAL operations. This algebra must support untyped relations and operations, which must be defined to work with tuples of differing attributes, so as to cover all the possible types that a tuple can contain.

6 Support, Documentation and Licensing

Jena, Sesame and JRDF are all cross-platform, and they are available under BSD-style licenses. However, Jena seems to be the most popular of these solutions. This is because it provides a robust API and strong support for reasoning, along with good documentation and support for developers.

The Jena documentation page contains the public API, together with a tutorial and a FAQ section. Great attention is paid to practical examples: many HowTos are included, covering a large area of interest, from creating models to concurrency and locking issues. Other resources, such as SPARQL, are presented with useful links. There is also a mailing list (jena-dev) and a large developer community built around the project. The Jena website includes a user contributions page, which contains examples provided by Jena users.

The Sesame documentation is comparable to that provided by Jena. A user manual describes each part of the framework in detail, with examples. The documentation section includes some tutorials, FAQs and links to external resources. There are also some mailing lists and an old forum (no longer functional). Users can also report bugs and problems through an issue tracker.

JRDF offers less support for developers than the other two frameworks. A wiki section contains some basic descriptions and examples. Javadocs are available for six releases of the project, providing a good way of tracking the changes. There is also a mailing list, as well as some links to related publications.

Conclusion

All three frameworks are mature enough to support complex applications. Each is stronger than the others in certain respects, and it is the user who should decide which API best covers the application’s needs. One criterion to take into account is the query language that the application needs to use, since Sesame does not support SPARQL (although it does come with its own solution) and JRDF’s SPARQL implementation is incomplete. Sesame provides support for scripting languages (Perl, PHP5), which can be really useful. JRDF is a good example of good practice, trying to use standard Java conventions.

References

[1] Berners-Lee, T.; Hendler, J.; Lassila, O.: The Semantic Web. Scientific American Magazine (March 26, 2008)
[2] Powers, Shelley: Practical RDF. O’Reilly 2003
[3] http://jena.sourceforge.net/
[4] http://www.xml.com/pub/a/2001/01/24/rdf.html
[5] http://www.ibm.com/developerworks/xml/library/j-sparql/
[6] http://www.openrdf.org/documentation.jsp
[7] http://www.dlib.org/dlib/may98/miller/05miller.html
[8] http://www.oreillynet.com/xml/blog
[9] http://www.xml.com/pub/a/2005/11/16/introducing-sparql-querying-semantic-web-tutorial.html
[10] http://www.w3.org/TR/rdf-sparql-query/
[11] http://en.wikipedia.org/wiki/SPARQL