Publishing "5 star" data: the case for RDF


In the Open Data world we are encouraged to try to publish our data as “5-star” Linked Data because of the semantic richness and ease of integration that the RDF model offers. For many people and organisations this is a new world, and some learning and experimenting are required in order to gain the skills and experience needed to fully exploit this way of working with data. This workshop will re-assert the case for RDF and provide a guided tour of some examples of RDF publication that can act as a guide to those making a first venture into the field.

Published in: Business, Education

  1. Isn’t it awful when we’re trying to communicate and we’re misunderstood? Not only can it lead to problems as a direct result of the misunderstanding, but there can also be quite a bit of hassle in getting things straightened out after the mistake. In the case of Ginger and Fred [1] it nearly became a showstopper.

     Data sharing within and between enterprises has always been a costly effort. The Open Group reckon “... that between 40% and 80% of application integration effort is spent on resolving semantic issues, a task that typically requires significant human intervention. The expanding use of Service Oriented Architecture (SOA) and Cloud Computing are further increasing the need for semantic interoperability that more efficiently aligns IT systems with business objectives.” [2] Naturally, the Open Data programmes have similar issues. This is where RDF is playing a key role, both inside and outside enterprises.

     There is a lot of talk in the ‘Open Data’ world about “5-star” RDF data, which implies a ranking of data models by merit, so why is RDF ‘tops’? RDF is also key to the “Semantic Web”, also described as “Web 3”, the next generation of web technology. We are also hearing about the “internet of things”, and RDF plays a significant role there.
  2. So what is there for me, my business, my organisation in considering using data modelled as RDF?

     RDF (Resource Description Framework) originated in the 1990s as a way of adding metadata to XML documents, but it is also a very tidy way of describing any data. RDF is a model in which data are expressed as triples, each comprising a Subject and an Object related by a directional Predicate.

     From a little after the Ginger and Fred era until about the late 1990s, interchanges of computerised data tended to follow detailed discussion and agreement between two parties about the data they were exchanging, the model systems of the provider and recipient, mappings, semantic relations and so on. Data exchanges required extraction, transformation and loading (ETL) stages, and this is still the case in many situations. The RDF model removes much of the heavy lifting required in traditional ETL.

     “One often overlooked advantage that RDF offers is its deceptively simple data model. This data model trivializes merging of data from multiple sources and does it in such a way that data about the same things gets collated and de-duplicated. In my opinion this is the most important benefit of using RDF over other open data formats.” (Ian Davis, 2011) [3]

     Another approach to data integration, particularly within enterprises, is to develop one big pot into which all the organisation’s key data fits: the enterprise data warehouse. The problem with this is brittleness: the warehouse takes an age to design, and then one can only fill it with items that it was built to contain. Anything that is the wrong ‘shape’ has to be either rejected, refashioned (ETL), or else be so important as
  3. to be worth the cost of adding an extra wing to the warehouse to cater for it. That means lots of analysis, coding and so on.

     In the RDF world the containers for native RDF are promiscuous, accepting any RDF and not just that which fits a particular schema or pattern. Adding new RDF statements with relationships that are not currently present in the dataset does not require the sort of preparatory work needed in the relational model, such as the addition of new join tables to the database.

     In conventional data management situations the semantics of the data tend to be observed in the interface and expressed in the documentation. RDF, in contrast, has explicit semantics for the relationships between entities or from an entity to a literal, and also provides a mechanism to build in the descriptions of individual classes of entities and descriptions of the dataset.

     The final aspect of RDF that I want to focus on as a potential benefit is that RDF data is a ‘graph’ of nodes and edges which, when visualised as a set of circles and arrows, for example, has a particular shape within which one can see clustering and sparseness in ways that are difficult to achieve with other models.

     So RDF data is “5 star” because I don’t need to have a dialogue with a range of data providers to unambiguously join together their datasets into a “supergraph” that I can then work with. I don’t necessarily need to modify the container that I use beforehand in order to tailor it to accept data: RDF models can be merged automatically in the absence of a schema. Datasets are, at least to a minimal level, self-describing in that they automatically detect what is the same and what is different,
  4. what items are entities and what are properties/relationships. Data becomes collated and de-duplicated automatically.

     In addition to these arguments in favour of the RDF model for data interchange, the increasing availability of open RDF Linked Data means that organisations that are not using these approaches are missing out: they cannot make effective use of openly available RDF from multiple sources in its native, efficient form. They will have to reduce it to a semantically less rich form (JSON, CSV, etc.), and this requires ETL steps and so on prior to use.

     So I am going to give a tour of some resources and illustrations that might be of help to those individuals and organisations wanting to get started in the RDF world.

     First step: education/training at scale

     The Euclid Project [ ]

     Second step: starting to work with RDF publication

     RDF is an efficient way to merge datasets from multiple sources unambiguously and relatively automatically. Think of it as an equivalent to RNA in biology, in that the main store of genetic information is held securely in another form (generally DNA, but also negative strand or double stranded RNA) but for working purposes (building proteins) that data is converted to a globally shared biological form, RNA. In the same way it is possible to hold data all the time as RDF and work effectively with it, but there are many situations where that isn’t optimal – either for historic reasons, in that there is existing infrastructure that works effectively on other data models, or – as in
  5. highly transactional systems – the RDF approach isn’t suited to the normal operations.

     Hand written/scripted

     One easy way to create some RDF is to write it by hand or with some simple scripts. This is often used for learning about RDF or for developing new ideas. XTurtle [4] is an Eclipse plugin that makes this job much smoother. Generating RDF with scripts can be done using a templating approach but, as with constructing XML using specific tools, there is a range of RDF tools for constructing RDF statements in code: Java has the Jena [5] and Sesame [6] APIs, Python has RDFLib [7], Ruby has the RDF.rb [8] gem, and several scripting languages have bindings for the Redland [9] library, which is written in C.

     OpenRefine [10] + RDF plugin [11]

     This is an application for cleaning up and converting a range of data types, including delimited text and spreadsheets, into RDF. There are also options to use ‘reconciliation services’, which are APIs that provide best-guess suggestions for widely used URIs for entities based on text in your data. These reconciliation services come from Freebase [12] (Google) and also, recently, the Ordnance Survey [13].

     Conversion from relational databases

     There are several ways in which RDF can be published from relational databases. This is often a good way to get RDF out of an existing system with minimal hassle, and it might be a low-risk way of getting into publishing some of your data as RDF.
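As a rough illustration of the scripted route and of pulling RDF out of a relational system, here is a minimal sketch using only the Python standard library. It is not taken from the talk: the `office` table, the `http://example.org/` namespace, the property names and the sample values are all hypothetical. It templates N-Triples statements from database rows and then merges them with a second, hand-written set of triples, so that the duplicate statement collapses on merge.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, chosen for illustration


def nt_literal(value):
    """Render a value as a plain N-Triples string literal with basic escaping."""
    text = str(value)
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, escaped)
    return '"' + text + '"'


def rows_to_triples(conn):
    """Turn each row of the (hypothetical) office table into two triples."""
    triples = set()
    for office_id, name, town in conn.execute("SELECT id, name, town FROM office"):
        subject = f"<{BASE}office/{office_id}>"
        triples.add((subject, f"<{BASE}def/name>", nt_literal(name)))
        triples.add((subject, f"<{BASE}def/town>", nt_literal(town)))
    return triples


def to_ntriples(triples):
    """Serialise a set of (s, p, o) tuples as N-Triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in sorted(triples))


# Source 1: an existing relational table (made-up sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE office (id INTEGER PRIMARY KEY, name TEXT, town TEXT)")
conn.execute("INSERT INTO office VALUES (1, 'Head Office', 'Exampletown')")
from_database = rows_to_triples(conn)

# Source 2: hand-written triples; the first repeats a statement from source 1.
hand_written = {
    (f"<{BASE}office/1>", f"<{BASE}def/name>", nt_literal("Head Office")),
    (f"<{BASE}office/1>", f"<{BASE}def/phone>", nt_literal("555-0100")),
}

# Merging is a set union: identical statements are stored once.
merged = from_database | hand_written
print(to_ntriples(merged))
```

Because both sources use the same URI for the office, the merge needs no mapping step. A real project would more likely use one of the RDF libraries listed above than string templates.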
  6. Methods that use wikis

     Wikipedia and DBpedia: Information in Wikipedia factboxes eventually ends up as RDF data published by DBpedia during the biannual conversion process. DBpedia Live [14] attempts to keep abreast of the rapid rate of page updating on Wikipedia.

     Semantic MediaWiki [15] [16] [17] is an extension of the MediaWiki software used for Wikipedia, and has an underlying RDF model for its data. Semantic MediaWiki provides routes for exporting both subsets of wiki pages and the whole wiki content as RDF.

     Drupal 7.0 [18] outputs RDFa data in core. RDFa (Resource Description Framework in Attributes) is a W3C Recommendation which allows embedding RDF metadata within Web documents. These RDF assertions can be ‘gleaned’ from the web pages by stylesheets and other ‘distillers’. However, RDFa isn’t used much ‘in the wild’ at the moment.

     Relational to RDF mappers

     These act as “babelfish” to translate a relational database schema into an RDF model through a mapping procedure (the applications assist that process, but it often needs hand-finishing) and provide a query interface (i.e. these mapping applications create a SPARQL endpoint and return RDF, but the underlying data is maintained in a SQL database). This approach is ideal for providing RDF from an existing application which you don’t want to mess with but for which you want an RDF output. Examples of relational to RDF mapping software include D2R [19] and the polyglot storage server
  7. Virtuoso [20]. Examples of use of these tools include the TellMeScotland open data publication [21] and the EC Joinup pilot linking Belgian addressing data [22].

     Triplestores

     Native RDF can be stored either as a graph in memory, or within a native RDF triplestore, which is a database specifically designed for RDF graph structures.

     Native RDF databases a.k.a. Triplestores

     RDF can be stored as triples in SQL databases, using long skinny tables with three columns for the triples and various indexes (SPO, OSP, etc.). This is the approach taken with the Jena SDB datastore, but to some extent this was a naive/simplistic approach using tools like MySQL and Postgres that were readily available in the early 2000s. Subsequent work has focused on developing native RDF datastores that don’t use tables in the SQL sense but use node tables and indexes, where the focus has been on optimising both storage and search for the RDF model rather than making use of a more generalised data store. Examples include TDB, Sesame and Mulgara, all of which are Java applications, and 4store, which is built in C and only easily compiled on Linux. Other approaches include column stores (e.g. Virtuoso and Vertica).

     So, if you are looking for an easy way of installing and using a triplestore, what is the best approach? The Apache Jena “TDB” [23] triplestore with the Joseki [24] or Fuseki [25] SPARQL endpoint is one option I’ve used a lot. Other simple options that I’ve had some experience of include Sesame [26], Mulgara [27], Bigdata [28], Virtuoso [29] and 4store [30]; this is not a definitive list, and each has an associated SPARQL over HTTP query option.
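The “long skinny table” layout described on this slide can be sketched in a few lines with SQLite from the Python standard library. This is only an illustration of the general idea, not the actual schema used by Jena SDB or any other product; the table name, the index names and the `ex:` identifiers are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per triple. The UNIQUE constraint doubles as the SPO index and
# gives the table set semantics (a repeated statement is stored only once).
conn.execute(
    "CREATE TABLE triples ("
    " s TEXT NOT NULL, p TEXT NOT NULL, o TEXT NOT NULL,"
    " UNIQUE (s, p, o))"
)
# Extra orderings so that lookups by object or by predicate can use an index.
conn.execute("CREATE INDEX idx_osp ON triples (o, s, p)")
conn.execute("CREATE INDEX idx_pos ON triples (p, o, s)")

conn.executemany(
    "INSERT OR IGNORE INTO triples VALUES (?, ?, ?)",
    [
        ("ex:ginger", "ex:dancesWith", "ex:fred"),
        ("ex:fred", "ex:dancesWith", "ex:ginger"),
        ("ex:ginger", "ex:name", "Ginger"),
        ("ex:ginger", "ex:dancesWith", "ex:fred"),  # duplicate, ignored
    ],
)


def match(conn, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    clauses, params = [], []
    for column, value in (("s", s), ("p", p), ("o", o)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = "SELECT s, p, o FROM triples" + where + " ORDER BY s, p, o"
    return conn.execute(sql, params).fetchall()


print(match(conn, o="ex:fred"))          # who points at ex:fred?
print(match(conn, p="ex:dancesWith"))    # every dancesWith statement
```

Every query here is a scan or index lookup over one generic table, which hints at why the native stores mentioned above moved to storage and indexes designed around the RDF model itself.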
  8. Choice of triplestore will depend on your OS options (e.g. 4store is built from source and this is easiest with Linux), how much RDF you are storing (usually measured in millions/billions of triples), and the additional functions you need (e.g. geo indexing is available with Virtuoso, Parliament and a small number of others; AllegroGraph has social networking stats functions built in).

     One advantage of triplestores over SQL stores is that transferring data from one to another is simply a matter of outputting triples from one and loading them into another. Therefore picking the ‘wrong one’ to start with has limited negative consequences. There are VM images available for some of the geo-capable triplestores, a very useful resource. [31]

     Linked Data APIs

     When you have data in a triplestore you don’t want to just leave potential users with a SPARQL endpoint: it’s daunting and unhelpful to many potential users of your data. A Linked Data API is a much more pleasant decoration. A couple of examples are PublishMyData [32] (mainly Ruby, example at [33]) and Elda [34] (mainly Java, example at [35]).

     A Linked Data API provides a faceted HTML view of your data and also helps resolve URIs that have the base URI at your site to some HTML page. For example, the identifier for Victoria Quay is If you put this into your browser you get redirected to an HTML page about Victoria Quay: APIs also help the return of RDF describing resources in different machine
  9. readable formats, either by responding to the HTTP “Accept” header, or by handling HTTP 303 redirects appropriately, e.g.: returns N-Triples and returns RDF/XML representations for the same resource.

     FluidOps Workbench

     This is a hybrid tool [36] that provides a wiki interface for the creation of new content but also enables the import of data from various sources into a local Sesame triplestore. An example of the Workbench with Wikipedia/DBpedia data is at

     End of the tour

     This is the end of the brief tour. Next is a quick illustration of querying multiple SPARQL endpoints and merging data.

     SPARQL illustration of merging data from two sources

     Enter the following into the search pane of a SPARQL endpoint (one that is set up to allow SPARQL 1.1 federated searches). It selects landlocked countries from DBpedia and then looks in the World Bank dataset for some of their data with the same DBpedia country identifiers.

     # find the landlocked countries
     PREFIX rdfs: <>
  10. PREFIX dct: <>
      PREFIX type: <>
      PREFIX prop: <>

      SELECT ?country ?country_name ?capital ?population ?p ?x ?q ?w
      WHERE {
        SERVICE <> {
          ?country a type:LandlockedCountries ;
                   rdfs:label ?country_name ;
                   prop:populationEstimate ?population ;
                   prop:capital ?capital .
          FILTER ( lang(?country_name) = 'en' )
        }
        SERVICE <> {
          OPTIONAL {
            ?p ?x ?country .
            ?p ?q ?w .
          }
        }
      }
      LIMIT 10
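A query like the one above can also be sent from a script instead of a search pane. The sketch below uses only the Python standard library and follows the SPARQL 1.1 Protocol convention of passing the query text in a `query` parameter and asking for JSON results via the `Accept` header. The endpoint address `http://example.org/sparql`, the function names and the sample result document are assumptions made for illustration; substitute the endpoint you actually use.

```python
import json
import urllib.parse
import urllib.request

RESULTS_JSON = "application/sparql-results+json"


def build_sparql_request(endpoint, query):
    """Build a GET request carrying the query, asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": RESULTS_JSON})


def rows_from_results(document):
    """Flatten a SPARQL JSON results document into {variable: value} dicts."""
    bindings = document["results"]["bindings"]
    return [{name: cell["value"] for name, cell in row.items()} for row in bindings]


def run_query(endpoint, query, timeout=30):
    """Send the query over HTTP and return the flattened rows (needs network)."""
    request = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return rows_from_results(json.load(response))


# Hypothetical endpoint and a trivial query, used only to show the request shape.
request = build_sparql_request(
    "http://example.org/sparql", "SELECT * WHERE { ?s ?p ?o } LIMIT 1"
)
print(request.full_url)

# A made-up results document in the standard SPARQL JSON results layout.
sample = {
    "head": {"vars": ["country", "population"]},
    "results": {
        "bindings": [
            {
                "country": {"type": "uri", "value": "http://example.org/country/X"},
                "population": {"type": "literal", "value": "1000"},
            }
        ]
    },
}
print(rows_from_results(sample))
```

Variables left unbound by an OPTIONAL block are simply absent from a row in the JSON results, so the flattened dictionaries may have different keys from row to row.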