Publishing "5 star" data: the case for RDF

In the Open Data world we are encouraged to try to publish our data as “5-star” Linked Data because of the semantic richness and ease of integration that the RDF model offers. For many people and organisations this is a new world and some learning and experimenting is required in order to gain the necessary skills and experience to fully exploit this way of working with data.  This workshop will re-assert the case for RDF and provide a guided tour of some examples of RDF publication that can act as a guide to those making a first venture into the field.


Speaker notes:
  • Isn’t it awful when we’re trying to communicate and we’re misunderstood? Not only can the misunderstanding itself cause problems, but there can also be quite a bit of hassle in getting things straightened out afterwards. In the case of Ginger and Fred it nearly became a showstopper.
  • Data sharing within and between enterprises has always been a costly effort. The Open Group reckons “... that between 40% and 80% of application integration effort is spent on resolving semantic issues, a task that typically requires significant human intervention. The expanding use of Service Oriented Architecture (SOA) and Cloud Computing are further increasing the need for semantic interoperability that more efficiently aligns IT systems with business objectives.” Naturally, Open Data programmes face similar issues. This is where RDF is playing a key role, both inside and outside enterprises.
  • There is a lot of talk in the ‘Open Data’ world about “5-star” RDF data, which suggests a hierarchy of merit among data models, so why is RDF ‘tops’? RDF is also key to the “Semantic Web”, also described as “Web 3.0”, the next generation of web technology, and it plays a significant role in the “Internet of Things”. So what is in it for me, my business, my organisation in considering data modelled as RDF? RDF (Resource Description Framework) originated in the 1990s as a way of adding metadata to XML documents, but it is also a very tidy way of describing any data. RDF is a model in which data are expressed as triples comprising a Subject and an Object related by a directional Predicate.
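The triple model can be sketched in plain Python, using the Scotland/Aberdeen example from the slides; the URIs and property names here are illustrative assumptions, not real identifiers:

```python
# A minimal sketch of the RDF triple model: data as a set of
# (subject, predicate, object) statements. Objects can be other
# entities or literal values. All URIs below are placeholders.

SCOTLAND = "http://example.org/id/scotland"
ABERDEEN = "http://example.org/id/aberdeen-city"

triples = {
    # "Scotland has an Authority that is Aberdeen City" (entity -> entity)
    (SCOTLAND, "http://example.org/def/authority", ABERDEEN),
    # "Aberdeen City has a Population with value 218,220" (entity -> literal)
    (ABERDEEN, "http://example.org/def/population", "218,220"),
}

for s, p, o in sorted(triples):
    print(s, "--[", p, "]-->", o)
```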
  • From a little after the Ginger and Fred era until about the late 1990s, interchanges of computerised data tended to follow detailed discussion and agreement between the two parties about the data being exchanged: the data models of the provider’s and recipient’s systems, mappings, semantic relations, and so on. Data exchanges required Extraction, Transformation and Loading (ETL) stages, and this is still the case in many settings.
  • The RDF model removes much of the heavy lifting required in traditional ETL. “One often overlooked advantage that RDF offers is its deceptively simple data model. This data model trivializes merging of data from multiple sources and does it in such a way that data about the same things gets collated and de-duplicated. In my opinion this is the most important benefit of using RDF over other open data formats.” (Ian Davis, 2011)
  • This is an example of one RDF data set
  • And here is another
  • Data entities with the same identifier allow both data sets to merge at these points
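This merge behaviour can be sketched with plain Python sets standing in for RDF graphs, using the Bonnet/Sasha example from the slides (all URIs are illustrative placeholders):

```python
# Sketch: merging two RDF datasets is just set union of triples.
# A shared identifier (PET2) makes the graphs join at that node,
# and duplicate statements collapse automatically.

PET2 = "http://example.org/id/pet2"

graph_a = {
    ("http://example.org/id/bonnet", "http://example.org/def/owns", PET2),
    (PET2, "http://example.org/def/name", "Sasha"),
}
graph_b = {
    (PET2, "http://example.org/def/species", "ferret"),
    (PET2, "http://example.org/def/name", "Sasha"),  # duplicate of graph_a
}

merged = graph_a | graph_b  # union: join at PET2, de-duplicate for free
print(len(merged))
```

Two graphs of two statements each merge into three distinct statements, because the repeated name statement is collated away.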
  • Another approach to data integration, particularly within enterprises, is to develop one big pot into which all the organisation’s key data fits: the enterprise data warehouse. The problem with this is brittleness. The warehouse takes an age to design, and then one can only fill it with items it was built to contain. Anything that is the wrong ‘shape’ has to be rejected, refashioned (ETL), or else be so important as to be worth the cost of adding an extra wing to the warehouse to cater for it. Lots of analysis, coding and so on.
  • In the RDF world the containers for native RDF are promiscuous, just like file systems, accepting any RDF and not just that which fits a particular schema or pattern. [Imagine how restrictive a file system that could only store Word documents would be.] Adding new RDF statements with relationships that are not currently present in the dataset does not require the sort of preparatory work needed in the relational model, such as the addition of new join tables to the database.
  • In conventional data management situations the semantics of the data tend to be observed in the interface and expressed in the documentation.
  • RDF, in contrast, has explicit semantics for the relationships between entities or from an entity to a literal, and also provides a mechanism to build in the descriptions of individual classes of entities and descriptions of the dataset.
  • The final aspect of RDF that I want to focus on as a potential benefit is that RDF data is a ‘graph’ of nodes and edges which, when visualised as a set of circles and arrows, has a particular shape within which one can see clustering and sparseness in ways that are difficult to achieve with other models.
  • So RDF data is “5 star” because I don’t need a dialogue with a range of data providers to unambiguously join their datasets into a “supergraph” that I can then work with. I don’t necessarily need to modify my container beforehand to tailor it to accept the data: RDF models can be merged automatically in the absence of a schema. Datasets are, at least to a minimal level, self-describing: it is explicit which items are entities and which are properties/relationships, and what is the same and what is different. Data is collated and de-duplicated automatically. In addition to these arguments for the RDF model in data interchange, the increasing availability of open RDF Linked Data means that organisations not using these approaches will miss out: unable to use openly available RDF from multiple sources in its native, efficient form, they will have to reduce it to a semantically poorer form (JSON, CSV, etc.), which requires ETL steps and so on prior to use.
  • So I am going to give a tour of some resources and illustrations that might be of help to those individuals and organisations wanting to get started in the RDF world
  • First step: education/training at scale. The Euclid Project [ ]
  • Second step: starting to work with RDF publication. RDF is an efficient way to merge datasets from multiple sources unambiguously and relatively automatically. Think of it as an equivalent to RNA in biology: the main store of genetic information is held securely in another form (generally DNA, but also negative-strand or double-stranded RNA), but for working purposes (building proteins) that data is converted to a globally shared biological form, RNA. In the same way, it is possible to hold data as RDF all the time and work effectively with it, but there are many situations where that isn’t optimal, either for historic reasons (existing infrastructure that works effectively on other data models) or, as in highly transactional systems, because the RDF approach isn’t suited to the normal operations. Hand-written/scripted: one easy way to create some RDF is to write it by hand or with simple scripts. This is often used for learning about RDF or for developing new ideas; Xturtle, an Eclipse plugin, makes this job much smoother. Generating RDF with scripts can be done using a templating approach but, as with constructing XML using specific tools, there is a range of tools for constructing RDF statements in code: Java has the Jena and Sesame APIs, Python has RDFLib, Ruby has the RDF.rb gem, and several scripting languages have bindings for the Redland library, which is written in C.
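As a minimal illustration of the templating approach mentioned above (deliberately not using any of the libraries named), here is a hypothetical script that emits Turtle from rows of data; the prefixes, property names and the second row’s figures are invented for the example:

```python
# Sketch: generating Turtle by templating. Each input row becomes one
# subject with a label and a population. All names are illustrative.

rows = [
    {"id": "aberdeen-city", "label": "Aberdeen City", "population": 218220},
    {"id": "glasgow-city", "label": "Glasgow City", "population": 593245},
]

PREFIXES = ("@prefix ex: <http://example.org/def/> .\n"
            "@prefix id: <http://example.org/id/> .\n")

def row_to_turtle(row):
    # One subject, two properties, Turtle ';' shorthand for shared subject.
    return ('id:{id} ex:label "{label}" ;\n'
            "    ex:population {population} .").format(**row)

turtle = PREFIXES + "\n" + "\n\n".join(row_to_turtle(r) for r in rows)
print(turtle)
```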
  • OpenRefine + RDF plugin: an application that supports cleanup and conversion of a range of data types, including delimited text and spreadsheets, into RDF. There are also options to use ‘reconciliation services’: APIs that provide best-guess suggestions for widely used URIs for entities, based on text in your data. These reconciliation services come from Freebase (Google) and, recently, the Ordnance Survey.
  • Conversion from relational databases: there are several ways in which RDF can be published from relational databases. This is often a good way to get RDF out of an existing system with minimal hassle, and it can be a low-risk way into publishing some of your data as RDF. Methods that use wikis: information in Wikipedia factboxes eventually ends up as RDF data published by DBPedia during the biannual conversion process, and DBPedia Live attempts to keep abreast of the rapid rate of page updating on Wikipedia. Semantic MediaWiki is an extension of the MediaWiki software used for Wikipedia that has an underlying RDF model for its data; it provides routes for exporting both subsets of wiki pages and the whole wiki content as RDF. Drupal 7.0 outputs RDFa in core. RDFa (Resource Description Framework in Attributes) is a W3C Recommendation that allows embedding RDF metadata within Web documents; these RDF assertions can be ‘gleaned’ from the web pages by stylesheets and other ‘distillers’. However, RDFa isn’t used much ‘in the wild’ at the moment.
  • Relational-to-RDF mappers: these act as a “babelfish”, translating a relational database schema into an RDF model through a mapping procedure (the applications assist that process, but it often needs hand-finishing) and providing a query interface (i.e. these mapping applications create a SPARQL endpoint and return RDF, but the underlying data is maintained in a SQL database). This approach is ideal for providing RDF from an existing application that you don’t want to mess with but for which you want an RDF output. Examples of relational-to-RDF mapping software include D2R and the polyglot storage server Virtuoso. Examples of their use include the TellMeScotland open data publication and the EC Joinup pilot linking Belgian addressing data.
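Conceptually, what these mappers automate can be sketched in a few lines of Python: each row becomes a subject URI and each column an RDF property. The table name, columns and URI patterns below are invented placeholders, not the actual D2R or Virtuoso mapping syntax:

```python
# Sketch of the row-to-triples idea behind relational-to-RDF mapping:
# primary key -> subject URI, other columns -> property/value triples.

def map_row(table, pk_column, row):
    subject = "http://example.org/id/{}/{}".format(table, row[pk_column])
    return {
        (subject, "http://example.org/def/{}".format(col), str(val))
        for col, val in row.items()
        if col != pk_column
    }

# A hypothetical row from an 'authority' table:
row = {"authority_id": 42, "name": "Aberdeen City", "population": 218220}
triples = map_row("authority", "authority_id", row)
```

Real mappers declare this correspondence once per table in a mapping file and then answer SPARQL queries against the live SQL data, rather than materialising triples row by row.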
  • Triplestores: native RDF can be stored either as a graph in memory, or within a native RDF triplestore, a database specifically designed for RDF graph structures. These days we can get computers with huge amounts of RAM at relatively low cost.
  • Native RDF databases, a.k.a. triplestores: RDF can be stored as native triples in SQL databases, using long, skinny three-column tables for the triples with various indexes (SPO, OSP, etc.). This is the approach taken by the Jena SDB datastore, but to some extent it was a naive/simplistic approach using tools like MySQL and Postgres that were readily available in the early 2000s. Subsequent work has focused on native RDF datastores that don’t use tables in the SQL sense but use node tables and indexes, optimising both storage and search for the RDF model rather than reusing a more generalised data store. Examples include TDB, Sesame and Mulgara, all of which are Java applications, and 4Store, which is built in C and only easily compiled on Linux. Other approaches include column stores (e.g. Virtuoso and Vertica). So, if you are looking for an easy way of installing and using a triplestore, what is the best approach? The Apache Jena “TDB” triplestore with the Joseki or Fuseki SPARQL endpoint is one option I’ve used a lot. Other simple options I’ve had some experience of include Sesame, Mulgara, Bigdata, Virtuoso and 4Store; this is not a definitive list, and each has an associated SPARQL-over-HTTP query option. Your choice of triplestore will depend on your OS options (e.g. 4Store is built from source, which is easiest with Linux), how much RDF you are storing (usually measured in millions/billions of triples), and any additional functions (e.g. geo indexing is available with Virtuoso, Parliament and a small number of others; Allegrograph has social-networking statistics functions built in). One advantage of triplestores over SQL stores is that transferring data from one to another is simply a matter of outputting triples from one and loading them into the other, so the risk of picking the ‘wrong one’ to start with has limited negative consequences.
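All of these stores speak SPARQL over HTTP, which at its simplest is a GET request with a `query` parameter and an Accept header selecting the result format. A sketch using only the Python standard library, against a hypothetical local Fuseki endpoint (the request is built but deliberately not sent):

```python
# Sketch of a SPARQL-over-HTTP request. The endpoint URL is an
# assumption: a Fuseki server running locally with a dataset mounted
# at /dataset. No network call is made here.
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:3030/dataset/sparql"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

# The query travels URL-encoded in the 'query' parameter...
url = ENDPOINT + "?" + urlencode({"query": QUERY})
# ...and the Accept header asks for the SPARQL JSON results format.
req = Request(url, headers={"Accept": "application/sparql-results+json"})

# urllib.request.urlopen(req) would execute it; omitted to stay offline.
print(req.full_url)
```

Because the protocol is the same across stores, swapping one triplestore for another usually means changing only the endpoint URL.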
  • There are VM images available for some of the geo-capable triplestores – a very useful resource.
  • Linked Data APIs: when you have data in a triplestore, you don’t want to just leave potential users with a SPARQL endpoint; it’s daunting and unhelpful to many potential users of your data. A Linked Data API is a much more pleasant front end. A couple of examples include PublishMyData (mainly Ruby, example at ) and Elda (mainly Java, example at ).
  • A Linked Data API provides a faceted HTML view of your data and also helps resolve URIs based at your site to an HTML page. For example, the identifier for Victoria Quay is . If you put this into your browser you get redirected to an HTML page about Victoria Quay.
  • APIs also help return RDF describing resources in different machine-readable formats, either by responding to the HTTP “Accept” header, or by handling HTTP 303 redirects appropriately; e.g. one request returns NTriples and another returns RDF/XML representations of the same resource.
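The content-negotiation side can be sketched with the standard library: the same resource URI (a placeholder here) is requested with different Accept headers, and a conforming server would answer each with a 303 redirect to the document in the matching serialisation:

```python
# Sketch of content negotiation on a Linked Data resource URI.
# The URI and the mapping of names to media types are illustrative;
# the requests are constructed but not sent.
from urllib.request import Request

RESOURCE = "http://example.org/id/victoria-quay"

FORMATS = {
    "html": "text/html",                 # human-readable page
    "turtle": "text/turtle",             # Turtle serialisation
    "ntriples": "application/n-triples", # NTriples serialisation
    "rdfxml": "application/rdf+xml",     # RDF/XML serialisation
}

requests = {name: Request(RESOURCE, headers={"Accept": mime})
            for name, mime in FORMATS.items()}

for name, req in requests.items():
    print(name, "->", req.get_header("Accept"))
```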
  • FluidOps Workbench: a hybrid tool that provides a wiki interface for creating new content and also enables the import of data from various sources into a local Sesame triplestore. An example of the Workbench with Wikipedia/DBPedia data is at
  • This shows a timeline & animated GIF for the development of the Linked Open Data web over the past few years
  • DBPedia is at its heart
  • ...with very significant interlinkage with other datasets and a very healthy user base in some major projects
  • So that's the end of the tour
  • Here is an illustration of a federated SPARQL query that initially goes to the DBPedia endpoint <> and finds land-locked countries; it then takes those country identifiers (?country) and goes to the World Bank endpoint <> to find some more information about those countries

    1. Peter Winstanley: Holyrood Magazine Open Data Scotland: 10 December 2013
    2. [Chart: application integration total effort vs. effort spent on semantic issues]
    3. Resource Description Framework (RDF) • Initially a way of adding metadata to XML • Subject-Predicate-Object or Subject-Predicate-Literal triples. [Diagram: “Scotland has an Authority that is Aberdeen City”; “Aberdeen City has a Population with value ‘218,220’”]
    4. E T L: Extraction, Transformation and Loading
    5. “One often overlooked advantage that RDF offers is its deceptively simple data model. This data model trivializes merging of data from multiple sources and does it in such a way that data about the same things gets collated and deduplicated. In my opinion this is the most important benefit of using RDF over other open data formats.” (Ian Davis, 2011)
    6. [Diagram: a resource with the name “Bonnet”, living in Paris, owns Pet 2, which is called “Sasha”]
    7. [Diagram: Pet 2 is a ferret and has chicken as its favourite food]
    8. [Diagram: the two references to Pet 2 point to the same resource, so the graphs can merge]
    9. “Schema up front” design: Fact Table, Data Cube, EDW (enterprise data warehouse). Change is costly.
    10. In contrast... RDF triplestores: • promiscuous • schema-independent
    11. What do the joins mean?
    12. In contrast... RDF has explicit semantics.
    13. RDF - Gross Morphology of Network
    14. So RDF data is “5 star” because: no need for prior design discussion with data suppliers about data specification; no need to design the container before accepting data; datasets are self-describing; explicit semantics; merged datasets are collated and de-duplicated automatically.
    15. The Quick Tour: 1. Ed/training 2. Creation 3. Storage 4. Publishing 5. Use
    16. Education/training at scale • The Euclid Project • Module 1: Introduction and Application Scenarios • Module 2: Querying Linked Data • Module 3: Providing Linked Data • Module 4: Interaction with Linked Data • Module 5: Creating Linked Data Applications • Module 6: Scaling up
    17. Creating RDF. Hand written, or scripted
    18. Creating RDF.. GUI: plugins to output RDF; ‘reconciliation’ services available
    19. Creating RDF... Wikis • Use Wikipedia and let DBPedia work for you • Semantic MediaWiki - outputs RDF and can be linked to a triplestore directly • Drupal - creates RDFa which can be scraped; not very widely used.
    20. Creating RDF.... Relational to RDF mapping • D2R Server: accessing databases with SPARQL and as Linked Data • Virtuoso RDF Views
    21. Large In-Memory Triplestores?
    22. Native RDF Triplestores. • Apache Jena "TDB" • Used in ....
    23. Native RDF Triplestores.. • 4Store • Sesame • Mulgara • Bigdata. All provide SPARQL over HTTP, and native APIs
    24. Geospatial Triplestores • Virtuoso Universal Server (7.0, ColumnStore edition) • Parliament (2.7.4 quickstart) • uSeekM (1.2.0-a5, on top of PostgreSQL 8.4 and PostGIS 1.5) • OWLIM-SE (Trial version 5.3.5849) • Strabon (3.2.3, on top of PostgreSQL 8.4 and PostGIS 1.5). Xen VMs for each available in Debian 6. Dr. Jens Lehmann, Uni Leipzig
    25. Linked Data API. PublishMyData Linked Data API
    26. Linked Data API.. ELDA Linked Data API
    27. Linked Data API... Entity Resolution: Victoria Quay is ...resolves
    28. Linked Data API.... Different serialisations [JSON, NT, RDF/XML etc]: HTTP "Accept" headers - e.g. "application/json"; 303 re-directs
    29. Linked Data API..... • SPARQL is for experts PREFIX rdf: <> PREFIX sepaw: <> PREFIX geo: <> PREFIX rdfs: <> PREFIX sepaloc: <> PREFIX sepaw: <> CONSTRUCT {?item sepaw:waterBodyId ?___0 . ?item sepaw:wiseCode ?___1 . ?item sepaw:inRiverBasinDistrict ?___2 . ?___2 rdfs:label ?___3 . ?item geo:lat ?___4 . ?item sepaw:category ?___5 . ?item sepaw:inSubBasinDistrict ?___6 . ?___6 rdfs:label ?___7 . ?item rdfs:label ?___8 . ?item sepaw:lengthKm ?___9 . ?item sepaw:currentOverallClassification ?___10 . ?item sepaloc:unitaryAuthority ?___11 . ?item geo:long ?___12 . ?item sepaw:inCatchment ?___13 . ?___13 rdfs:label ?___14 . ?item sepaw:currentClassificationYear ?___15 . ?item sepaloc:postcodeDistrict ?___16 . ?item sepaw:areaSqKm ?___17 . } WHERE { {SELECT ?item WHERE { ?item rdf:type sepaw:SurfaceWaterBody . } OFFSET 0 LIMIT 10 }{ ?item sepaw:waterBodyId ?___0 . } UNION { ?item sepaw:wiseCode ?___1 . } UNION {{ ?item sepaw:inRiverBasinDistrict ?___2 . } OPTIONAL { { ?___2 rdfs:label ?___3 . } }} UNION { ?item geo:lat ?___4 . } UNION { ?item sepaw:category ?___5 . } UNION {{ ?item sepaw:inSubBasinDistrict ?___6 . } OPTIONAL { { ?___6 rdfs:label ?___7 . } }} UNION { ?item rdfs:label ?___8 . } UNION { ?item sepaw:lengthKm ?___9 . } UNION { ?item sepaw:currentOverallClassification ?___10 . } UNION { ?item sepaloc:unitaryAuthority ?___11 . } UNION { ?item geo:long ?___12 . } UNION {{ ?item sepaw:inCatchment ?___13 . } OPTIONAL { { ?___13 rdfs:label ?___14 . } }} UNION { ?item sepaw:currentClassificationYear ?___15 . } UNION { ?item sepaloc:postcodeDistrict ?___16 . } UNION { ?item sepaw:areaSqKm ?___17 . } }
    30. Linked Data API.... • Linked Data API makes it easy
    31. FluidOps Workbench & FedX • Built on top of Sesame RDF store • Wiki-like structure for interaction • Data pipelined in from external SPARQL and other sources • Includes widgets, graph views, facet views etc for interacting with the aggregated data
    32. What RDF data is "out there" already?
    33. DBPedia - at the heart of Open Data (September 2013): 45 million interlinks with Freebase, OpenCyc, UMBEL, GeoNames, Musicbrainz, CIA World Fact Book, DBLP, Project Gutenberg, DBtune, Jamendo, Eurostat, Uniprot, Bio2RDF, US Census data. Also used in Thomson Reuters OpenCalais, New York Times Linked Open Data, Zemanta API, DBpedia Spotlight, BBC datasets
    34. Quick Test • http://localhost:3030/sparql-editor.tpl
        SELECT ?country ?country_name ?capital ?pop ?p ?x ?q ?w
        WHERE {
          SERVICE <> {
            ?country a type:LandlockedCountries ;
                     rdfs:label ?country_name ;
                     prop:populationEstimate ?pop ;
                     prop:capital ?capital .
            FILTER ( lang(?country_name) = 'en' )
          }
          SERVICE <> {
            OPTIONAL { ?p ?x ?country . ?p ?q ?w . }
          }
        } LIMIT 10