Good afternoon everyone. I wanted to take the opportunity today to tell you about the DataIncubator project: an overview of what the project is about and some of the datasets we've accumulated so far.
This community has had a huge amount of success in bootstrapping the Linked Data cloud. There's a definite sense of momentum, and the size of the audience today testifies to the growing interest in the technology. Personally I think that the key challenges ahead relate to demonstrating the usefulness of that data. Now that we have it, what can we do with it? BUT we shouldn't lose sight of the fact that there's still a huge amount of evangelism to be done, and a great deal of data that could still be part of the web of data. In short we need to keep up the process of accumulating data in as many different subject areas and disciplines as possible.
The DataIncubator project aims to help achieve that by adopting the same “show, don't tell” approach that has worked to date: actually converting some data, and showing how it can be used and inter-linked. Practical evangelism, you might call it.
A goal of the project is to come up with a sustainable way to manage these dataset conversions: one that makes it increasingly easy to carry them out and to show the benefits, but also, crucially, that makes it possible for the original owners or curators of the data to build on the initial community efforts. The DataIncubator project was started by Ian Davis, and Ian, myself, and a number of our other colleagues have been involved in getting the first datasets converted. The project isn't a formal Talis project though. It just happens that in our off hours from building a platform for publishing Linked Data, we like to relax by, erm, publishing Linked Data. (I'm really trying not to think about that too closely.)
But Talis is supporting the project through the Connected Commons scheme. This is an initiative that we launched about six months ago with the aim of helping to support these kinds of bootstrapping efforts, as well as providing a sustainable approach for publishing open, public domain data, something we're quite passionate about. The deal with the scheme is that if your data is published using one of the two available open data licences, then you can use the Platform for free. This gives you an online triple store, complete with a SPARQL endpoint and an integrated search engine, which you can use without any service usage limits. There's currently a soft limit of 50 million triples; this is arbitrary, but provides plenty of space to play with. So this is one way the project is achieving sustainability: by building on this offer from Talis.
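As a rough sketch of what that gives you, a query against a store's SPARQL endpoint might look like the following; the endpoint URL and the data it would return are hypothetical, purely to illustrate the shape of the service:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Sent to a hypothetical Platform store endpoint, e.g.
#   http://api.talis.com/stores/example/services/sparql
# Lists up to ten labelled resources from the store.
SELECT ?resource ?label
WHERE { ?resource rdfs:label ?label }
LIMIT 10
```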
There are also some community norms that we're hoping to build around the process of converting and publishing datasets. Hopefully some of these can carry through...
The first of these is to ensure that there is a sufficient amount of linking and attribution. Every dataset should reference its original sources, not just at a high level, e.g. in the VoID description, but at a deeper level, so resources can be associated with, for example, the original web pages that describe them. Attribution is an important community norm that we should be adopting anyway, but it's especially important in this context because we want to ensure that the original curators of the data don't think the community is trying to appropriate or steal their work. Quite the opposite: we want them to embrace it.
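To make that concrete, here's a small Turtle sketch of the kind of attribution we mean. The URIs, and the choice of dcterms:source and foaf:isPrimaryTopicOf as the linking properties, are illustrative rather than the exact terms every incubated dataset uses:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix void:    <http://rdfs.org/ns/void#> .

# High-level attribution in the VoID description (URIs are hypothetical)
<http://discogs.dataincubator.org/> a void:Dataset ;
    dcterms:source <http://www.discogs.com/> .

# Deeper, per-resource attribution: each converted resource links
# back to the original web page that describes it
<http://discogs.dataincubator.org/artist/1234>
    foaf:isPrimaryTopicOf <http://www.discogs.com/artist/1234> .
```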
The other aim is to ensure that both the code and the data are open. The original data owners can build on the work of the community by making use of the effort put into the modelling and inter-linking of the data with other sources. This helps to lower the barrier to entry for an organization wanting to expose Linked Data: a lot of the groundwork is already in place. By ensuring that the data conversion and hosting code is open, there's also potential for the original owner to use that too. This is unlikely, especially if the code is just scraping a website. But open code also means that if the original evangelist in the DataIncubator community gets bored or moves on, other people can build on their work much more easily. This spreads the maintenance cost and makes it easier to manage the converted dataset. There are quite a few RDF and Linked Data conversions littered around the web that use the same data but apply different models, and it's hard to tell which is most used or which project is most active. We should be taking steps to avoid that.
The other aspect is to ensure that the data links are as stable as possible. The ideal outcome is that a data owner will see the benefits of embracing Linked Data, at which point the data in DataIncubator and the Platform becomes obsolete: we won't need a secondary source any longer. So the project aims to support this by committing to redirect URIs from the incubated data to its new primary home.
OK, so that's the basics of what we're attempting: a continuation of the Linking Open Data project, but with an eye on making it easier to build a community of people around managing the conversion of specific datasets. But what's actually been done so far?
One of my projects is a conversion of the NASA Space Science Dataset. This currently provides access to over 6,000 space flight launches and descriptions of a slightly higher number of satellites. My overall goal is to try to knit together the wide range of open data published by NASA into a more coherent whole; currently it's scattered across a number of different sites and services. I've also supplemented the data with information on the Apollo missions, so it's possible to find out who played which role on which mission. Should be heaven for space geeks. Pun intended.
The OpenLibrary project provides a number of data exports, including Linked Data. But despite some recent improvements, the modelling of the data is not ideal. OpenLibrary also only exposes a subset of the underlying bibliographic data, some of which was originally donated to the project by Talis. This incubation effort aims to show some alternative ways that the OL data could be modelled and published.
The Linked Periodicals dataset currently holds data about academic journals and publications. The dataset is a merge of data from the National Library of Medicine, Highwire, and CrossRef (a not-for-profit that works in the publishing industry). The project was started as a result of someone in the research community looking for an integrated set of journal data. So we took the NLM data and converted that. Highwire then offered up their journal lists for us to add to it, and CrossRef have done the same for their publisher lists. CrossRef are also relicensing the original data to make it more open as a result of this. A nice small example of the process in action.
Discogs is a conversion of the discogs.com community-managed music database. It's similar in scope to MusicBrainz, but comes from a community oriented towards trading and selling records and music. The core data is in the public domain, and so far I've converted and loaded data on over a million artists and 600,000 different music labels. There is a huge amount of additional data about music releases, tracks, and the roles that artists have played in writing, recording, and performing the music. I've put up around 250,000 releases' worth of data so far to get some feedback on the modelling.
Finally there is the Airports dataset, a conversion of the data published by ourairports.com. This is another community-managed dataset covering thousands of different airports, with a rich set of information including runway details, links to Yahoo Weather reports, etc.
So that's where we're at currently. I've got a personal wish-list of additional datasets I'd like to convert, and this seems like the ideal audience to ask for help.
The first is the Prelinger Archives. This is part of the Internet Archive (which is a whole other dataset of its own). The archives consist of over 2,000 industrial, educational, travel, and propaganda videos published from 1903 to the 1970s. The content is completely in the public domain, so it's just begging to be converted. It would be a great dataset on which to explore the modelling of media, media annotations, and the like.
The other is Lego. It's a continual disappointment to me that there's no Lego data on the Linked Data web. Surely the confluence of geekery involved demands that it happens? To my joy and shame I've already scoped this one out, and there's a huge amount of open data, ranging from the Pantone colours of individual Lego bricks through to complete parts lists for every Lego set, all of which has been crowd-sourced. Wouldn't it be cool to be able to navigate all that? And then maybe someone can apply some reasoning to tell me which Lego sets I can build with the parts I already have?
I've even started making inroads into evangelising to the core Lego community. Here's the result of me trying to teach my son the core RDF model. We were using Star Wars because he's a domain expert in that. The irony is that within a few minutes he was criticising my modelling.
So that's enough from me. If you're interested in learning more about DataIncubator, the Talis Connected Commons, or the Platform, then I'll be around for the rest of the day.
DataIncubator.org: What Is It? And What's In It? Leigh Dodds, London Linked Data Meetup, 9th September 2009. Licence: http://creativecommons.org/licenses/by/2.0/uk/