1. Toward a Frictionless Data Future
PRESENTED BY
Jo Barratt
jo.barratt@okfn.org (@jobarratt/@okfnlabs)
AT
3rd Research Data Network - St Andrews University - 30 November 2016
Licensed under cc-by v3.0 (any jurisdiction)
2. International non-profit founded in 2004
Who we are
● Vision
o A world where open knowledge is ubiquitous, enabling citizens and organizations to create insights that drive change on global and local challenges, combat injustice and inequality, and hold governments and corporations to account.
● Mission
o Open up all essential, public interest information and
see it used to create insight that drives positive change
o Build communities, tools and skills to empower
individuals and organizations to use open information to
create insights that drive change.
3. Widely adopted - over 20 national governments
and 60+ local governments & cities
ckan.org/instances
5. Frictionless Data is…
● Lightweight specifications for “packaging” datasets
● Integrations for loading datasets into tools and platforms relevant to
researchers
The Goals...
● Introduce a significant, measurable improvement in how research data is
shared, consumed, and analyzed.
● Make it easier to maintain and improve data quality.
18. Continuous Validation
● If you’re working in a group, you need continuous validation… for data!
● In < 1 hour, we integrated elements (datapackage.json + Python libraries +
GoodTables API) to support continuous data validation
Hello. I’m Jo Barratt
I’m the project manager for the frictionless Data project at Open Knowledge.
I have worked as a journalist and in branding, and I tell you this just by way of explaining that I'm not a coder.
So I'm here to present an overview of the technology and some of the tooling and projects we are working on. But please do come and talk to me, because there is a team of people really keen to talk to you about this.
Open Knowledge. Here's a bit of info for those of you who might not have come across us before.
Open Knowledge is a non-profit founded in 2004. It's based in the UK, though our team is spread around the world. I'm in London, where a few other people are based, and we have people in Berlin, Tel Aviv, and Addis Ababa, to name but a few.
Our vision is a world where open knowledge is ubiquitous, and citizens and organizations are able to create insights from knowledge that drive change.
We’ve had over a decade of experience working with organizations to unlock the value of their data.
A lot of our work is based in government.
CKAN is the world's leading open data portal platform.
I probably don’t need to do very much to persuade people here of the value of open data. But I’m going to start with a story because it shows firstly the value of opening data, but also a bigger issue, which is one we are looking to solve with the work we are doing now.
And it's a true story. It shows how the UK Government saved four million pounds in fifteen minutes.
As citizens, it's quite easy for most people to understand how data can benefit them.
Even if data exists, and there is a willingness to share it, friction may prevent it "flowing" to where it is most needed and most valuable. I'm sure, working with universities, you understand this.
Universities AND governments are prime examples of institutions that often show great willingness to make data available, but whose culture or infrastructure makes this difficult.
In this example, although it is open data, we're actually talking about sharing data within government.
In 2010, the UK began publishing transactional government spending data as open data.
This was new! Nobody yet knew the benefits.
Three and a half years later, Liam Maxwell, the UK Government CTO, has an idea: maybe people in other departments are using a report he's interested in, and so are duplicating the purchase needlessly.
In the old days this would not be worth his time.
Open data = Problem solved right?
Not quite. Each department published their data in separate monthly CSV files. It was difficult to understand and laborious to search through to find the information he was looking for.
But luckily, someone had conveniently made these searchable as a whole on the OpenSpending website.
Three clicks.
Total estimated savings from eliminating that duplication: just over 4 million pounds. Open data had allowed him to turn a question into insight in minutes - and saved the government and taxpayer money in doing so.
So the innovation is not that the data is open. But the fact that it was open meant that another organisation could come along and build something which linked it all up.
So the real value is that it is possible to find the data and use it.
People have knowledge. We know things. This needs to flow between people in order to make use of this information, build on it, drive change, and support the projects we are working on. We need a better way of matching everything up.
Look at academic research. Universities are the places where human knowledge is advanced and pushed to the next level. There are a lot of people working on a lot of different things across the world, but how is this information shared? A paper is published and you have to hope that somebody reads it and then tells somebody else about it.
There is so much out there that is not being used.
ANYBODY should be able to use information to generate insight.
The best thing to do with your data will be thought of by someone else
Frictionless Data is about removing the friction in working with data.
We are developing a set of tools, specifications, and best practices for processing, describing, and publishing data.
The heart of this project is the “Data Package”, a containerization format that’s based on existing practices for publishing open-source software.
But now, let's look a little closer at the specific problems we have working with data. Openness is not the only one.
There is more and more data being opened up, and more and more data available in all sorts of different places and formats. But there are additional problems involved in getting hold of, sharing, and using this data in research - or for whatever you want to use it for.
There is FRICTION in the process.
And we've done a rough and ready proportional representation of this. On the left here are all the things which are stopping people from wanting to release the data. The reasons for this might be economic, or political.
And we've made huge progress here in the last decade. From my experience working with government, this feels like a battle which is already won. It's the other side of the chart which is a bit more of a worry.
Hard to discover or find.
Structure is poor and/or needs significant manipulation to be usable, so even when you do find data, it is usually a manual process to connect it with your tool.
There is no standardized schema, and different data sources are hard to compare or integrate.
And it might be impossible to get the data into the tool you are using, which might mean laborious time spent copying data from one tool to another.
In 1955 shipping was slow, it was dangerous and it was complicated.
But then came modern shipping containers.
On 26 April 1956, the Ideal X made its maiden voyage from Port Newark to Houston stacked with 58 metal containers.
They were taking orders before the ship had docked, and the enterprise expanded into what became known as Sea-Land Service. Suddenly a ship could be unloaded with a tiny proportion of the men it took before, reducing the cost and, with it, the price.
It wasn’t long before other companies began to adopt the same specifications.
Use increased quickly, and because it made sense for everybody, standardisation soon followed.
In 1961, the ISO set the sizes for shipping containers which are still pretty much in use today.
Data is Shipping Pre-Containerization
Which is where the Frictionless Data project and Data Packages come in
The Data Package is the metal box.
Frictionless Data is the cranes.
The main idea behind the Data Package format is to create a common interchange format: publishers can publish their data as a Data Package while consumers can integrate Data Packages into their research workflow.
They can just plug it in. Whether that involves an SQL database, BigQuery, analysis using Python, pandas, or Excel.
With Data Packages, data publishers only need to support this common container format to simplify export to an ever-increasing number of other tools and services.
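As a rough sketch of the consumer side, assuming the Python datapackage library (the package URL and resource name below are invented for illustration), reading a published Data Package looks like this:

    # A minimal consumer-side sketch, assuming the Python `datapackage`
    # library (pip install datapackage). The URL and resource name are
    # invented examples, not real published data.
    from datapackage import Package

    package = Package("https://example.org/uk-spend/datapackage.json")
    resource = package.get_resource("spend-2016")
    rows = resource.read(keyed=True)  # values come back typed per the schema
    print(rows[0])

From there, the same rows can flow straight into pandas, SQL, or whatever tool is already in the workflow.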
Getting these elements of the global data infrastructure right can reduce the friction experienced by researchers who work with data. We believe this will result in improved data quality, use, and sharing leading to more insight.
Now to run through some of the key points about our approach here.
They are simple. They use the most basic formats.
They are web-orientated: they use formats that are web "native" (JSON) and work naturally with HTTP (e.g. CSV streams).
We have designed the data package to work as easily as possible with existing tools. Everyone has tools to use CSV, and it's supported by almost every language.
Why would we want to change the approach here?
They are open. Anyone should be able to freely and openly use and reuse what we build. Our community is open to everyone.
And they are Distributed: This is not about creating a central data registry, but rather a basic framework that would enable anyone to publish and use high quality datasets more easily.
Through this approach, we aim to revolutionize how research data is shared, consumed, and analyzed while also enabling massive improvements in data quality.
Here is a DP – or a representation of one. In real life, this is code.
This is a "tabular" data package, which is designed specifically for tabular data.
There are data packages for different groups: different ways of working with data, different common metadata, and so on.
The data package is the box (the carton), with some basic descriptions of the data.
The CSV is the data itself. Very few or no changes are made to the existing data; it all goes into the data package.
The JSON Table Schema lets us know what we expect to see in the CSV file.
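As a rough sketch of what that code looks like (file and field names here are invented for the example), building and saving a tabular data package with the Python library takes a few lines:

    # A rough illustration of a tabular Data Package, built with the Python
    # `datapackage` library. File and field names are invented examples.
    from datapackage import Package

    package = Package({
        "name": "example-spend-data",
        "resources": [{
            "name": "spend-2016",
            "path": "spend-2016.csv",   # the CSV stays exactly as it is
            "schema": {                 # the (JSON) Table Schema
                "fields": [
                    {"name": "date", "type": "date"},
                    {"name": "department", "type": "string"},
                    {"name": "amount", "type": "number"}
                ]
            }
        }]
    })
    package.save("datapackage.json")  # writes the descriptor next to the data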
And once you have the data package, you can play with all the tools we are creating.
Or import to other tools directly.
Just like with shipping containerisation, it's not the metal box itself which is the innovation.
Here is one of the first tools we have built: goodtables.
Much of the "friction" in using data comes from the time and effort needed to identify and address errors before analyzing the data in a given tool.
So we're focusing on this early stage in the process, to remove the friction.
There are two things you can do here.
You can upload a dataset, which will be tested for structural errors in the table (e.g. missing headers, blank rows, etc.).
And you have the option to test against a schema, which predefines what we expect to be in the fields.
Now upload…
Goodtables shows what is wrong with your data, and how to fix it, in a user-friendly fashion.
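The same two checks can also be run from code. Here is a minimal sketch assuming the goodtables Python library, with invented file names:

    # A minimal sketch using the `goodtables` Python library
    # (pip install goodtables). File names are invented examples.
    from goodtables import validate

    # Structural checks only: missing headers, blank rows, duplicates, etc.
    report = validate("survey-responses.csv")

    # Or also check against a schema predefining what each field should hold.
    report = validate("survey-responses.csv", schema="survey-schema.json")

    print(report["valid"])
    for table in report["tables"]:
        for error in table["errors"]:
            print(error["message"])  # human-readable description of the problem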
So again, this is not big data, not complicated data, but it vastly reduces the friction involved, and if we fix this part of the chain, it makes the job of people further along a lot more straightforward!
For example, we are exploring this in a working group we have set up around archaeological data. They are really quite excited about the difference this can make in the field, at the stage the data is becoming digitised.
But now there's another step in the process, which is CONTINUOUS validation, and the benefits are multiplied.
Software projects have long benefited from Continuous Integration services like Travis CI for making sure code is of a high quality.
With every update to a bit of code, tests are automatically run and a report is generated in the project's shared repository.
And developers can find and resolve errors quickly and reliably.
As with software, datasets are created, edited, and updated over time, and by different people.
And with continuous validation using goodtables, we do exactly the same: a set of tests is run on the data.
If "bad" data is used, the "build" fails and issues a report indicating what went wrong.
So it can be fixed and we can build stronger, better data.
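As a sketch of what that step can look like (the wiring to a CI service such as Travis CI is assumed rather than shown), a short script run on every commit is enough:

    # A sketch of a continuous-validation step, assuming `goodtables` is
    # installed and a datapackage.json sits in the repository. A CI service
    # would run this script on every update to the data.
    import sys
    from goodtables import validate

    report = validate("datapackage.json")  # checks each resource in the package
    if not report["valid"]:
        for table in report["tables"]:
            for error in table["errors"]:
                print(error["message"])
        sys.exit(1)  # a non-zero exit fails the "build", flagging the bad data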
This is not just for science or maths subjects where you will eventually go on and input the data into a complex analysis tool. Imagine you have a series of questionnaires on a Google Form. You can use goodtables to instantly check a survey, which could save you hours and hours of time reorganising your data.
Data Quality Dashboard
Something we built for the UK government, to compare the quality of openly published data across departments. You can tell just by looking at this that the issue is not the openness of the data, but its quality.
This can do several things.
It keeps organisations in check.
And it allows people to get better, by pointing to specific issues with the data.
We are also building a registry which will allow you to publish data from the command line, and which will support much of the other work we are doing.
ALL FREE OPEN SOURCE.
Our overall mission is to make it easier to develop tools and services for working with data and also to ensure greater interoperability between new and existing tools and services.
A KEY part of it is to speak to people and find out what they need to help make their lives easier working with data.
We now already support import/export for:
CKAN, BigQuery, AWS Redshift, SQL, …
We are building libraries in Python, Ruby, MATLAB, and R which will allow users to easily get data into a proper backend for further use.
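As one hedged example of what those libraries enable (the database URL and file paths here are invented), packaged data can be pushed into a SQL backend in a few lines by combining the Python datapackage library with pandas and SQLAlchemy:

    # A sketch of loading Data Package resources into SQL, combining the
    # Python `datapackage` library with pandas and SQLAlchemy. The database
    # URL and file paths are invented for illustration.
    import pandas as pd
    from sqlalchemy import create_engine
    from datapackage import Package

    package = Package("datapackage.json")
    engine = create_engine("sqlite:///research.db")

    for resource in package.resources:
        # Each resource becomes one table, typed according to its schema.
        df = pd.DataFrame(resource.read(keyed=True))
        df.to_sql(resource.name, engine, if_exists="replace", index=False)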
And what else?
You tell us.
In piloting our approach, we are placing a particular emphasis on supporting researchers in addressing their existing data needs across various scientific disciplines.
We are running targeted pilots to trial these tools and specifications on real data
Are you a researcher looking for better tooling to manage your data?
Are you working on research data and would like to work with us on issues for which data packages are suited?
Are you a developer and have an idea for something we can build together?
Talk to us. We have time and attention to give you (and funding!)