1. Toward a Frictionless Data Future
PRESENTED BY
Jo Barratt
jo.barratt@okfn.org (@jobarratt/@okfnlabs)
AT
3rd Research Data Network - St Andrews University - 30 November 2016
Licensed under cc-by v3.0 (any jurisdiction)
2. International non-profit founded in 2004
Who we are
● Vision
o A world where open knowledge is ubiquitous, enabling citizens and organizations to create insights that drive change on global and local challenges, combat injustice and inequality, and hold governments and corporations to account.
● Mission
o Open up all essential, public interest information and
see it used to create insight that drives positive change
o Build communities, tools and skills to empower
individuals and organizations to use open information to
create insights that drive change.
3. Widely adopted - over 20 national governments
and 60+ local governments & cities
ckan.org/instances
5. Frictionless Data is…
● Lightweight specifications for “packaging” datasets
● Integrations for loading datasets into tools and platforms relevant to
researchers
The Goals...
● Introduce a significant, measurable improvement in how research data is
shared, consumed, and analyzed.
● Make it easier to maintain and improve data quality.
18. Continuous Validation
● If you’re working in a group, you need continuous validation… for data!
● In < 1 hour, we integrated elements (datapackage.json + Python libraries +
GoodTables API) to support continuous data validation
Hello. I’m Jo Barratt
I’m the project manager for the frictionless Data project at Open Knowledge.
I have worked as a journalist and in branding, and I tell you this just by way of explaining that I'm not a coder.
So I'm here to present an overview of the technology and some of the tooling and projects we are working on. But please do come and talk to me, because there is a team of people really keen to talk to you about this.
Open Knowledge. Here's a bit of info for those of you who might not have come across us before.
Open Knowledge is a non-profit founded in 2004. It's based in the UK, though our team is spread around the world. I'm in London, where a few other people are based, and we have people in Berlin, Tel Aviv, and Addis Ababa, to name but a few.
Our vision is a world where open knowledge is ubiquitous, and citizens and organizations are able to create insights from knowledge that drive change.
We’ve had over a decade of experience working with organizations to unlock the value of their data.
A lot of our work is based in government.
CKAN is the world's leading open data portal platform.
I probably don’t need to do very much to persuade people here of the value of open data. But I’m going to start with a story because it shows firstly the value of opening data, but also a bigger issue, which is one we are looking to solve with the work we are doing now.
And it's a true story. It shows how the UK Government saved four million pounds in fifteen minutes.
As citizens, it's quite easy for most people to understand how data can benefit them.
Even if data exists, and there is a willingness to share it, friction may prevent it "flowing" to where it is most needed and most valuable. I'm sure, working with universities, you understand this.
Universities AND governments are prime examples of institutions that often show great willingness to make data available, but whose culture or infrastructure makes this difficult.
In this example, although it is open data, we're actually talking about sharing data within government.
In 2010, the UK began publishing transactional government spending data as open data.
This was new! Nobody yet knew the benefits.
Three and a half years later, Liam Maxwell, the UK Government CTO, has an idea: maybe people in other departments are using a report he's interested in, and so are duplicating the purchase needlessly.
In the old days this would not be worth his time.
Open data = Problem solved right?
Not quite. Each department published their data in separate monthly CSV files. It was difficult to understand and laborious to search through to find the information he was looking for.
But luckily, someone had conveniently made these searchable as a whole on the OpenSpending website.
Three clicks.
Total estimated savings from eliminating that duplication: just over 4 million pounds. Open data had allowed him to turn a question into insight in minutes - and saved the government and taxpayer money in doing so.
So the innovation is not that the data is open. But the fact that it was open meant that another organisation could come along and build something which linked it all up.
So the real value is that it is possible to find the data and use it.
People have knowledge. We know things. This needs to flow between people in order to make use of this information, build on it, drive change, and support the projects we are working on. We need a better way of matching everything up.
Look at academic research. Universities are the places where human knowledge is advanced and pushed to the next level. There are a lot of people working on a lot of different things across the world, but how is this information shared? A paper is published and you have to hope that somebody reads it and then tells somebody else about it.
There is so much out there that is not being used.
ANYBODY should be able to use information to generate insight.
The best thing to do with your data will be thought of by someone else
Frictionless Data is about removing the friction in working with data.
We are developing a set of tools, specifications, and best practices for processing, describing, and publishing data.
The heart of this project is the “Data Package”, a containerization format that’s based on existing practices for publishing open-source software.
But now, let's look a little closer at the specific problems we have working with data. Openness is not the only one.
There is more and more data being opened up, and more and more data available in all sorts of different places and formats. But there are additional problems involved in getting hold of, sharing, and using this data in research - or for whatever you want to use it for.
There is FRICTION in the process.
And we've done a rough and ready proportional representation of this. On the left here are all the things which are stopping people from wanting to release the data. The reasons for this might be economic, or political.
And we've made huge progress here in the last decade. From my experience working with government, this feels like a battle which is already won. It's the other side of the chart which is a bit more of a worry.
Hard to discover or find.
Structure is poor and/or needs significant manipulation to be usable, so even when you do find data, it is usually a manual process to connect it with your tool.
There is no standardized schema, and different data sources are hard to compare or integrate.
And it might be impossible to get the data into the tool you are using, which might mean laborious time spent copying data from one tool to another.
In 1955 shipping was slow, it was dangerous and it was complicated.
But then came modern shipping containers.
On 26 April 1956, the Ideal X made its maiden voyage from Port Newark to Houston stacked with 58 metal containers.
They were taking orders before the ship had docked, and the enterprise expanded into what became known as Sea-Land Service. Suddenly a ship could be unloaded with a tiny proportion of the men it took before, reducing the cost and, with it, the price.
It wasn’t long before other companies began to adopt the same specifications.
Use increased quickly, and because it made sense for everybody, standardisation soon followed.
In 1961, the ISO set the sizes for shipping containers which are still pretty much in use today.
Data is Shipping Pre-Containerization
Which is where the Frictionless Data project and Data Packages come in
The Data Package is the metal box.
Frictionless Data is the cranes.
The main idea behind the Data Package format is to create a common interchange format: publishers can publish their data as a Data Package while consumers can integrate Data Packages into their research workflow.
They can just plug it in. Whether that involves an SQL database, BigQuery, analysis using Python, pandas, or Excel.
With Data Packages, data publishers only need to support this common container format to simplify export to an ever-increasing number of other tools and services.
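As a rough sketch of the consumer side, assuming the Python datapackage library (the package URL and resource name below are invented for illustration), reading a published Data Package looks like this:

    # A minimal consumer-side sketch, assuming the Python `datapackage`
    # library (pip install datapackage). The URL and resource name are
    # invented examples, not real published data.
    from datapackage import Package

    package = Package("https://example.org/uk-spend/datapackage.json")
    resource = package.get_resource("spend-2016")
    rows = resource.read(keyed=True)  # values come back typed per the schema
    print(rows[0])

From there, the same rows can flow straight into pandas, SQL, or whatever tool is already in the workflow.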
Getting these elements of the global data infrastructure right can reduce the friction experienced by researchers who work with data. We believe this will result in improved data quality, use, and sharing leading to more insight.
Now to run through some of the key points about our approach here.
They are simple. They use the most basic formats.
They are web-orientated: they use formats that are web "native" (JSON) and work naturally with HTTP (e.g. CSV streams).
We have designed the data package to work as easily as possible with existing tools. Everyone has tools to use CSV, and it's supported by almost every language.
Why would we want to change the approach here?
They are open. Anyone should be able to freely and openly use and reuse what we build. Our community is open to everyone.
And they are Distributed: This is not about creating a central data registry, but rather a basic framework that would enable anyone to publish and use high quality datasets more easily.
Through this approach, we aim to revolutionize how research data is shared, consumed, and analyzed while also enabling massive improvements in data quality.
Here is a DP – or a representation of one. In real life, this is code.
This is a "tabular" data package, which is designed specifically for tabular data.
There are data packages for different groups: different ways of working with data, different common metadata, and so on.
The data package is the box (the carton), with some basic descriptions of the data.
The CSV is the data itself. Very few or no changes are made to the existing data; it all goes into the data package.
The JSON Table Schema lets us know what we expect to see in the CSV file.
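As a rough sketch of what that code looks like (file and field names here are invented for the example), building and saving a tabular data package with the Python library takes a few lines:

    # A rough illustration of a tabular Data Package, built with the Python
    # `datapackage` library. File and field names are invented examples.
    from datapackage import Package

    package = Package({
        "name": "example-spend-data",
        "resources": [{
            "name": "spend-2016",
            "path": "spend-2016.csv",   # the CSV stays exactly as it is
            "schema": {                 # the (JSON) Table Schema
                "fields": [
                    {"name": "date", "type": "date"},
                    {"name": "department", "type": "string"},
                    {"name": "amount", "type": "number"}
                ]
            }
        }]
    })
    package.save("datapackage.json")  # writes the descriptor next to the data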
And once you have the data package, you can play with all the tools we are creating.
Or import to other tools directly.
Just like with shipping containerisation, it's not the metal box itself which is the innovation.
Here is one of the first tools we have built: goodtables.
Much of the "friction" in using data comes from the time and effort needed to identify and address errors before analyzing the data in a given tool.
So we're focusing on this early stage in the process, to remove the friction.
There are two things you can do here.
You can upload a dataset, which will be tested for structural errors in the table (e.g. missing headers, blank rows, etc.).
And you have the option to test against a schema, which predefines what we expect to be in the fields.
Now upload…
Goodtables shows what is wrong with your data, and how to fix it, in a user-friendly fashion.
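The same two checks can also be run from code. Here is a minimal sketch assuming the goodtables Python library, with invented file names:

    # A minimal sketch using the `goodtables` Python library
    # (pip install goodtables). File names are invented examples.
    from goodtables import validate

    # Structural checks only: missing headers, blank rows, duplicates, etc.
    report = validate("survey-responses.csv")

    # Or also check against a schema predefining what each field should hold.
    report = validate("survey-responses.csv", schema="survey-schema.json")

    print(report["valid"])
    for table in report["tables"]:
        for error in table["errors"]:
            print(error["message"])  # human-readable description of the problem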
So again, this is not big data, not complicated data, but it vastly reduces the friction involved, and if we fix this part of the chain, it makes the job of people further along a lot more straightforward!
For example, we are exploring this in a working group we have set up around archaeological data. They are really quite excited about the difference this can make in the field, at the stage the data is becoming digitised.
But now there's another step in the process, which is CONTINUOUS validation, and the benefits are multiplied.
Software projects have long benefited from Continuous Integration services like Travis CI for making sure code is of a high quality.
With every update to a bit of code, tests are automatically run and a report is generated in the project's shared repository.
And developers can find and resolve errors quickly and reliably.
As with software, datasets are created, edited, and updated over time, and by different people.
And with continuous validation using goodtables, we do exactly the same: a set of tests is run on the data.
If "bad" data is used, the "build" fails and issues a report indicating what went wrong.
So it can be fixed and we can build stronger, better data.
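As a sketch of what that step can look like (the wiring to a CI service such as Travis CI is assumed rather than shown), a short script run on every commit is enough:

    # A sketch of a continuous-validation step, assuming `goodtables` is
    # installed and a datapackage.json sits in the repository. A CI service
    # would run this script on every update to the data.
    import sys
    from goodtables import validate

    report = validate("datapackage.json")  # checks each resource in the package
    if not report["valid"]:
        for table in report["tables"]:
            for error in table["errors"]:
                print(error["message"])
        sys.exit(1)  # a non-zero exit fails the "build", flagging the bad data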
This is not just for science or maths subjects where you will eventually go on and input the data into a complex analysis tool. Imagine you have a series of questionnaires on a Google Form. You can use goodtables to instantly check a survey, which could save you hours and hours of time reorganising your data.
Data Quality Dashboard
Something we built for the UK government, to compare the quality of openly published data across departments. You can tell just by looking at this that the issue is not the openness of the data, but its quality.
This can do several things.
It keeps organisations in check.
And it allows people to get better, by pointing to specific issues with the data.
We are also building a registry which will allow you to publish data from the command line, and which will support much of the other work we are doing.
ALL FREE OPEN SOURCE.
Our overall mission is to make it easier to develop tools and services for working with data and also to ensure greater interoperability between new and existing tools and services.
A KEY part of it is to speak to people and find out what they need to help make their lives easier working with data.
We now already support import/export for:
CKAN, BigQuery, AWS Redshift, SQL, …
We are building libraries in Python, Ruby, MATLAB, and R which will allow users to easily get data into a proper backend for further use.
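As one hedged example of what those libraries enable (the database URL and file paths here are invented), packaged data can be pushed into a SQL backend in a few lines by combining the Python datapackage library with pandas and SQLAlchemy:

    # A sketch of loading Data Package resources into SQL, combining the
    # Python `datapackage` library with pandas and SQLAlchemy. The database
    # URL and file paths are invented for illustration.
    import pandas as pd
    from sqlalchemy import create_engine
    from datapackage import Package

    package = Package("datapackage.json")
    engine = create_engine("sqlite:///research.db")

    for resource in package.resources:
        # Each resource becomes one table, typed according to its schema.
        df = pd.DataFrame(resource.read(keyed=True))
        df.to_sql(resource.name, engine, if_exists="replace", index=False)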
And what else?
You tell us.
In piloting our approach, we are placing a particular emphasis on supporting researchers in addressing their existing data needs across various scientific disciplines.
We are running targeted pilots to trial these tools and specifications on real data
Are you a researcher looking for better tooling to manage your data?
Are you working on research data and would like to work with us on issues for which data packages are suited?
Are you a developer and have an idea for something we can build together?
Talk to us. We have time and attention to give you (and funding!)