DaPaaS: Enabling Low-cost Open Data
Publishing and Reuse
@ Data Summit Brussels
March 5th, 2015
http://dapaas.eu/
Marin Dimitrov, Ontotext, Bulgaria
Amanda Smith, Open Data Institute, UK
Open Data Benefits
• Businesses can develop new ideas, services and applications;
improve decision making, cost savings
• Can increase government transparency and accountability, quality
of public services
• Citizens get better and timely access to public services
Source: McKinsey
http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information
Gartner:
By 2016, the use of "open data" will continue to
increase — but slowly, and predominantly limited to
Type A enterprises.
By 2017, over 60% of government open data
programs that do not effectively use open data
internally, will be scaled back or discontinued.
By 2020, enterprises and governments will fail to
protect 75% of sensitive data and will declassify and
grant broad / public access to it.
Source: Gartner
http://training.gsn.gov.tw/uploads/news/6.Gartner+ExP+Briefing_Open+Data_JUN+2014_v2.pdf
Lots of open datasets on the Web…
• A large number of open datasets have been published in recent years
• Various domains: cultural heritage, science, finance, statistics,
transport and smart cities, environment, …
• Various formats: tabular (e.g. CSV, XLS), HTML/XML, JSON, LOD,
Web APIs…
…but few actually used
• Few applications utilizing open and distributed datasets at present
• Challenges for data consumers
– Data quality issues
– Difficult or unreliable data access
– Licensing issues
• Challenges for data publishers
– Lack of expertise & resources: not easy to publish & maintain
high-quality data
– Unclear monetization & sustainability
Open Data Portal    Datasets     Applications
data.gov            ~ 110 000    ~ 350
publicdata.eu       ~ 50 000     ~ 80
data.gov.uk         ~ 20 000     ~ 350
data.norge.no       ~ 300        ~ 40
Open Data is mostly tabular data
– Records organized in silos of
collections
– Very few links within and/or
across collections
– Difficult to understand the nature
of the data
– Difficult to integrate / query
[Figure: Tabular datasets on publicdata.eu and data.gov.uk]
Linked Data is great for Open Data
• Linked Data is a great means to represent and integrate
disparate and heterogeneous open data sources
• How Linked Data can improve Open Data:
– Easier integration, free data from silos
– Seamless interlinking of data
– Better understanding of the data
– New ways to query and interact with data
• Challenges with using Linked Data
– Lack of tooling & expertise to publish high quality Linked Data
– Lack of resources to host LOD endpoints / unreliable data access
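The integration point above can be illustrated with a minimal sketch. It assumes two small datasets already expressed as (subject, predicate, object) triples; every URI and value below is made up for illustration and does not come from any real portal. Because both datasets use the same URI for the same city, merging them is just a set union.

```python
# Hypothetical namespace and values, for illustration only.
EX = "http://example.org/"

# Dataset A: one fact per city (invented numbers).
dataset_a = {
    (EX + "city/oslo", EX + "prop/libraries", "10"),
    (EX + "city/sofia", EX + "prop/libraries", "20"),
}

# Dataset B: a different fact about the same cities, published separately.
dataset_b = {
    (EX + "city/oslo", EX + "prop/parks", "3"),
    (EX + "city/sofia", EX + "prop/parks", "4"),
}


def describe(graph, subject):
    """Collect every predicate/object pair known about one subject."""
    return {p: o for s, p, o in graph if s == subject}


# Integration is a set union: shared URIs line the records up automatically.
merged = dataset_a | dataset_b
oslo = describe(merged, EX + "city/oslo")
```

With tabular files the same merge would need an agreed join key and matching column names; here the shared URI plays that role.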
DaPaaS: making Open (Linked) Data easier
to use
• A data hosting platform: to make it easy for
publishers to put data on the web
• A data portal: to help advertise data
availability
• Data transformation tools to make it easier
to publish large amounts of high quality data
• Open source tools with high-quality
documentation
Make Linked Data more
accessible to everyone!
Key enablers
• Grafter / Grafterizer (Graphical Tool & DSL)
• RDF database-as-a-service
• Open Data Portal + PLUQI
Grafter
• Grafter is a DSL and a suite of
tools for data transformation &
cleaning
• Primarily used for handling
data conversions from:
– tabular data formats to tabular
data formats
– tabular data formats to RDF
• “lazy” / stream processing, no
need to load the whole dataset
• Robust & efficient for large
scale processing
• Transformations can be
packaged as REST services
• Open Source (EPL)
– http://github.com/swirrl/grafter
– http://grafter.org/
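The “lazy” / stream processing idea can be sketched as follows. This is not Grafter code (Grafter is a separate DSL); it is a rough Python analogue using generators, and the function names and sample rows are assumptions made for illustration. Each stage pulls one row at a time from the previous one, so the whole dataset is never held in memory.

```python
import csv
import io
from itertools import islice


def read_rows(handle):
    """Yield one dict per CSV row without loading the whole file."""
    yield from csv.DictReader(handle)


def drop_blank(rows, column):
    """Skip rows whose value in `column` is empty or whitespace."""
    for row in rows:
        if row[column].strip():
            yield row


def normalise(rows, column):
    """Trim and title-case the value in `column`, leaving other cells as-is."""
    for row in rows:
        yield {**row, column: row[column].strip().title()}


# Invented sample input; a real pipeline would pass an open file handle.
SAMPLE = "city,datasets\n oslo ,300\n,12\nsofia,50\n"

# Nothing is read until the pipeline is consumed.
pipeline = normalise(drop_blank(read_rows(io.StringIO(SAMPLE)), "city"), "city")
first_two = list(islice(pipeline, 2))
```

Here `islice` stops after two cleaned rows, so any rows after them would never be read.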
Tabular data (spreadsheet)
to RDF Linked Data (graph)
1. Define a pipeline of tabular transformations for data cleaning and
transformation.
2. Create the graph fragments resulting in the generation of an RDF
graph.
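The two steps above can be sketched in plain Python. This is an illustrative analogue, not Grafter's actual API: the `example.org` namespace, the column names and the helper functions are all hypothetical, and the literal escaping is simplified (only backslashes and double quotes are handled).

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace


def clean(rows):
    """Step 1: tabular transformation (trim cells, skip rows with no id)."""
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["id"]:
            yield row


def literal(text):
    """Quote a string as a plain literal (simplified escaping)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def graph_fragment(row):
    """Step 2: turn one cleaned row into triples describing one resource."""
    subject = f"<{BASE}portal/{row['id']}>"
    yield (subject, f"<{BASE}def/name>", literal(row["name"]))
    yield (subject, f"<{BASE}def/datasets>", literal(row["datasets"]))


def to_ntriples(csv_text):
    """Run both steps and serialise each triple as one N-Triples line."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        " ".join(triple) + " ."
        for row in clean(rows)
        for triple in graph_fragment(row)
    ]


SAMPLE = "id,name,datasets\nno, data.norge.no ,300\n,missing id,0\n"
lines = to_ntriples(SAMPLE)
```

Keeping the cleaning step separate from the graph template means the same clean-up can feed a tabular output or a different RDF mapping.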
Grafterizer
• GUI tool for the Grafter suite
• Open Source (EPL)
– github.com/dapaas/grafterizer
Use Case: Transformation and Mapping to
RDF
• Import raw data
• Clean up and transform using Grafter / Grafterizer
• Define ontology mapping using Grafterizer
• Generate the RDF graph
[Diagram: Raw Data → Transform → Prepared Data → Map (ontology mapping, Ontology X) → Generate RDF → RDF Graph]
RDF database-as-a-service
• Enables live data services, instead of static datasets
– A new RDF database can be operational within seconds
• Automated backups, operations, maintenance
• Based on an enterprise-grade RDF database
– Linked Data Fragments servers to be deployed too
• Designed for scalability & availability, in the cloud
• Data import services (Grafter pipelines)
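As a rough illustration of what a live data service offers over a static file, the sketch below builds a SPARQL query request using the standard SPARQL protocol (an HTTP GET with a `query` parameter). The endpoint URL is hypothetical, and the slides do not specify the platform's actual API, so this is an assumption about how such a database would typically be queried.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; a real one would be issued by the hosting platform.
ENDPOINT = "http://example.org/repositories/demo"

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
# urllib.request.urlopen(request) would send it; it is not called here
# because the endpoint above is fictional.
```

A client can change the query text to ask a different question of the same data, which a static dump cannot offer without downloading and loading it first.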
Summary
• Open Data has big potential for governments,
enterprises and citizens
• Lots of open datasets available, but very few actually
used
• Linked Data is a promising technology for Open Data,
but still difficult to use for publishers and application
developers
• DaPaaS – enabling low-cost Open (Linked) Data
publishing and reuse
– Platform, portal, methodology, APIs
– Repeatable and scalable data transformations
– Scalable Linked Data hosting
http://dapaas.eu
@dapaasproject
dapaas-platform@googlegroups.com
Thank you!
