The document provides an overview of Linked (Open) Data, including RDF, RDFS, and SPARQL. It introduces the Linked Data principles of using URIs to identify things on the web and describing the relationships between them. It describes RDF's basic data model of subject-predicate-object triples for making statements about resources, the RDF serialization formats Turtle and JSON-LD, and the semantic query language SPARQL for querying RDF data.
2. About me
• Education
– Eng (2003), Technical University of Cluj-Napoca, Romania
– PhD (2008), University of Innsbruck, Austria
• Current positions
– Senior Research Scientist, SINTEF, Norway
– Associate Professor, University of Oslo, Norway
• Expertise and responsibilities
– Initiating, leading, and carrying out (research-intensive) projects on data management and service-oriented topics
– Involved in over 20 large-scale R&D projects at the European level during the past 12 years
3. “Technology for a better society”
Stakeholders:
• Public and private companies
• Data owners
• Data publishers
• Data integrators and aggregators
• Developers
Benefits:
• Improved data access
• Data-driven decision making
• Cost reduction when working with data
• Reduced dependency on generic infrastructure providers (e.g. generic cloud)
• Increased speed of making data available
• Increased reuse of data
Topics:
• Data cleaning
• Data transformation
• Data publication
• Data-as-a-Service
• Open data
• Linked data (RDF, SPARQL)
• DataGraft
5. Outline
Session #1: Open Data
• Open Data
• (Open) Data Quality Issues
• Linked (Open) Data
  – RDF, RDFS, SPARQL
Session #2: DataGraft
• Data-as-a-Service: DataGraft
• Examples and Demo
• Big Data and DataGraft
• Open Data in Malaysian context (by Dennis Gan)
• (Optional: Hands-on)
What is Open Data?
What is Linked Data?
Challenges in (Linked Open) Data?
How to publish Linked Open Data?
Linked Open Data Use Cases?
(Linked) Open Data and Big Data?
7. What can open data do for you?
(Source: The ODI, https://vimeo.com/110800848)
8. Open Data
…is changing the nature of business
…reflects a cultural shift to a more open society
9. Example: Personalized and Localized Urban Quality Index (PLUQI)
The index includes data from various domains:
• Daily life satisfaction: weather, transportation, community, …
• Healthcare level: number of doctors, hospitals, suicide statistics, …
• Safety and security: number of police stations, fire stations, crimes per capita, …
• Financial satisfaction: prices, incomes, housing, savings, debt, insurance, pension, …
• Level of opportunity: jobs, unemployment, education, re-education, …
• Environmental needs and efficiency: green space, air quality, …
10. PLUQI – potential usage
• Place recommendation for travel agencies or travelers
• Policy analysis and optimization for (local) government
• Understanding citizens’ voices and demands regarding environmental conservation
• Commercial impact analysis for retailers and franchises
• Location recommendation and understanding of local issues for real estate
• Risk analysis and management for insurance and financial companies
• Local marketing and sales force optimization for marketers
11. Open Data
• Businesses can develop new ideas, services, and applications; improve decision making; and save costs
• Governments can increase transparency, accountability, and the quality of public services
• Citizens get better and more timely access to public services
Source: McKinsey
http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information
Gartner:
• By 2016, the use of "open data" will continue to increase — but slowly, and predominantly limited to Type A enterprises.
• By 2017, over 60% of government open data programs that do not effectively use open data internally will be scaled back or discontinued.
• By 2020, enterprises and governments will fail to protect 75% of sensitive data and will declassify and grant broad/public access to it.
Source: Gartner
http://training.gsn.gov.tw/uploads/news/6.Gartner+ExP+Briefing_Open+Data_JUN+2014_v2.pdf
12. Lots of open datasets on the Web…
• A large number of datasets have been published as open data in recent years
• Many kinds of data: cultural, science, finance, statistics, transport, environment, …
• Popular formats: tabular (e.g. CSV, XLS), HTML, XML, JSON, …
13. …but few actually used
• Few applications utilizing open and distributed datasets at present
• Challenges for data consumers
  – Data quality issues
  – Difficult or unreliable data access
  – Licensing issues
• Challenges for data publishers
  – Lack of expertise and resources: not easy to publish and maintain high-quality data
  – Unclear monetization and sustainability

Open Data Portal | Datasets  | Applications
data.gov         | ~ 200 000 | ~ 80
publicdata.eu    | ~ 48 000  | ~ 85
data.gov.uk      | ~ 31 000  | ~ 390
data.norge.no    | ~ 620     | ~ 60
data.gov.my      | ~ 1065    | ~ 10
14. Lots of datasets are in tabular format
– Records organized in silos of collections
– Very few links within and/or across collections
– Difficult to understand the nature of the data
– Difficult to integrate / query
europeandataportal.eu
15. Tim Berners-Lee's 5-star open data rating system
1 star: openly available on the web as a document
2 stars: available in a structured format (XLS)
3 stars: available in non-proprietary formats (CSV)
4 stars: uses URIs to denote things
5 stars: linked to other data to provide context
16. 1-Star Benefits
Consumers:
• Ability to look at, print, store, modify, and share data
• Ability to use data as input to a system
Publishers:
• Easily publish data
• Ensure transparency
5-Star Benefits
Consumers:
• Discover more (related) data while consuming the data
• Directly learn about the data schema
? Have to deal with broken data links
? Trust issues
Publishers:
• Make data discoverable
• Increase the value of data
• Gain the same benefits from the links as the consumers
? Need to invest resources to link data
? May need to clean data
17. Tabular Data vs. Graph Data
Tabular data:
• Lots of open datasets are in tabular format
• CSV, Excel, TSV, etc.
• Records organized in silos of collections
• Very few links within and/or across collections
• Difficult to understand the nature of the data
• Difficult to integrate / query
Graph data (based on Linked Data):
• Method for publishing data on the Web
• Self-describing data and relations
• Interlinking
• Accessed using semantic queries
• Open standards by W3C
  − Data format: RDF
  − Knowledge representation: RDFS/OWL
  − Query language: SPARQL
http://www.w3.org/standards/semanticweb/data
europeandataportal.eu
20. Tabular data
Tabular data is data that is structured into rows and columns.
Correspondence with reality:
1) Each row represents an entity
2) Each column header represents an attribute of the entities
3) Each cell holds the value of an attribute for an entity
4) Each table represents a collection of entities
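This row/column correspondence can be sketched with Python's csv module; the tiny dataset below is made up for illustration:

```python
import csv
import io

# A made-up tabular dataset: each row is an entity,
# each column header is an attribute of that entity.
raw = """id,name,country
1,Alice,Norway
2,Tim,UK
"""

# DictReader maps each row to {column header -> column value}
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"])  # Alice
print(len(rows))        # 2 entities in the collection
```
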
21. Tabular data files
Tabular data can be stored in different formats:
• Tabular text formats (pure tabular data): delimiter-separated values
  – CSV – comma-separated values
  – Less common: TSV – tab-separated values, colon-separated values, etc.
• Spreadsheet formats (metadata about the document, tabular data, formulas):
  – XLS (Excel spreadsheet)
  – XLSX (Excel 2007 format)
22. Tabular data quality issues
When a dataset does not satisfy specified data quality criteria, it contains data quality issues. To achieve higher data quality, these issues should be detected and resolved.
34. How to resolve data quality issues?
Workflow:
1) Identify data quality issues
2) Define transformation functions to resolve them
3) Execute the transformations and verify the result
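A minimal sketch of this workflow using pandas (a common choice for tabular cleaning; the dataset and quality rules are made up):

```python
import pandas as pd

# Made-up dataset with two quality issues: a duplicate row
# and an illegal (non-numeric) population value.
df = pd.DataFrame({
    "city":       ["Oslo", "Bergen", "Bergen", "Trondheim"],
    "population": ["634293", "271949", "271949", "n/a"],
})

# 1) Identify data quality issues
duplicates = df.duplicated()
illegal = ~df["population"].str.isdigit()

# 2) Define and 3) execute transformations to resolve them
clean = df.drop_duplicates()
clean = clean[clean["population"].str.isdigit()].copy()
clean["population"] = clean["population"].astype(int)

# Verify the result
assert not clean.duplicated().any()
assert clean["population"].dtype.kind == "i"
```
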
35. Transformation function types
By scope:
• Functions on rows
• Functions on columns
• Functions transforming the entire dataset
By caused effect:
• Data reordering functions
• Data extraction functions
• Data manipulation functions
• Data enrichment functions
36. Transformation functions

Scope | Name | Description | Effect
Rows | Add Row | Create a new record in a dataset | Data enrichment
Rows | Take/Drop Rows | Extract only relevant rows by index | Data extraction. Resolves: “Rows describing entities not belonging to a collection”
Rows | Shift Row | Change a row's position inside a dataset | Data reordering; simplifies quality issue detection
Rows | Filter Rows | Extract only relevant rows by condition | Data extraction. Resolves: “Rows describing entities not belonging to a collection”
Entire dataset | Remove Duplicates | Remove similar rows | Data extraction. Resolves: “Duplicate rows”
Entire dataset | Sort Dataset | Sort the dataset by given column names in a given order | Data reordering; simplifies quality issue detection
Entire dataset | Reshape Dataset (Melt) | Move columns to rows | Data manipulation. Resolves: “Column headers containing attribute values”
Entire dataset | Reshape Dataset (Cast) | Move rows to columns by categorizing and aggregating | Data enrichment; simplifies quality issue detection
Entire dataset | Group and Aggregate | Group values by one or more columns and perform aggregation | Data enrichment; simplifies quality issue detection
Columns | Add Column | Add a column with a manually specified value | Data enrichment
Columns | Derive Column | Add a column with values computed from other columns | Data enrichment
Columns | Take/Drop Columns | Take or drop selected column(s) | Data extraction. Resolves: “Columns not related to the model”
Columns | Shift Column | Arbitrarily change a column's position | Data reordering; simplifies quality issue detection
Columns | Merge Columns | Merge columns using a custom separator | Data manipulation. Resolves: “Single value is split across multiple columns”
Columns | Split Column | Split a column using a custom separator | Data manipulation. Resolves: “Multiple values stored in one column”
Columns | Rename Columns | Change column headers | Data manipulation. Resolves: “Incorrect column headers”
Columns | Map Columns | Apply a function to all values in a column | Data manipulation. Resolves: “Illegal values”, “Missing values”, “Inconsistent values”
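As one example, Reshape (Melt) resolves the “column headers containing attribute values” issue. A pandas sketch (data made up for illustration):

```python
import pandas as pd

# Made-up wide table: the year headers are really attribute values.
wide = pd.DataFrame({
    "county": ["Oslo", "Bergen"],
    "2014":   [10, 7],
    "2015":   [12, 9],
})

# Melt moves the year columns into rows of (county, year, count)
long = wide.melt(id_vars="county", var_name="year", value_name="count")
print(long)
```
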
37. Tabular data cleaning tools
• CLI tools (e.g. Unix awk, csvkit, CSVfix) – lack a convenient user interface
• Programming languages and libraries for data analysis (R, agate for Python) – require programming knowledge
• Spreadsheet software (Microsoft Excel, LibreOffice Calc, Google Spreadsheets) – not initially created for data cleaning; hard to debug; code is mixed up with data
• Frameworks/tools designed for interactive data cleaning and transformation in an ETL process
38. Example: vehicle registration data
https://www.ssb.no/statistikkbanken/selectvarval/Define.asp?subjectcode=&ProductId=&MainTable=RegKjoretoy&nvl=&PLanguage=1&nyTmpVar=true&CMSSubjectArea=transport-og-reiseliv&KortNavnWeb=bilreg&StatVariant=&checked=true
39. Example: vehicle registration data (continued)
* Data obtained from StatBank Norway: https://www.ssb.no/en/statistikkbanken
40. Map columns – applying a function to all values in a column
Effect: data manipulation
Resolves anomalies: illegal values, missing values, inconsistent values
Required parameters (for each column to be mapped):
1) Name of the column to manipulate
2) Name of the function to apply
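A pandas sketch of the Map Columns operation; the column and the normalizing function are made up:

```python
import pandas as pd

# Made-up column with illegal, missing, and inconsistent values.
df = pd.DataFrame({"fuel": ["Petrol", "petrol ", "DIESEL", None]})

def normalize(value):
    # Hypothetical mapping function: fill missing values and
    # normalize inconsistent capitalisation/whitespace.
    if value is None:
        return "unknown"
    return value.strip().lower()

# Parameters: column name ("fuel") and function to apply (normalize)
df["fuel"] = df["fuel"].map(normalize)
print(df["fuel"].tolist())  # ['petrol', 'petrol', 'diesel', 'unknown']
```
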
43. Derive column – add a column with values computed from others
Effect: data enrichment
Adds new information to the data
Required parameters:
1) Name of the derived column
2) Column(s) to derive from
3) Function to derive with
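A pandas sketch of Derive Column; the counts are made up:

```python
import pandas as pd

# Made-up vehicle counts per county.
df = pd.DataFrame({"cars": [100, 250], "vans": [20, 30]})

# Parameters: derived column name ("total"), source columns
# ("cars", "vans"), and the derivation function (sum of the two)
df["total"] = df["cars"] + df["vans"]
print(df["total"].tolist())  # [120, 280]
```
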
46. Cast dataset – move rows to columns by categorizing and aggregating
Effect: data enrichment
Adds new information to the data; simplifies anomaly detection
Required parameters:
1) Column name for the variable (what to categorize and put into headers)
2) Column name for the value (what to aggregate)
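A pandas sketch of Cast using `pivot_table`; the data and the choice of sum as the aggregation are made up:

```python
import pandas as pd

# Made-up long-format data.
long = pd.DataFrame({
    "county": ["Oslo", "Oslo", "Bergen", "Bergen"],
    "year":   [2014, 2015, 2014, 2015],
    "count":  [10, 12, 7, 9],
})

# Parameters: variable column ("year" becomes the headers),
# value column ("count"), and an aggregation to perform ("sum")
wide = long.pivot_table(index="county", columns="year",
                        values="count", aggfunc="sum")
print(wide)
```
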
53. Linked Data
• Method for publishing data on the Web
• Self-describing data and relations
• Interlinking
• Accessed using semantic queries
http://www.w3.org/standards/semanticweb/data
54. Linked open data cloud
By Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak - http://lod-cloud.net/, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36956792
55. Linked Data principles
• Every thing is represented by a URI
• URIs of things can be dereferenced
• Things are linked to other things by relating their URIs
56. Linked Data technology
• Data format: RDF
• Knowledge representation: RDFS/OWL
• Query language: SPARQL
• Linking medium: HTTP
59. Resource Description Framework (RDF) Basics
• RDF makes statements about resources (entities)
  o Triple data model: subject -> predicate -> object (Alice's age is 34)
• Subjects and objects:
  o Resources (URIs of entities) – can have properties related to them (http://my-domain.com/Alice)
  o Literals – constant values ("female", "3.14159"); cannot be subjects
  o Blank nodes – used to specify composite properties (e.g., an address composed of a country, city, street name, house number, zip code, etc.)
• Relationships (a.k.a. predicates) – relate one subject to one object
61. RDF serialisation formats (continued)
• RDFa (for HTML and XML embedding)

<body prefix="foaf: http://xmlns.com/foaf/0.1/ schema: http://schema.org/ dcterms: http://purl.org/dc/terms/">
  <div resource="http://example.org/bob#me" typeof="foaf:Person">
    <p>Bob knows <a property="foaf:knows" href="http://example.org/alice#me">Alice</a>
       and was born on the <time property="schema:birthDate" datatype="xsd:date">1990-07-04</time>.</p>
    <p>Bob is interested in <span property="foaf:topic_interest"
       resource="http://www.wikidata.org/entity/Q12418">the Mona Lisa</span>.</p>
  </div>
  <div resource="http://www.wikidata.org/entity/Q12418">
    <p>The <span property="dcterms:title">Mona Lisa</span> was painted by
       <a property="dcterms:creator" href="http://dbpedia.org/resource/Leonardo_da_Vinci">Leonardo da Vinci</a>
       and is the subject of the video
       <a href="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">'La Joconde à Washington'</a>.</p>
  </div>
  <div resource="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
    <link property="dcterms:subject" href="http://www.wikidata.org/entity/Q12418"/>
  </div>
</body>
63. RDF Schema (RDFS)
• Basic capabilities for describing RDF vocabularies
• Includes concepts to describe:
  o classes, class hierarchies (sub-classes), and instances (typing)
  o non-standard literal data types
  o property hierarchies (sub-properties)
  o predicate domain and range
  o utility properties (labels, comments, additional information about things, definitions of resources)
  o …
67. SPARQL querying – query
Question: What are the nicknames of people that Alice knows?
Query:

PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname
}

Graph pattern: a:Alice -foaf:knows-> ?someone -foaf:nick-> ?nickname
70. Data integration using Linked Data: using URIs
Example: relational DB or spreadsheet – dataset about scientific publications:

ID | Name  | Home page
1  | Alice | http://alice.org/
2  | Tim   | https://www.w3.org/People/Berners-Lee/

ID author | ISBN             | Publication topic
1         | 978-3-16-14410-0 | "On the frictional coefficient of bananas"
1         | 534-1-22-66975-1 | "Do woodpeckers get headaches?"
2         | 1-933019-33-6    | "The Semantic Web"
71. Data integration using Linked Data: using URIs (continued)
Graph representation of the new dataset:
• a:Alice -foaf:publications-> http://…/978-3-16-148410-0 -foaf:topic-> "On the frictional coefficient of bananas"
• a:Alice -foaf:publications-> http://…/534-1-22-663975-1 -foaf:topic-> "Do woodpeckers get headaches?"
• t:Tim -foaf:publications-> http://…/1-933019-33-6 -foaf:topic-> "The Semantic Web"
75. Linked Data is great for Open Data
• Linked Data is a great means to represent data
  – Semantics are part of the data
  – Naturally linked to other data
  – Query language
• How Linked Data can improve Open Data:
  – Easier integration; frees data from silos
  – Seamless interlinking of data
  – Helps understand the data
  – New ways to query and interact with data
76. … but it has been ignored by the mainstream
• Difficult to make it accessible to people
  – Publishers
  – Developers
  – Data workers
• Challenges with using Linked Data
  – Lack of tooling and expertise to publish high-quality Linked Data
  – Lack of resources to host LOD endpoints / unreliable data access
• DataGraft: packaging Linked Data to make it more approachable to the open data community
78. “Data is the new oil”
…but many of us just need gasoline
Data-as-a-Service
…is the new filling station
79. Data-as-a-Service
• Outsourcing of various data operations to the cloud
• Eliminates
  – upfront costs for data infrastructure
  – ongoing investment of time and resources in managing the data infrastructure
• A complete package for
  – transformation of raw data into meaningful data assets
  – reliable delivery of data assets
80. DataGraft was developed to allow data workers to manage their data in a simple, effective, and efficient way:
powerful data transformation and reliable data access capabilities.
81. Data Transformation and RDF Publication Process
• Interactive design of transformations?
• Repeatable transformations?
• Reuse/share transformations (user-based access)?
• Cloud-based deployment of transformations?
• Self-serviced process?
• Data and Transformation as-a-Service?
[Diagram: Raw Data → (Transform) → Prepared Data → (Generate RDF, via ontology mapping) → RDF Graph → RDF Triple Store]
102.
Data records (rows):
• Add row
• Take row(s)
• Drop row(s)
• Shift row
• Filter rows (grep)
• Remove duplicate rows
Entire dataset:
• Sort
• Reshape dataset
• Group (categorize) and aggregate
Columns:
• Add column(s)
• Take column(s)
• Drop column(s)
• Move column
• Merge columns
• Split column
• Rename column(s)
• Apply function to all values in a column
108. Data pages and federated querying
Example question: What is the population of locations, and the total number of persons employed in “Human health and social work activities”?
114. DataGraft key feature: Flexible management and sharing of data and transformations
• Interactively build, modify, and share data transformations
• Share transformations privately or publicly
• Fork, reuse, and extend transformations built by other professionals from DataGraft’s transformations catalog
• Reuse transformations to repeatably clean and transform spreadsheet data
• Programmatically access transformations and the transformation catalogue
115. Reuse of transformations in environmental data publishing
TRAGSA Pilot:
• Number of transformations: 42 (created via reuse: 25)
• Number of triples: ~ 7.7M
ARPA Pilot:
• Number of transformations: 5 (created via reuse: 2)
• Number of triples: ~ 14K
Forking/reusing transformations helped us spend less time on creating new transformations.
116. DataGraft key feature: Reliable data hosting and querying services
• Host data on DataGraft’s reliable, cloud-based semantic graph database
• Share data privately or publicly
• Query data through your own SPARQL endpoint
• Programmatically access the data catalogue
• Operations & maintenance performed on behalf of users
120. The context: Statsbygg
• A public sector administration company
• The Norwegian government's key advisor in construction and property affairs
• Building commissioner, property manager, and property developer
• Interest: exploit/share property data in novel ways, for efficiency and sustainability of the property included in the government's civil estate
Example: reporting state-owned real estate properties in Norway
121. Example: Reporting state-owned real estate properties in Norway (cont’d)
Report (before):
• A hard copy of 314 pages and a PDF file
• 6 person-months
• Data collection with spreadsheets
• Quality assurance through e-mail and phone correspondence
Pains:
• Time consuming
• Poor data quality
• Static report without live updating
Reporting service (after):
• Live service
• Efficient sharing of data
• Simplified integration with external datasets
• Live updating
• Reliable access
• …
3rd party services:
• Risk and vulnerability analysis, e.g. buildings affected by flooding
• Analysis of leasing prices
123. Demo Scenario
• Interactively create tabular data transformations
• Reuse/extend data transformations (incl. data annotations)
• RDF data publication and querying
• Integrating and visualising data from different sources
• (Using 3rd party tools with DataGraft)
126. Benefits of DataGraft in use cases
• Simplified data publishing process
• Integration with external data sources using established web standards
• Data that was not publicly available is now published (e.g. air quality data in Oslo)
• Time-efficient publishing
• Repeatable data transformation process
127. DataGraft and Big Data
• Desired features:
  – real-time interactivity
  – batch transformation capability for large datasets
We are developing a hybrid solution that supports both batch and real-time processing.
129. DataGraft – targeted impacts
• Reduction in costs for organisations that lack sufficient expertise and resources to make their data available
• Reduced dependency of data owners on generic cloud platforms to build, deploy, and maintain their linked data from scratch
• Increase in the speed of publishing new datasets and updating existing datasets
• Reduction in the cost and complexity of developing applications that use data
• Increase in the reuse of data, by providing reliable access to numerous datasets hosted on DataGraft.net
130. Example: the benefit of DataGraft in PLUQI
Pipeline (before and after, with DataGraft): datasets gathering → data transformation → data provisioning/access → implementing the app
Before: gathering enough good datasets; designing/implementing
Benefits with DataGraft:
1. 23% development cost reduction
  • Reduced cost of implementing transformations
  • Simpler process integration
2. Able to focus on service quality
131. DataGraft in numbers (as of end of January 2016)
• 238 registered users
• 607 registered data transformations (208 public)
• 1828 uploaded files
• 192 public data pages
132. DataGraft in the wild
• Investigating crime data in small geographies
• Used DataGraft to transform data and publish RDF
http://benproctor.co.uk/investigating-crime-data-at-small-geographies/
133. Data Science and DataGraft
Greater Data Science:
1. Data Exploration and Preparation
2. Data Representation and Transformation
3. Computing with Data
4. Data Visualization and Presentation
5. Data Modeling
6. Science about Data Science
“50 years of Data Science” by David Donoho
http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
136. Summary
• DataGraft – an emerging Data-as-a-Service solution for making (linked) data more accessible
  – Platform, portal, methodology, APIs
  – Online service, functional and documented
  – Validated through several use cases
• Key features:
  – Support for sharable/repeatable/reusable data transformations
  – Reliable RDF Database-as-a-Service