The document provides an overview of Linked (Open) Data, including RDF, RDFS, and SPARQL. It introduces the Linked Data principles of using URIs to identify things on the web and describing the relationships between them. It describes RDF's basic data model of subject-predicate-object triples for making statements about resources, the RDF serialization formats Turtle and JSON-LD, and the semantic query language SPARQL for querying RDF data.
2. About me
• Education
– Eng (2003), Technical University of Cluj-Napoca, Romania
– PhD (2008), University of Innsbruck, Austria
• Current positions
– Senior Research Scientist, SINTEF, Norway
– Associate Professor, University of Oslo, Norway
• Expertise and responsibilities
– Initiating, leading, and carrying out (research-intensive) projects on data management and service-oriented topics
– Involved in over 20 large-scale R&D projects at the European level during the past 12 years
3. “Technology for a better society”
Stakeholders:
• Public and private companies
• Data owners
• Data publishers
• Data integrators and aggregators
• Developers
Benefits:
• Improved data access
• Data-driven decision making
• Cost reduction when working with data
• Reduced dependency on generic infrastructure providers (e.g. generic cloud)
• Increased speed of making data available
• Increased reuse of data
Topics:
• Data cleaning
• Data transformation
• Data publication
• Data-as-a-Service
• Open data
• Linked data (RDF, SPARQL)
• DataGraft
5. Outline
Session #1: Open Data
• Open Data
• (Open) Data Quality Issues
• Linked (Open) Data
  – RDF, RDFS, SPARQL
Session #2: DataGraft
• Data-as-a-Service: DataGraft
• Examples and Demo
• Big Data and DataGraft
• Open Data in Malaysian context (by Dennis Gan)
• (Optional: Hands-on)
What is Open Data?
What is Linked Data?
Challenges in (Linked Open) Data?
How to publish Linked Open Data?
Linked Open Data Use Cases?
(Linked) Open Data and Big Data?
7. What can open data do for you?
(Source: The ODI, https://vimeo.com/110800848)
8. Open Data
…is changing the nature of business
…reflects a cultural shift to a more open society
9. Example: Personalized and Localized Urban Quality Index (PLUQI)
The index includes data from various domains:
• Daily life satisfaction: weather, transportation, community, …
• Healthcare level: number of doctors, hospitals, suicide statistics, …
• Safety and security: number of police stations, fire stations, crimes per capita, …
• Financial satisfaction: prices, incomes, housing, savings, debt, insurance, pension, …
• Level of opportunity: jobs, unemployment, education, re-education, …
• Environmental needs and efficiency: green space, air quality, …
10. PLUQI – potential usage
• Place recommendation for travel agencies or travelers
• Policy analysis and optimization for (local) government
• Understanding citizens’ voices and demands regarding environmental conservation
• Commercial impact analysis for retailers and franchises
• Location recommendation and understanding of local issues for real estate
• Risk analysis and management for insurance and financial companies
• Local marketing and sales force optimization for marketers
11. Open Data
• Businesses can develop new ideas, services, and applications; improve decision making; and save costs
• Governments can increase transparency, accountability, and the quality of public services
• Citizens get better and more timely access to public services
Source: McKinsey
http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information
Gartner:
• By 2016, the use of "open data" will continue to increase — but slowly, and predominantly limited to Type A enterprises.
• By 2017, over 60% of government open data programs that do not effectively use open data internally will be scaled back or discontinued.
• By 2020, enterprises and governments will fail to protect 75% of sensitive data and will declassify and grant broad/public access to it.
Source: Gartner
http://training.gsn.gov.tw/uploads/news/6.Gartner+ExP+Briefing_Open+Data_JUN+2014_v2.pdf
12. Lots of open datasets on the Web…
• A large number of datasets have been published as open data in recent years
• Many kinds of data: cultural, science, finance, statistics, transport, environment, …
• Popular formats: tabular (e.g. CSV, XLS), HTML, XML, JSON, …
13. …but few actually used
• Few applications utilizing open and distributed datasets at present
• Challenges for data consumers
  – Data quality issues
  – Difficult or unreliable data access
  – Licensing issues
• Challenges for data publishers
  – Lack of expertise and resources: not easy to publish and maintain high-quality data
  – Unclear monetization and sustainability

Open Data Portal | Datasets  | Applications
data.gov         | ~ 200 000 | ~ 80
publicdata.eu    | ~ 48 000  | ~ 85
data.gov.uk      | ~ 31 000  | ~ 390
data.norge.no    | ~ 620     | ~ 60
data.gov.my      | ~ 1065    | ~ 10
14. Lots of datasets are in tabular format
– Records organized in silos of collections
– Very few links within and/or across collections
– Difficult to understand the nature of the data
– Difficult to integrate / query
europeandataportal.eu
15. Tim Berners-Lee's 5-star open data rating system
1 star: openly available on the web as a document
2 stars: available in a structured format (XLS)
3 stars: available in non-proprietary formats (CSV)
4 stars: uses URIs to denote things
5 stars: linked to other data to provide context
16. 1-Star Benefits
Consumers:
• Ability to look at, print, store, modify, and share data
• Ability to use data as input to a system
Publishers:
• Easily publish data
• Ensure transparency
5-Star Benefits
Consumers:
• Discover more (related) data while consuming the data
• Directly learn about the data schema
? Have to deal with broken data links
? Trust issues
Publishers:
• Make data discoverable
• Increase the value of data
• Gain the same benefits from the links as the consumers
? Need to invest resources to link data
? May need to clean data
17. Tabular Data vs. Graph Data
Tabular data:
• Lots of open datasets are in tabular format
• CSV, Excel, TSV, etc.
• Records organized in silos of collections
• Very few links within and/or across collections
• Difficult to understand the nature of the data
• Difficult to integrate / query
Graph data (based on Linked Data):
• Method for publishing data on the Web
• Self-describing data and relations
• Interlinking
• Accessed using semantic queries
• Open standards by W3C
  − Data format: RDF
  − Knowledge representation: RDFS/OWL
  − Query language: SPARQL
http://www.w3.org/standards/semanticweb/data
europeandataportal.eu
20. Tabular data
Tabular data is data that is structured into rows and columns.
Correspondence with reality:
1) Each row represents an entity
2) Each column header represents an attribute of the entities
3) Each cell holds the value of an attribute for an entity
4) Each table represents a collection of entities
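This row/column correspondence can be sketched with Python's csv module; the tiny dataset below is made up for illustration:

```python
import csv
import io

# A made-up tabular dataset: each row is an entity,
# each column header is an attribute of that entity.
raw = """id,name,country
1,Alice,Norway
2,Tim,UK
"""

# DictReader maps each row to {column header -> column value}
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"])  # Alice
print(len(rows))        # 2 entities in the collection
```
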
21. Tabular data files
Tabular data can be stored in different formats:
• Tabular text formats (pure tabular data): delimiter-separated values
  – CSV – comma-separated values
  – Less common: TSV – tab-separated values, colon-separated values, etc.
• Spreadsheet formats (metadata about the document, tabular data, formulas):
  – XLS (Excel spreadsheet)
  – XLSX (Excel 2007 format)
22. Tabular data quality issues
When a dataset does not satisfy specified data quality criteria, it contains data quality issues. To achieve higher data quality, these issues should be detected and resolved.
34. How to resolve data quality issues?
Workflow:
1) Identify data quality issues
2) Define transformation functions to resolve them
3) Execute the transformations and verify the result
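A minimal sketch of this workflow using pandas (a common choice for tabular cleaning; the dataset and quality rules are made up):

```python
import pandas as pd

# Made-up dataset with two quality issues: a duplicate row
# and an illegal (non-numeric) population value.
df = pd.DataFrame({
    "city":       ["Oslo", "Bergen", "Bergen", "Trondheim"],
    "population": ["634293", "271949", "271949", "n/a"],
})

# 1) Identify data quality issues
duplicates = df.duplicated()
illegal = ~df["population"].str.isdigit()

# 2) Define and 3) execute transformations to resolve them
clean = df.drop_duplicates()
clean = clean[clean["population"].str.isdigit()].copy()
clean["population"] = clean["population"].astype(int)

# Verify the result
assert not clean.duplicated().any()
assert clean["population"].dtype.kind == "i"
```
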
35. Transformation function types
By scope:
• Functions on rows
• Functions on columns
• Functions transforming the entire dataset
By caused effect:
• Data reordering functions
• Data extraction functions
• Data manipulation functions
• Data enrichment functions
36. Transformation functions

Scope | Name | Description | Effect
Rows | Add Row | Create a new record in a dataset | Data enrichment
Rows | Take/Drop Rows | Extract only relevant rows by index | Data extraction. Resolves: “Rows describing entities not belonging to a collection”
Rows | Shift Row | Change a row's position inside a dataset | Data reordering; simplifies quality issue detection
Rows | Filter Rows | Extract only relevant rows by condition | Data extraction. Resolves: “Rows describing entities not belonging to a collection”
Entire dataset | Remove Duplicates | Remove similar rows | Data extraction. Resolves: “Duplicate rows”
Entire dataset | Sort Dataset | Sort the dataset by given column names in a given order | Data reordering; simplifies quality issue detection
Entire dataset | Reshape Dataset (Melt) | Move columns to rows | Data manipulation. Resolves: “Column headers containing attribute values”
Entire dataset | Reshape Dataset (Cast) | Move rows to columns by categorizing and aggregating | Data enrichment; simplifies quality issue detection
Entire dataset | Group and Aggregate | Group values by one or more columns and perform aggregation | Data enrichment; simplifies quality issue detection
Columns | Add Column | Add a column with a manually specified value | Data enrichment
Columns | Derive Column | Add a column with values computed from other columns | Data enrichment
Columns | Take/Drop Columns | Take or drop selected column(s) | Data extraction. Resolves: “Columns not related to the model”
Columns | Shift Column | Arbitrarily change a column's position | Data reordering; simplifies quality issue detection
Columns | Merge Columns | Merge columns using a custom separator | Data manipulation. Resolves: “Single value is split across multiple columns”
Columns | Split Column | Split a column using a custom separator | Data manipulation. Resolves: “Multiple values stored in one column”
Columns | Rename Columns | Change column headers | Data manipulation. Resolves: “Incorrect column headers”
Columns | Map Columns | Apply a function to all values in a column | Data manipulation. Resolves: “Illegal values”, “Missing values”, “Inconsistent values”
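As one example, Reshape (Melt) resolves the “column headers containing attribute values” issue. A pandas sketch (data made up for illustration):

```python
import pandas as pd

# Made-up wide table: the year headers are really attribute values.
wide = pd.DataFrame({
    "county": ["Oslo", "Bergen"],
    "2014":   [10, 7],
    "2015":   [12, 9],
})

# Melt moves the year columns into rows of (county, year, count)
long = wide.melt(id_vars="county", var_name="year", value_name="count")
print(long)
```
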
37. Tabular data cleaning tools
• CLI tools (e.g. Unix awk, csvkit, CSVfix) – lack a convenient user interface
• Programming languages and libraries for data analysis (R, agate for Python) – require programming knowledge
• Spreadsheet software (Microsoft Excel, LibreOffice Calc, Google Spreadsheets) – not initially created for data cleaning; hard to debug; code is mixed up with data
• Frameworks/tools designed for interactive data cleaning and transformation in an ETL process
38. Example: vehicle registration data
https://www.ssb.no/statistikkbanken/selectvarval/Define.asp?subjectcode=&ProductId=&MainTable=RegKjoretoy&nvl=&PLanguage=1&nyTmpVar=true&CMSSubjectArea=transport-og-reiseliv&KortNavnWeb=bilreg&StatVariant=&checked=true
39. Example: vehicle registration data (continued)
* Data obtained from StatBank Norway: https://www.ssb.no/en/statistikkbanken
40. Map columns – applying a function to all values in a column
Effect: data manipulation
Resolves anomalies: illegal values, missing values, inconsistent values
Required parameters (for each column to be mapped):
1) Name of the column to manipulate
2) Name of the function to apply
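A pandas sketch of the Map Columns operation; the column and the normalizing function are made up:

```python
import pandas as pd

# Made-up column with illegal, missing, and inconsistent values.
df = pd.DataFrame({"fuel": ["Petrol", "petrol ", "DIESEL", None]})

def normalize(value):
    # Hypothetical mapping function: fill missing values and
    # normalize inconsistent capitalisation/whitespace.
    if value is None:
        return "unknown"
    return value.strip().lower()

# Parameters: column name ("fuel") and function to apply (normalize)
df["fuel"] = df["fuel"].map(normalize)
print(df["fuel"].tolist())  # ['petrol', 'petrol', 'diesel', 'unknown']
```
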
43. Derive column – add a column with values computed from others
Effect: data enrichment
Adds new information to the data
Required parameters:
1) Name of the derived column
2) Column(s) to derive from
3) Function to derive with
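A pandas sketch of Derive Column; the counts are made up:

```python
import pandas as pd

# Made-up vehicle counts per county.
df = pd.DataFrame({"cars": [100, 250], "vans": [20, 30]})

# Parameters: derived column name ("total"), source columns
# ("cars", "vans"), and the derivation function (sum of the two)
df["total"] = df["cars"] + df["vans"]
print(df["total"].tolist())  # [120, 280]
```
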
46. Cast dataset – move rows to columns by categorizing and aggregating
Effect: data enrichment
Adds new information to the data; simplifies anomaly detection
Required parameters:
1) Column name for the variable (what to categorize and put into headers)
2) Column name for the value (what to aggregate)
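A pandas sketch of Cast using `pivot_table`; the data and the choice of sum as the aggregation are made up:

```python
import pandas as pd

# Made-up long-format data.
long = pd.DataFrame({
    "county": ["Oslo", "Oslo", "Bergen", "Bergen"],
    "year":   [2014, 2015, 2014, 2015],
    "count":  [10, 12, 7, 9],
})

# Parameters: variable column ("year" becomes the headers),
# value column ("count"), and an aggregation to perform ("sum")
wide = long.pivot_table(index="county", columns="year",
                        values="count", aggfunc="sum")
print(wide)
```
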
53. Linked Data
• Method for publishing data on the Web
• Self-describing data and relations
• Interlinking
• Accessed using semantic queries
http://www.w3.org/standards/semanticweb/data
54. Linked open data cloud
By Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak - http://lod-cloud.net/, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36956792
55. Linked Data principles
• Every thing is represented by a URI
• URIs of things can be dereferenced
• Things are linked to other things by relating their URIs
56. Linked Data technology
• Data format: RDF
• Knowledge representation: RDFS/OWL
• Query language: SPARQL
• Linking medium: HTTP
59. Resource Description Framework (RDF) Basics
• RDF makes statements about resources (entities)
  o Triple data model: subject -> predicate -> object (Alice's age is 34)
• Subjects and objects:
  o Resources (URIs of entities) – can have properties related to them (http://my-domain.com/Alice)
  o Literals – constant values ("female", "3.14159"); cannot be subjects
  o Blank nodes – used to specify composite properties (e.g., an address composed of a country, city, street name, house number, zip code, etc.)
• Relationships (a.k.a. predicates) – relate one subject to one object
61. RDF serialisation formats (continued)
• RDFa (for HTML and XML embedding)

<body prefix="foaf: http://xmlns.com/foaf/0.1/ schema: http://schema.org/ dcterms: http://purl.org/dc/terms/">
  <div resource="http://example.org/bob#me" typeof="foaf:Person">
    <p>Bob knows <a property="foaf:knows" href="http://example.org/alice#me">Alice</a>
       and was born on the <time property="schema:birthDate" datatype="xsd:date">1990-07-04</time>.</p>
    <p>Bob is interested in <span property="foaf:topic_interest"
       resource="http://www.wikidata.org/entity/Q12418">the Mona Lisa</span>.</p>
  </div>
  <div resource="http://www.wikidata.org/entity/Q12418">
    <p>The <span property="dcterms:title">Mona Lisa</span> was painted by
       <a property="dcterms:creator" href="http://dbpedia.org/resource/Leonardo_da_Vinci">Leonardo da Vinci</a>
       and is the subject of the video
       <a href="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">'La Joconde à Washington'</a>.</p>
  </div>
  <div resource="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
    <link property="dcterms:subject" href="http://www.wikidata.org/entity/Q12418"/>
  </div>
</body>
63. RDF Schema (RDFS)
• Basic capabilities for describing RDF vocabularies
• Includes concepts to describe:
  o classes, class hierarchies (sub-classes), and instances (typing)
  o non-standard literal data types
  o property hierarchies (sub-properties)
  o predicate domain and range
  o utility properties (labels, comments, additional information about things, definitions of resources)
  o …
67. SPARQL querying – query
Question: What are the nicknames of people that Alice knows?
Query:

PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname
}

Graph pattern: a:Alice -foaf:knows-> ?someone -foaf:nick-> ?nickname
70. Data integration using Linked Data: using URIs
Example: relational DB or spreadsheet – dataset about scientific publications:

ID | Name  | Home page
1  | Alice | http://alice.org/
2  | Tim   | https://www.w3.org/People/Berners-Lee/

ID author | ISBN             | Publication topic
1         | 978-3-16-14410-0 | "On the frictional coefficient of bananas"
1         | 534-1-22-66975-1 | "Do woodpeckers get headaches?"
2         | 1-933019-33-6    | "The Semantic Web"
71. Data integration using Linked Data: using URIs (continued)
Graph representation of the new dataset:
• a:Alice -foaf:publications-> http://…/978-3-16-148410-0 -foaf:topic-> "On the frictional coefficient of bananas"
• a:Alice -foaf:publications-> http://…/534-1-22-663975-1 -foaf:topic-> "Do woodpeckers get headaches?"
• t:Tim -foaf:publications-> http://…/1-933019-33-6 -foaf:topic-> "The Semantic Web"
75. Linked Data is great for Open Data
• Linked Data is a great means to represent data
  – Semantics are part of the data
  – Naturally linked to other data
  – Query language
• How Linked Data can improve Open Data:
  – Easier integration; frees data from silos
  – Seamless interlinking of data
  – Helps understand the data
  – New ways to query and interact with data
76. … but it has been ignored by the mainstream
• Difficult to make it accessible to people
  – Publishers
  – Developers
  – Data workers
• Challenges with using Linked Data
  – Lack of tooling and expertise to publish high-quality Linked Data
  – Lack of resources to host LOD endpoints / unreliable data access
• DataGraft: packaging Linked Data to make it more approachable to the open data community
78. “Data is the new oil”
…but many of us just need gasoline
Data-as-a-Service
…is the new filling station
79. Data-as-a-Service
• Outsourcing of various data operations to the cloud
• Eliminates
  – upfront costs for data infrastructure
  – ongoing investment of time and resources in managing the data infrastructure
• A complete package for
  – transformation of raw data into meaningful data assets
  – reliable delivery of data assets
80. DataGraft was developed to allow data workers to manage their data in a simple, effective, and efficient way:
powerful data transformation and reliable data access capabilities.
81. Data Transformation and RDF Publication Process
• Interactive design of transformations?
• Repeatable transformations?
• Reuse/share transformations (user-based access)?
• Cloud-based deployment of transformations?
• Self-serviced process?
• Data and Transformation as-a-Service?
[Diagram: Raw Data → (Transform) → Prepared Data → (Generate RDF, via ontology mapping) → RDF Graph → RDF Triple Store]
102.
Data records (rows):
• Add row
• Take row(s)
• Drop row(s)
• Shift row
• Filter rows (grep)
• Remove duplicate rows
Entire dataset:
• Sort
• Reshape dataset
• Group (categorize) and aggregate
Columns:
• Add column(s)
• Take column(s)
• Drop column(s)
• Move column
• Merge columns
• Split column
• Rename column(s)
• Apply function to all values in a column
108. Data pages and federated querying
Example question: What is the population of locations, and the total number of persons employed in “Human health and social work activities”?
114. DataGraft key feature: Flexible management and sharing of data and transformations
• Interactively build, modify, and share data transformations
• Share transformations privately or publicly
• Fork, reuse, and extend transformations built by other professionals from DataGraft’s transformations catalog
• Reuse transformations to repeatably clean and transform spreadsheet data
• Programmatically access transformations and the transformation catalogue
115. Reuse of transformations in environmental data publishing
TRAGSA Pilot:
• Number of transformations: 42 (created via reuse: 25)
• Number of triples: ~ 7.7M
ARPA Pilot:
• Number of transformations: 5 (created via reuse: 2)
• Number of triples: ~ 14K
Forking/reusing transformations helped us spend less time on creating new transformations.
116. DataGraft key feature: Reliable data hosting and querying services
• Host data on DataGraft’s reliable, cloud-based semantic graph database
• Share data privately or publicly
• Query data through your own SPARQL endpoint
• Programmatically access the data catalogue
• Operations & maintenance performed on behalf of users
120. The context: Statsbygg
• A public sector administration company
• The Norwegian government's key advisor in construction and property affairs
• Building commissioner, property manager, and property developer
• Interest: exploit/share property data in novel ways, for efficiency and sustainability of the property included in the government's civil estate
Example: reporting state-owned real estate properties in Norway
121. Example: Reporting state-owned real estate properties in Norway (cont’d)
Report (before):
• A hard copy of 314 pages and a PDF file
• 6 person-months
• Data collection with spreadsheets
• Quality assurance through e-mail and phone correspondence
Pains:
• Time consuming
• Poor data quality
• Static report without live updating
Reporting service (after):
• Live service
• Efficient sharing of data
• Simplified integration with external datasets
• Live updating
• Reliable access
• …
3rd party services:
• Risk and vulnerability analysis, e.g. buildings affected by flooding
• Analysis of leasing prices
123. Demo Scenario
• Interactively create tabular data transformations
• Reuse/extend data transformations (incl. data annotations)
• RDF data publication and querying
• Integrating and visualising data from different sources
• (Using 3rd party tools with DataGraft)
126. Benefits of DataGraft in use cases
• Simplified data publishing process
• Integration with external data sources using established web standards
• Data that was not publicly available is now published (e.g. air quality data in Oslo)
• Time-efficient publishing
• Repeatable data transformation process
127. DataGraft and Big Data
• Desired features:
  – real-time interactivity
  – batch transformation capability for large datasets
We are developing a hybrid solution that supports both batch and real-time processing.
129. DataGraft – targeted impacts
• Reduction in costs for organisations that lack sufficient expertise and resources to make their data available
• Reduced dependency of data owners on generic cloud platforms to build, deploy, and maintain their linked data from scratch
• Increase in the speed of publishing new datasets and updating existing datasets
• Reduction in the cost and complexity of developing applications that use data
• Increase in the reuse of data, by providing reliable access to numerous datasets hosted on DataGraft.net
130. Example: the benefit of DataGraft in PLUQI
Pipeline (before and after, with DataGraft): datasets gathering → data transformation → data provisioning/access → implementing the app
Before: gathering enough good datasets; designing/implementing
Benefits with DataGraft:
1. 23% development cost reduction
  • Reduced cost of implementing transformations
  • Simpler process integration
2. Able to focus on service quality
131. DataGraft in numbers (as of end of January 2016)
• 238 registered users
• 607 registered data transformations (208 public)
• 1828 uploaded files
• 192 public data pages
132. DataGraft in the wild
• Investigating crime data in small geographies
• Used DataGraft to transform data and publish RDF
http://benproctor.co.uk/investigating-crime-data-at-small-geographies/
133. Data Science and DataGraft
Greater Data Science:
1. Data Exploration and Preparation
2. Data Representation and Transformation
3. Computing with Data
4. Data Visualization and Presentation
5. Data Modeling
6. Science about Data Science
“50 years of Data Science” by David Donoho
http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
136. Summary
• DataGraft – an emerging Data-as-a-Service solution for making (linked) data more accessible
  – Platform, portal, methodology, APIs
  – Online service, functional and documented
  – Validated through several use cases
• Key features:
  – Support for sharable/repeatable/reusable data transformations
  – Reliable RDF Database-as-a-Service