Graph databases & data integration - the case of RDF
1. Graph databases & data integration - The case of RDF
By Dimitris Kontokostas
AKSW/KILT - Leipzig
DBpedia Association
Thessaloniki Java Meetup / 09.05.2016
2. Thessaloniki Java meetup - 09.05.2016
About me
● I live in Veria
● I am an ex-ICT teacher
● Since 2003 I have worked mainly on R&D projects
○ + some web development
● Since 2012 doing a PhD & working in the AKSW group in Leipzig
○ Focusing on semantic web technologies (RDF, SPARQL, and many other scary terms)
○ aka Knowledge Engineer
● I am an open source enthusiast (DBpedia, RDFUnit)
● Recently became a W3C specification editor for SHACL
● Walked across many languages but ended up in Scala, Java, & Bash
○ With bash / CLI as a first choice ;)
6. Thessaloniki Java meetup - 09.05.2016
The four V’s heatmap for Graph Databases
Study in 2013 found:
● many organizations find the “variety” dimension a greater challenge than volume or velocity.
Graph DBs to the rescue:
● combine multiple sources with different structures
● while retaining the flexibility to add new ones without adapting schemas
● query combined data, or multiple sources at once
● detect patterns in the data
(*) See also this
8. Thessaloniki Java meetup - 09.05.2016
Graphs
● A graph is a way of specifying relationships among a collection of items
● Items
○ Nodes - Alice, Bob, …
○ Edges
■ undirected - knows, …
■ directed - follows, …
○ Values - weights, distances, scores, 0-5 scale, …
○ Attributes - name, time, ...
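A minimal sketch of the items above, using only Python's standard library. The node "Carol", the weight 0.8, and the variable names are assumptions made for illustration; only Alice, Bob, "knows" and "follows" come from the slide.

```python
from collections import defaultdict

# Directed edges are (source, label, target); "follows" only goes one way.
directed = [("Alice", "follows", "Bob")]
# Undirected edges are stored once and expanded in both directions below.
undirected = [("Alice", "knows", "Carol")]  # "Carol" is a made-up node

adjacency = defaultdict(set)
for src, label, dst in directed:
    adjacency[src].add((label, dst))
for a, label, b in undirected:
    adjacency[a].add((label, b))
    adjacency[b].add((label, a))

# Attributes hang off nodes, values (here a weight) hang off edges.
attributes = {"Alice": {"name": "Alice"}}
weights = {("Alice", "follows", "Bob"): 0.8}  # illustrative value
```

Looking up `adjacency["Carol"]` finds Alice through "knows", while `adjacency["Bob"]` stays empty because "follows" is directed.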
9. Thessaloniki Java meetup - 09.05.2016
Graph Data Models
Property graphs
● Industry standards
○ Neo4j, Titan, Apache TinkerPop, ...
○ App specific way for querying, exporting, importing, etc
○ Optimized for specific operations and in many cases faster
RDF Graphs
● W3C standards
○ Like XML / HTML, define once run everywhere TM
○ Standardised way for querying, exporting, importing
10. Thessaloniki Java meetup - 09.05.2016
Property Graphs
● Each node has a
○ unique identifier.
○ set of outgoing edges.
○ set of incoming edges.
○ collection of key-value properties.
● Each edge
○ is directed.
○ has a unique identifier.
○ has a label that denotes the type of relationship between its source and target nodes.
○ has a collection of key-value properties.
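The property graph description above can be sketched as two small Python classes. This is an illustration of the data model only, not the API of Neo4j, Titan or TinkerPop; the names `Node`, `Edge` and `connect`, and the `since` property, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


# eq=False keeps identity-based equality, which avoids recursing through
# the node <-> edge cycles when two objects are compared.
@dataclass(eq=False)
class Node:
    id: str                                            # unique identifier
    properties: dict[str, Any] = field(default_factory=dict)
    outgoing: list["Edge"] = field(default_factory=list)
    incoming: list["Edge"] = field(default_factory=list)


@dataclass(eq=False)
class Edge:
    id: str                                            # unique identifier
    label: str                                         # type of relationship
    source: Node
    target: Node
    properties: dict[str, Any] = field(default_factory=dict)


def connect(edge_id: str, label: str, source: Node, target: Node, **properties: Any) -> Edge:
    """Create a directed edge and register it on both of its end nodes."""
    edge = Edge(edge_id, label, source, target, dict(properties))
    source.outgoing.append(edge)
    target.incoming.append(edge)
    return edge


alice = Node("n1", {"name": "Alice"})
bob = Node("n2", {"name": "Bob"})
follows = connect("e1", "follows", alice, bob, since=2016)
```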
11. Thessaloniki Java meetup - 09.05.2016
RDF - Resource Description Framework
● An RDF Graph is a set of RDF Triples
● An RDF triple consists of (only) three components:
○ the subject (is an IRI)
○ the predicate (is an IRI)
○ the object (can be an IRI or Literal)
○ (subjects and objects can also be blank nodes but let’s leave it for now)
[Figure: example RDF graph with the nodes http://dbpedia.org/resource/Java, http://dbpedia.org/resource/C++ and http://dbpedia.org/resource/C#, the literal “1.8.0_60”, and edges labelled dbo:latestReleaseVersion and dbo:influencedBy; each edge reads as Subject, Predicate, Object]
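As a rough sketch, an RDF graph can be pictured as a plain set of three-element tuples. The example assumes the version literal in the figure belongs to the Java resource; the helper `objects` is a hypothetical name, not part of any RDF library.

```python
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

# A graph is a set of (subject, predicate, object) triples.
graph = {
    (DBR + "Java", DBO + "latestReleaseVersion", '"1.8.0_60"'),
}

# Adding the same triple again changes nothing: a set has no duplicates.
graph.add((DBR + "Java", DBO + "latestReleaseVersion", '"1.8.0_60"'))


def objects(graph, subject, predicate):
    """Return every object found for the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}
```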
12. Thessaloniki Java meetup - 09.05.2016
RDF is an abstract data model
Turtle
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix ex: <http://example.com/> .
ex:Dimitris a dbo:Person .
N-Triples
<http://example.com/Dimitris> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .
JSON-LD
{ "@id": "http://example.com/Dimitris",
"@type": "http://dbpedia.org/ontology/Person" }
RDF/XML
<rdf:Description rdf:about="http://example.com/Dimitris">
<rdf:type rdf:resource="http://dbpedia.org/ontology/Person"/>
</rdf:Description>
RDFa (embedded in html)
<div xmlns="http://www.w3.org/1999/xhtml"
prefix=" rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
dbo: http://dbpedia.org/ontology/
rdfs: http://www.w3.org/2000/01/rdf-schema#">
<div typeof="dbo:Person" about="http://example.com/Dimitris">
</div>
</div>
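To make the "abstract data model" point concrete, here is a small sketch that writes one and the same triple in two of the syntaxes above. The function names are hypothetical, and the JSON-LD helper deliberately handles only rdf:type statements with IRI objects; a real application would use an RDF library instead.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# One abstract triple: subject, predicate, object (all IRIs here).
triple = (
    "http://example.com/Dimitris",
    RDF_TYPE,
    "http://dbpedia.org/ontology/Person",
)


def to_ntriples(s: str, p: str, o: str) -> str:
    """Write a triple of three IRIs as one N-Triples line."""
    return f"<{s}> <{p}> <{o}> ."


def to_jsonld(s: str, p: str, o: str) -> str:
    """Write an rdf:type triple as a JSON-LD node object."""
    if p != RDF_TYPE:
        raise ValueError("this sketch only handles rdf:type statements")
    return json.dumps({"@id": s, "@type": o})
```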
16. Thessaloniki Java meetup - 09.05.2016
RDF & Linked Data
● Using HTTP(s) based IRIs we get the Web of Data
○ See the TED talk by Tim Berners-Lee (creator of the WWW)
● Every RDF Resource becomes like a REST GET API that returns all the
RDF triples it is associated with
○ content negotiation for RDF (machine) or HTML (human)
○ Follow-your-nose pattern
[Figure: the earlier example graph (http://dbpedia.org/resource/Java, http://dbpedia.org/resource/C++, http://dbpedia.org/resource/C#, “1.8.0_60”, dbo:latestReleaseVersion, dbo:influencedBy) extended with http://aksw.org/DimitrisKontokostas (ex:learns, dbo:birthPlace) and http://www.geonames.org/733905/ (geo:lat 40.52437, geo:long 22.20242)]
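Content negotiation comes down to the Accept header of an ordinary HTTP GET. The sketch below only builds the request with Python's standard library and does not send it; `rdf_request` is a hypothetical helper, and which media types a given server actually offers is an assumption to check per source.

```python
from urllib.request import Request


def rdf_request(iri: str, media_type: str = "text/turtle") -> Request:
    """Build a GET request that asks for an RDF serialization instead of HTML."""
    return Request(iri, headers={"Accept": media_type})


req = rdf_request("http://dbpedia.org/resource/Java")
# Passing req to urllib.request.urlopen would perform the actual GET;
# a browser sending "Accept: text/html" would get the human-readable page.
```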
18. Thessaloniki Java meetup - 09.05.2016
Vocabularies & Semantics
● Vocabularies/Ontologies define classes and predicates (properties) in
RDF
○ ex:Dimitris a dbo:Person
○ ex:Dimitris dbo:birthDate “1981-06-06”^^xsd:date
● Existing vocabularies capture many use cases
○ DBpedia ontology (general purpose)
○ Schema.org (general purpose / newer, backed by Google, Yahoo, Bing & Yandex)
○ FOAF (Friend of a Friend)
○ Geo (geographical)
○ PROV-O (data provenance)
○ SKOS (classifications)
○ Org (organization structure)
○ … http://lov.okfn.org has more than 400
19. Thessaloniki Java meetup - 09.05.2016
Vocabularies & Semantics
● classes and predicates (properties) have definitions (semantics)
● ex:Dimitris a dbo:Person
○ dbo:Person belongs to a class hierarchy
● ex:Dimitris dbo:birthDate “1981-06-06”^^xsd:date
○ dbo:birthDate expects a dbo:Person as subject
○ dbo:birthDate expects an xsd:date as object
● Reusing existing vocabularies (classes & properties) with defined
semantics is a good practice
○ Get part of the data modeling for free
○ Using common terms can help integrate data more easily
○ Validation (or inference) for free
■ ex:Thessaloniki dbo:birthDate “1981-06-06”^^xsd:date (is Thessaloniki a Person?)
■ ex:Dimitris dbo:birthDate ex:Thessaloniki (ex:Thessaloniki is not an xsd:date)
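The two failing statements above can be checked mechanically once the expectations of dbo:birthDate are written down. This is a minimal sketch that treats them as validation rules; the `schema` and `types` dictionaries are hand-written assumptions (including typing ex:Thessaloniki as dbo:City), and `violations` is a hypothetical helper rather than how RDFUnit or a reasoner works.

```python
# Hand-written slice of a vocabulary: what dbo:birthDate expects.
schema = {"dbo:birthDate": {"domain": "dbo:Person", "range": "xsd:date"}}

# Assumed type statements for the resources used in the examples.
types = {"ex:Dimitris": "dbo:Person", "ex:Thessaloniki": "dbo:City"}


def violations(subject, predicate, obj, obj_datatype=None):
    """Return which expectations ("domain", "range") a triple breaks."""
    rule = schema.get(predicate)
    if rule is None:
        return []  # nothing is known about this predicate
    problems = []
    if types.get(subject) != rule["domain"]:
        problems.append("domain")
    if obj_datatype != rule["range"]:
        problems.append("range")
    return problems
```

An inference engine would read the same definitions the other way round and conclude that ex:Thessaloniki is a dbo:Person, which is why the slide says "validation (or inference)".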
20. Thessaloniki Java meetup - 09.05.2016
Data integration with RDF
● Very simple graph data model
● Convert your data to RDF and model against common vocabularies
○ Design applications against vocabularies
○ Integrate multiple different sources
● Local identifiers are a common integration problem
● Link to data authorities
○ ex:Dimitris dbo:birthPlace geonames:733905 (instead of the local ex:Veria)
○ (or) ex:Veria owl:sameAs geonames:733905
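One way to use owl:sameAs links during integration is to collapse every local identifier onto a shared one. The sketch below does this with a small union-find; `canonical_ids` is a hypothetical helper, "other:Veroia" is an invented second local identifier, and the rule that the right-hand side of each pair wins is an assumption of this example.

```python
def canonical_ids(same_as_pairs):
    """Map every identifier to one representative of its owl:sameAs group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the chain as we walk it
            x = parent[x]
        return x

    for local, authority in same_as_pairs:
        root_local, root_authority = find(local), find(authority)
        if root_local != root_authority:
            parent[root_local] = root_authority  # right-hand side wins
    return {x: find(x) for x in list(parent)}


ids = canonical_ids([
    ("ex:Veria", "geonames:733905"),
    ("other:Veroia", "ex:Veria"),  # hypothetical identifier from a second source
])
```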
21. Thessaloniki Java meetup - 09.05.2016
Pay as you go Data Integration
● RDF views on top of an RDBMS (e.g. MySQL) with R2RML (W3C spec)
○ Mapping files define how SQL queries / tables translate to RDF
○ Queryable through a virtual SPARQL endpoint translating SPARQL to SQL
● Convert XML/JSON/CSV/… to RDF with RML.io using mapping files
● Find links to external databases with Limes & Silk
○ e.g.: ex:Veria owl:sameAs geonames:733905
● You can get some benefit with low effort
● The more time you invest the better the results
● (Common practice) work on secondary RDF views of your data
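The idea behind a mapping file can be sketched in a few lines: a subject template plus a column-to-predicate table turns each row into triples. This is not R2RML or RML syntax; `rows_to_triples`, the table columns and the sample rows are all assumptions made for illustration.

```python
def rows_to_triples(rows, subject_template, column_to_predicate):
    """Turn table rows into (subject, predicate, value) triples."""
    for row in rows:
        subject = subject_template.format(**row)
        for column, predicate in column_to_predicate.items():
            value = row.get(column)
            if value is not None:  # a NULL column yields no triple
                yield (subject, predicate, value)


# Hypothetical rows, as they might come back from a SQL query.
rows = [
    {"id": 1, "name": "Dimitris", "birth_date": "1981-06-06"},
    {"id": 2, "name": "Alice", "birth_date": None},
]
mapping = {"name": "foaf:name", "birth_date": "dbo:birthDate"}
triples = list(rows_to_triples(rows, "http://example.com/person/{id}", mapping))
```

Because the mapping is data rather than code, it can start with one or two columns and grow over time, which is the "pay as you go" part.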
22. Thessaloniki Java meetup - 09.05.2016
Who uses RDF (in public)
https://github.com/json-ld/json-ld.org/wiki/Users-of-JSON-LD
23. Thessaloniki Java meetup - 09.05.2016
Some More Statistics
● Based on the Common Crawl of Nov 2015
● 30% of HTML pages (541M / 1.77B pages) contained structured data.
● This 30% originates from 2.72M different pay-level-domains out of the
14.41 million pay-level-domains covered by the crawl (19%).
○ 521K websites use RDFa
○ 1.1 million Microdata
○ 586K have embedded JSON-LD (mostly for search actions)
● Altogether, the extracted data sets consist of 24.38 billion RDF quads.
http://webdatacommons.org/structureddata/2015-11/stats/stats.html#results-2015-1
25. Thessaloniki Java meetup - 09.05.2016
SPARQL
“Which films starred John Cleese without any other members of Monty Python?”
SPARQL Examples by Markus Ackermann & Markus Freudenberg
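The logic of that question is "match a pattern, then exclude rows where a second pattern also matches", which SPARQL 1.1 can express with FILTER NOT EXISTS or MINUS. The sketch below shows the same logic over an in-memory dictionary; all film and actor identifiers are placeholders, not real DBpedia data, and `films_with_only` is a hypothetical helper.

```python
# Placeholder cast lists; none of these identifiers are real data.
starring = {
    "ex:FilmA": {"ex:JohnCleese", "ex:ActorX"},
    "ex:FilmB": {"ex:JohnCleese", "ex:PythonMember1"},
    "ex:FilmC": {"ex:ActorX"},
}
monty_python = {"ex:JohnCleese", "ex:PythonMember1", "ex:PythonMember2"}


def films_with_only(actor, group, starring):
    """Films starring `actor` but no other member of `group`."""
    others = group - {actor}
    return sorted(
        film
        for film, cast in starring.items()
        if actor in cast and not (cast & others)
    )
```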
34. Schema.org
● Vocabulary backed by the major search engines
● RDF data model
○ Normative format is JSON-LD
○ RDF is not actively mentioned (to not scare people away)
○ Allows use as general structured data (e.g. microdata)
● Enriches a lot of (at least) Google’s applications
○ Search (try e.g. recipes)
○ Gmail (travel, events, actions,...)
○ Google Now
○ Google Knowledge Graph
○ ...
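As a small illustration of Schema.org markup, the sketch below builds a JSON-LD description of a recipe and wraps it in the script tag used to embed it in a web page. The helper names and the sample recipe are assumptions; `Recipe`, `name` and `recipeIngredient` are Schema.org terms.

```python
import json


def recipe_jsonld(name, ingredients):
    """Describe a recipe with Schema.org terms as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Recipe",
        "name": name,
        "recipeIngredient": list(ingredients),
    }


def as_script_tag(data):
    """Wrap JSON-LD so it can be embedded in an HTML page."""
    return '<script type="application/ld+json">' + json.dumps(data) + "</script>"


# Sample values are invented for the example.
tag = as_script_tag(recipe_jsonld("Greek salad", ["tomato", "feta", "olives"]))
```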
42. Thessaloniki Java meetup - 09.05.2016
Entity disambiguation
aka NERD (Named Entity Resolution & Disambiguation)
● George Bush is sitting in front of the White House
○ George: some George?
○ Bush: a small plant
○ George Bush: former president of the USA
○ White: Colour
○ House: a house
○ White House:
● http://dbpedia-spotlight.github.io/demo/
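One simple way to see why "George Bush" should win over "George" and "Bush" is greedy longest-match lookup against a list of known names. This is only a toy sketch: it is not how DBpedia Spotlight works, it ignores context entirely, and the lexicon entries, the `ex:` identifiers and the `annotate` function are all hypothetical.

```python
# Toy lexicon mapping surface forms to made-up entity identifiers.
lexicon = {
    "George Bush": "ex:GeorgeBush",
    "Bush": "ex:Shrub",
    "White House": "ex:WhiteHouse",
    "White": "ex:ColourWhite",
    "House": "ex:Building",
}


def annotate(text, lexicon, max_len=3):
    """Greedily match the longest known phrase starting at each token."""
    tokens = text.split()
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                found.append((phrase, lexicon[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts at this token
    return found


annotations = annotate("George Bush is sitting in front of the White House", lexicon)
```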
43. Thessaloniki Java meetup - 09.05.2016
Data Quality
● As mentioned earlier, we can (re) use the vocabulary semantics for
automatic data validation
● RDFUnit - https://github.com/AKSW/RDFUnit
○ Automatically generates data unit tests based on the vocabularies your data uses
○ Custom JUnit runner
● SHACL - http://w3c.github.io/data-shapes/shacl/
○ Language to define advanced data constraints on RDF Graphs
○ (In progress) W3C recommendation
44. Thessaloniki Java meetup - 09.05.2016
ALIGNED project
● Aligning software & data engineering
● Tools & techniques for agility in changes in code / data
● http://aligned-project.eu
● Offers free consultancy on ALIGNED tools
○ See website for more info
45. Thessaloniki Java meetup - 09.05.2016
Wrapping up / Key points
● Data variety is a common problem
● Integrating Data can be a pain :)
● Graph Databases can help, RDF can sometimes be more appropriate
● Pay as you go data integration
○ Map your data to RDF
○ Keep RDF as a copy of your source data
● RDF helps you develop reusable applications against schemas
● Schema.org
○ For website markups
○ For defining actions
● JSON-LD (embedded mappings)
● RDF for text annotations
● There is very good tool support for RDF in Java