Web Data Management in the RDF Age

M. Tamer Özsu
University of Waterloo
David R. Cheriton School of Computer Science
ICDCS’17, 2017-06-07

Acknowledgements
This presentation draws upon collaborative research and
discussions with the following colleagues (in alphabetical order)
Güneş Aluç, University of Waterloo; now at SAP
Khuzaima Daudjee, University of Waterloo
Olaf Hartig, University of Waterloo; now at Linköping Univ.
Lei Chen, Hong Kong University of Science & Technology
Lei Zou, Peking University

Web Data Management
A long-term research interest in the DB community
[Figure: four works from the DB community, dated 2000, 2004, 2011, and 2011]

Interest Due to Properties of Web Data
Lack of a schema
Data is at best “semi-structured”
Missing data, additional attributes, similar but not identical data
Volatility
Changes frequently
May conform to one schema now, but not later
Scale
Does it make sense to talk about a schema for the Web?
How do you capture “everything”?
Querying difficulty
What is the user language?
What are the primitives?
Aren’t search engines or metasearch engines sufficient?

More Recent Approaches to Web Querying
Fusion Tables
Users contribute data in spreadsheet, CSV, KML format
Possible joins between multiple data sets
Extensive visualization

XML
Data exchange language
Primarily tree based structure
<list title="MOVIES">
<film>
<title>The Shining</title>
<director>Stanley Kubrick</director>
<actor>Jack Nicholson</actor>
</film>
<film>
<title>Spartacus</title>
<director>Stanley Kubrick</director>
</film>
<film>
<title>The Passenger</title>
<actor>Jack Nicholson</actor>
</film>
...
</list>
[Figure: the XML document as a tree: a root node with film children; each film has title, director, and actor children whose leaves are the text values, e.g. “The Shining”, “Stanley Kubrick”, “Jack Nicholson”]

Linked Open Data (LOD)
W3C work; community effort
Maintains autonomy of data sources
Low barrier to entry

Traditional Hypertext-based Web Access
[Figure: data sources (IMDb, World Book)]
Data exposed to the Web via HTML

Linked Data Publishing Principles
[Figure: data sources (IMDb, World Book) publishing triples; the resources Shining and UK are connected by data links]
(http://...linkedmdb.../Shining, releaseDate, 23 May 1980)
(http://...linkedmdb.../Shining, filmLocation, http://cia.../UK)
(http://...linkedmdb.../29704, actedIn, http://...linkedmdb.../Shining)
...
(http://cia.../UK, hasPopulation, 63230000)
...
Data model: RDF
Global identifier: URI
Access mechanism: HTTP
Connection: data links

Linked Open Data – Closer Look

LOD Data Volumes . . .
. . . are growing – and fast
Linked data cloud currently consists of 3000 datasets with
>84B triples
Size almost doubling every year

[Figure: Linking Open Data cloud diagram as of March 2009: 89 datasets]
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.
http://lod-cloud.net/

[Figure: Linking Open Data cloud diagram as of September 2010: 203 datasets. Legend: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content]

[Figure: Linking Open Data cloud diagram as of September 2011: 295 datasets, 25B triples. Legend: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content]

[Figure: Linking Open Data cloud diagram as of April 2014: 570 datasets, ??? triples]
Max Schmachtenberg, Christian Bizer, and Heiko Paulheim: Adoption of Linked
Data Best Practices in Different Topical Domains. In Proc. ISWC, 2014.

Globally Distributed Network of Data

Outline
1 RDF Technology [Özsu, 2016]
Data Warehousing Approach
Distributed RDF Processing
2 Federated RDF Systems
SPARQL Endpoint Federation
General RDF Federation
3 LOD – Live Querying Approach [Hartig, 2013a]
Traversal-based approaches
Index-based approaches
Hybrid approaches
4 Conclusions

RDF Introduction
Everything is a uniquely named resource
http://data.linkedmdb.org/resource/actor/JN29704
Prefixes can be used to shorten the names
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
Properties of resources can be defined
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
Relationships with other resources can be defined
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
y:JN29704 and y:TS2014 are linked by JN29704:movieActor
Resource descriptions can be contributed by different people/groups and can be located anywhere on the Web
Integrated Web “database”

RDF Data Model
Triple: Subject, Predicate (Property), Object (s, p, o)
Subject: the entity that is described (URI or blank node)
Predicate: a feature of the entity (URI)
Object: value of the feature (URI, blank node or literal)
(s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L)
U: set of URIs
B: set of blank nodes
L: set of literals
A set of RDF triples is called an RDF graph
Subject                      Predicate          Object
http://...imdb.../film/2014  rdfs:label         “The Shining”
http://...imdb.../film/2014  movie:releaseDate  “1980-05-23”
http://...imdb.../29704      movie:actor_name   “Jack Nicholson”
. . .                        . . .              . . .
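The membership condition (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L) can be checked mechanically. The following is a minimal sketch, not part of any RDF library: the classes `URI`, `BlankNode`, and `Literal` and the function `is_valid_triple` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:          # element of U (hypothetical helper class)
    value: str


@dataclass(frozen=True)
class BlankNode:    # element of B (hypothetical helper class)
    label: str


@dataclass(frozen=True)
class Literal:      # element of L (hypothetical helper class)
    value: object


def is_valid_triple(s, p, o):
    """Check (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L)."""
    return (isinstance(s, (URI, BlankNode))
            and isinstance(p, URI)
            and isinstance(o, (URI, BlankNode, Literal)))


film = URI("http://data.linkedmdb.org/resource/film/2014")
label = URI("http://www.w3.org/2000/01/rdf-schema#label")

# An RDF graph is simply a set of valid triples.
graph = {(film, label, Literal("The Shining"))}
```

A literal is rejected in the subject position and a blank node in the predicate position, which is exactly what the product of sets above expresses.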

RDF Example Instance
Prefixes: mdb=http://data.linkedmdb.org/resource/; geo=http://sws.geonames.org/;
bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/;
lexvo=http://lexvo.org/id/; wp=http://en.wikipedia.org/wiki/
Subject              Predicate                   Object
mdb:film/2014        rdfs:label                  “The Shining”
mdb:film/2014        movie:initial_release_date  “1980-05-23”
mdb:film/2014        movie:director              mdb:director/8476
mdb:film/2014        movie:actor                 mdb:actor/29704
mdb:film/2014        movie:actor                 mdb:actor/30013
mdb:film/2014        movie:music_contributor     mdb:music_contributor/4110
mdb:film/2014        foaf:based_near             geo:2635167
mdb:film/2014        movie:relatedBook           bm:books/0743424425
mdb:film/2014        movie:language              lexvo:iso639-3/eng
mdb:director/8476    movie:director_name         “Stanley Kubrick”
mdb:film/2685        rdfs:label                  “A Clockwork Orange”
mdb:film/424         rdfs:label                  “Spartacus”
mdb:actor/29704      movie:actor_name            “Jack Nicholson”
mdb:film/1267        rdfs:label                  “The Last Tycoon”
mdb:film/3418        rdfs:label                  “The Passenger”
geo:2635167          gn:name                     “United Kingdom”
geo:2635167          gn:population               62348447
geo:2635167          gn:wikipediaArticle         wp:United_Kingdom
bm:books/0743424425  dc:creator                  bm:persons/Stephen+King
bm:books/0743424425  rev:rating                  4.7
bm:books/0743424425  scom:hasOffer               bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng   rdfs:label                  “English”
lexvo:iso639-3/eng   lvont:usedIn                lexvo:iso3166/CA
lexvo:iso639-3/eng   lvont:usesScript            lexvo:script/Latn
Subjects and predicates are URIs; objects are URIs or literals

RDF Graph
[Figure: the example instance drawn as an RDF graph. Subjects and objects are vertices and predicates are labeled edges; the figure also shows mdb:actor/30013 with movie:actor_name “Shelley Duvall”.]

UniProt in RDF http://www.uniprot.org
UniProt collects data from >150 biological resources
Claim: “lack of a common standard to represent and link
information makes data integration an expensive business” ⇒
RDF can help

UniProt in RDF – What does the data look like?
UniProt accession for the human CYP51 protein – Q16850
Encode it as RDF:
http://purl.uniprot.org/uniprot/Q16850.rdf

RDF/XML format
<rdf:Description rdf:about="http://purl.uniprot.org/citations/8619637">
  <rdf:type rdf:resource="http://purl.uniprot.org/core/Journal_Citation"/>
  <title>The ubiquitously expressed human CYP51 encodes lanosterol 14 alpha-demethylase, a cytochrome P450 whose expression is regulated by oxysterols.</title>
  <author>Stroemstedt M.</author>
  <author>Rozman D.</author>
  <author>Waterman M.R.</author>
  <skos:exactMatch rdf:resource="http://purl.uniprot.org/pubmed/8619637"/>
  <foaf:primaryTopicOf rdf:resource="https://www.ncbi.nlm.nih.gov/pubmed/8619637"/>
  <dcterms:identifier>doi:10.1006/abbi.1996.0193</dcterms:identifier>
  <date rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">1996</date>
  <name>Arch. Biochem. Biophys.</name>
  <volume>329</volume>
  <pages>73-81</pages>
</rdf:Description>
This can be shown as a table <Subject, Predicate, Object>

RDF Query Model – SPARQL
Query model: SPARQL Protocol and RDF Query Language
Given U (set of URIs), L (set of literals), and V (set of variables), a SPARQL expression is defined recursively:
an atomic triple pattern, which is an element of (U ∪ V) × (U ∪ V) × (U ∪ V ∪ L)
?x rdfs:label “The Shining”
P FILTER R, where P is a graph pattern expression and R is a built-in SPARQL condition (i.e., analogous to a SQL predicate)
?x rev:rating ?p FILTER(?p > 3.0)
P1 AND/OPT/UNION P2, where P1 and P2 are graph pattern expressions
Example:
SELECT ?name
WHERE {
  ?m rdfs:label ?name . ?m movie:director ?d .
  ?d movie:director_name "Stanley Kubrick" .
  ?m movie:relatedBook ?b . ?b rev:rating ?r .
  FILTER(?r > 4.0)
}
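To make the semantics of triple patterns, AND, and FILTER concrete, here is a small sketch of an evaluator over an in-memory list of triples. It is an illustration under simplifying assumptions (terms are plain Python values, variables are strings starting with "?", and the functions `match_pattern` and `evaluate` are hypothetical names), not how a SPARQL engine is actually built.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None if impossible."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if isinstance(term, str) and term.startswith("?"):
            # A variable must keep the value it was already bound to.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def evaluate(patterns, triples, keep=lambda b: True):
    """AND of triple patterns, followed by a FILTER given as a predicate."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match_pattern(pattern, t, b)) is not None]
    return [b for b in bindings if keep(b)]


triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("bm:books/0743424425", "rev:rating", 4.7),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
]
query = [
    ("?m", "rdfs:label", "?name"),
    ("?m", "movie:director", "?d"),
    ("?d", "movie:director_name", "Stanley Kubrick"),
    ("?m", "movie:relatedBook", "?b"),
    ("?b", "rev:rating", "?r"),
]
names = [b["?name"] for b in evaluate(query, triples, keep=lambda b: b["?r"] > 4.0)]
```

On this excerpt of the example instance only “The Shining” qualifies, because “A Clockwork Orange” has no related book with a rating.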

SPARQL Queries
[Figure: the example query drawn as a query graph with vertices ?m, ?d, ?name, ?b, ?r and edges movie:director, rdfs:label, movie:relatedBook, movie:director_name, rev:rating, plus FILTER(?r > 4.0)]

UniProt in RDF – The Data Can Be Queried
RDF-encoded UniProt data can be queried using SPARQL:
http://sparql.uniprot.org/sparql
Get the GO function for Q16850 (from the UniProt SPARQL endpoint)
PREFIX upc: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?goid ?golabel
WHERE {
  <http://purl.uniprot.org/uniprot/Q16850> a upc:Protein ;
      upc:classifiedWith ?keyword .
  ?keyword rdfs:seeAlso ?goid .
  ?goid rdfs:label ?golabel .
}
Find the differential expression of probes that map to Q16850, with their p-values (from the Expression Atlas SPARQL endpoint)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?valueLabel ?pvalue
WHERE {
  ?value rdfs:label ?valueLabel .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref <http://purl.uniprot.org/uniprot/Q16850> .
}
ORDER BY ASC(?pvalue)
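A SPARQL endpoint accepts a query over HTTP, with the query text passed in the `query` parameter. The sketch below only builds such a request with the Python standard library and does not send it; the endpoint URL is the one given above, while the shortened query and the function name `build_request` are illustrative assumptions, and whether the endpoint is still reachable at that address is not something this sketch checks.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://sparql.uniprot.org/sparql"  # endpoint URL as given on the slide

# Shortened, illustrative query: only asks for the classification keywords.
QUERY = """
PREFIX upc: <http://purl.uniprot.org/core/>
SELECT ?keyword
WHERE {
  <http://purl.uniprot.org/uniprot/Q16850> upc:classifiedWith ?keyword .
}
"""


def build_request(endpoint, query):
    """Build a GET request that carries the query in the `query` parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
# urllib.request.urlopen(request) would submit the query; it is left out here
# so that the sketch runs without network access.
```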

Naïve Triple Store Design
Subject        Property                    Object
mdb:film/2014  rdfs:label                  “The Shining”
mdb:film/2014  movie:initial_release_date  “1980-05-23”

SELECT ?name
WHERE {
  ?m rdfs:label ?name . ?m movie:director ?d .
  ?d movie:director_name "Stanley Kubrick" .
  ?m movie:relatedBook ?b . ?b rev:rating ?r .
  FILTER(?r > 4.0)
}

SELECT T1.o
FROM T AS T1, T AS T2, T AS T3,
     T AS T4, T AS T5
WHERE T1.p = 'rdfs:label'
  AND T2.p = 'movie:relatedBook'
  AND T3.p = 'movie:director'
  AND T4.p = 'rev:rating'
  AND T5.p = 'movie:director_name'
  AND T1.s = T2.s
  AND T1.s = T3.s
  AND T2.o = T4.s
  AND T3.o = T5.s
  AND T4.o > 4.0
  AND T5.o = 'Stanley Kubrick'
Easy to implement, but too many self-joins!
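The single-table design can be tried out directly with an embedded relational engine. The following sketch uses Python's built-in `sqlite3` module and a handful of triples from the example instance; the table name `T` and the columns `s`, `p`, `o` follow the SQL above, and the choice of SQLite is only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table holds every triple; the object column is left untyped
# because it stores both strings and numbers.
conn.execute("CREATE TABLE T (s TEXT, p TEXT, o)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?)", [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("bm:books/0743424425", "rev:rating", 4.7),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
])

# Five triple patterns become five aliases of the same table.
SQL = """
SELECT T1.o
FROM T AS T1, T AS T2, T AS T3, T AS T4, T AS T5
WHERE T1.p = 'rdfs:label'
  AND T2.p = 'movie:relatedBook'
  AND T3.p = 'movie:director'
  AND T4.p = 'rev:rating'
  AND T5.p = 'movie:director_name'
  AND T1.s = T2.s AND T1.s = T3.s
  AND T2.o = T4.s AND T3.o = T5.s
  AND T4.o > 4.0
  AND T5.o = 'Stanley Kubrick'
"""
rows = conn.execute(SQL).fetchall()
```

Every additional triple pattern in the SPARQL query adds one more alias of `T`, which is where the self-join cost comes from.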

Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore
[Weiss et al., 2008]
Strings are mapped to ids using a mapping table
Create indexes for permutations of the three columns: SPO,
SOP, PSO, POS, OPS, OSP
Triples encoded as ids (subject, property, object):
0 1 2
0 3 4
5 6 7
8 9 5
...
Mapping table:
ID  Value
0   mdb:film/2014
1   rdfs:label
2   “The Shining”
3   movie:initial_release_date
4   “1980-05-23”
5   mdb:director/8476
6   movie:director_name
7   “Stanley Kubrick”
8   mdb:film/2685
9   movie:director

Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join

Advantages
Eliminates some of the joins – they become range queries
Merge join is easy and fast
Disadvantages
Space usage
Expensive updates
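The idea of dictionary encoding plus one sorted index per column permutation can be sketched in a few lines. This is a toy illustration using sorted Python lists and `bisect`, not the compressed B+-tree organization of RDF-3X or Hexastore; the names `encode`, `indexes`, and `range_scan` are hypothetical.

```python
from bisect import bisect_left
from itertools import permutations

triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:initial_release_date", "1980-05-23"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
]

# Mapping table: every string gets the next free integer id.
ids = {}


def encode(term):
    return ids.setdefault(term, len(ids))


encoded = [tuple(encode(term) for term in triple) for triple in triples]

# One sorted copy of the data per permutation of (s, p, o): six in total.
indexes = {}
for order in permutations("spo"):
    columns = ["spo".index(c) for c in order]
    indexes["".join(order).upper()] = sorted(
        tuple(t[c] for c in columns) for t in encoded)


def range_scan(index_name, prefix):
    """Return all index rows that start with `prefix` (a tuple of ids)."""
    index = indexes[index_name]
    rows = []
    for row in index[bisect_left(index, prefix):]:
        if row[:len(prefix)] != prefix:
            break
        rows.append(row)
    return rows


# Pattern "?m movie:director mdb:director/8476": known P and O, so scan POS.
hits = range_scan("POS", (ids["movie:director"], ids["mdb:director/8476"]))
```

The six copies of the data make the space and update costs visible: every inserted triple has to be placed into each of the sorted indexes.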

Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF
[Bornea et al., 2013]
Clustered property table: group together the properties that
tend to occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type
of property into one property table
Subject        rdfs:label            movie:director
mdb:film/2014  “The Shining”         mdb:director/8476
mdb:film/2685  “A Clockwork Orange”  mdb:director/8476
. . .          . . .                 . . .

Subject          movie:actor_name
mdb:actor/29704  “Jack Nicholson”

Advantages
Fewer joins
If the data is structured, we have a relational system – similar
to normalized relations
Disadvantages
Potentially a lot of NULLs
Clustering is not trivial
Multi-valued properties are complicated
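A property table can be sketched as one row per subject and one column per property. The snippet below is only an illustration with a hypothetical helper `build_property_table`; the set of properties is chosen by hand here, whereas deciding which properties to cluster is exactly the non-trivial part noted above.

```python
triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:film/424", "rdfs:label", "Spartacus"),
]


def build_property_table(triples, properties):
    """One row per subject, one column per property; None plays the role of NULL."""
    table = {}
    for s, p, o in triples:
        if p in properties:
            row = table.setdefault(s, {q: None for q in properties})
            # A second value for the same (subject, property) would overwrite
            # the first one: the multi-valued property problem.
            row[p] = o
    return table


film_table = build_property_table(triples, ["rdfs:label", "movie:director"])
```

Both properties of a film now come from a single row, so no join is needed, but mdb:film/424 carries a NULL because this excerpt has no director triple for it.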

Vertical Partitioning
Binary Tables [Abadi et al., 2007, 2009]:
Grouping by properties: for each property, build a two-column table containing both subject and object, ordered by subject
n two-column tables (n is the number of unique properties in the data)
movie:director
Subject        Object
mdb:film/2014  mdb:director/8476
mdb:film/2685  mdb:director/8476
rdfs:label
Subject        Object
mdb:film/2014  “The Shining”
mdb:film/2685  “A Clockwork Orange”
movie:actor_name
Subject          Object
mdb:actor/29704  “Jack Nicholson”
. . .

Advantages
Supports multi-valued properties
No NULLs
No clustering
Read only needed attributes (i.e. less I/O)
Good performance for subject-subject joins
Disadvantages
Not useful for subject-object joins
Expensive inserts
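Because every two-column table is ordered by subject, a subject-subject join reduces to a linear merge. The sketch below illustrates this on in-memory lists; `vertical_partition` and `merge_join_on_subject` are hypothetical helper names, and a real column store would of course not keep the tables as Python lists.

```python
from collections import defaultdict

triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:film/424", "rdfs:label", "Spartacus"),
]


def vertical_partition(triples):
    """One (subject, object) table per property, sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows, key=lambda r: r[0]) for p, rows in tables.items()}


def merge_join_on_subject(left, right):
    """Merge join of two subject-sorted tables; returns (s, o_left, o_right)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            s = left[i][0]
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == s:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == s:
                j_end += 1
            # Multi-valued properties simply yield several rows per subject.
            out.extend((s, lo, ro)
                       for _, lo in left[i:i_end] for _, ro in right[j:j_end])
            i, j = i_end, j_end
    return out


tables = vertical_partition(triples)
joined = merge_join_on_subject(tables["rdfs:label"], tables["movie:director"])
```

A subject-object join gets no such help, since the tables are not sorted on the object column.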

TripleBit [Yuan et al., 2013]:
Create a table with |triple| columns, |objects| + |subjects| rows
with “1” if object/subject exists in triple; groups columns by
predicate
Compress columns (since they are sparse); partition by
predicate, then partition into chunks
(P,S,O) and (P,O,S) indexes on the chunks

Graph-based Approach [Zou and Özsu, 2017]
Answering SPARQL query ≡ subgraph matching using
homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Aluç et al., 2013]
[Figure: subgraph matching of the example query graph (?m, ?d, ?name, ?b, ?r with FILTER(?r > 4.0)) against the example RDF graph]

Advantages
Maintains the graph structure
Full set of queries can be handled
Disadvantages
Graph pattern matching is expensive

Two Systems
gStore
[Figure: gStore system architecture. Offline components: RDF Parser, RDF Graph Builder, Encoding Module, VS*-tree builder (RDF data, RDF triples, RDF graph, signature graph). Storage: Key-Value Store, VS*-tree Store. Online components: SPARQL Parser, Encoding Module, Filter Module, Join Module (SPARQL query, query graph, signature graph, node candidates, results).]
Excerpt visible in the figure: “We encode query Q with the same encoding method. Consequently, the match between Q and G can be verified by simply checking the match between corresponding encoded bitstrings.”
chameleon-db
[Figure: chameleon-db architecture: Query Engine (Plan Generation, Evaluation), Storage System, Storage Advisor, Cluster Index, Structural Index, Vertex Index, Spill Index]

gStore
12,000 lines of C++ code under Linux (plus code for SPARQL parser)
Encode each vertex of the RDF graph as a bit array capturing the neighbourhood relationship (G*)
Build a multilevel summary tree index (VS*-tree) to capture “connections”
Encode the query graph similarly (Q*)
Find candidate matching nodes of Q* in G* using the VS*-tree
Multiway join of the candidates
[Figure: example vertex signatures (bit arrays) for vertices 001 to 011 and a three-level VS*-tree (G1, G2, G3) built over them]
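The filtering step can be illustrated with a simple signature scheme: each vertex gets a bit array obtained by OR-ing bits derived from its adjacent edges, and a data vertex remains a candidate for a query vertex only if it has every bit that the query vertex has. The hash function, the signature length `SIG_BITS`, and the function names below are assumptions made for this sketch; they are not gStore's actual encoding, and the VS*-tree itself is not shown.

```python
import zlib

SIG_BITS = 16  # signature length, chosen arbitrarily for this sketch


def _bit(text):
    """Map a label to one bit position (assumed hash, not gStore's)."""
    return 1 << (zlib.crc32(text.encode("utf-8")) % SIG_BITS)


def vertex_signature(adjacent_edges):
    """OR together one bit per edge label and one per known neighbour label."""
    signature = 0
    for edge_label, neighbour_label in adjacent_edges:
        signature |= _bit("e:" + edge_label)
        if neighbour_label is not None:  # None stands for a query variable
            signature |= _bit("n:" + neighbour_label)
    return signature


def may_match(query_signature, data_signature):
    """Necessary (not sufficient) condition: all query bits are set in the data."""
    return (query_signature & data_signature) == query_signature


# Data vertex mdb:film/2014 with three of its outgoing edges.
film_sig = vertex_signature([
    ("rdfs:label", "The Shining"),
    ("movie:director", "mdb:director/8476"),
    ("movie:relatedBook", "bm:books/0743424425"),
])
# Query vertex ?m: same edge labels, neighbours are variables.
query_sig = vertex_signature([
    ("rdfs:label", None),
    ("movie:director", None),
    ("movie:relatedBook", None),
])
is_candidate = may_match(query_sig, film_sig)
```

Vertices that fail the bit test can be discarded without touching the graph; vertices that pass are only candidates and still have to be confirmed by the join step, since different labels may hash to the same bit.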

chameleon-db
35,000 lines of C++ code under Linux (plus code for SPARQL 1.0 parser)
Adaptivity to workload due to variability of Web workloads and the variability of composition of SPARQL triple patterns
An experiment [Aluç et al., 2014a]
No single system is the sole winner across all queries
No single system is the sole loser across all queries, either
2–5 orders of magnitude difference in performance between the best and the worst system for a given query
The winner in one query may time out in another
Performance difference widens as dataset size increases
Group-by-query approach [Aluç et al., 2014b]

Remember the Environment
Distributed environment
Some (not all) of the data sites can process SPARQL queries – SPARQL endpoints
See next section

Some (not all) of the data
sites can process SPARQL
queries – SPARQL
endpoints
See next section
Alternatives
Cloud-based approaches
Data re-distribution +
query decomposition
Data re-distribution +
partial evaluation

Cloud-based Solutions [Kaoudi and Manolescu, 2015]
RDF data warehouse D is partitioned ({D1, . . . , Dn}) and
placed on cloud platforms (such as HDFS, HBase)
SPARQL query is run through MapReduce jobs
Data parallel execution
Examples: SHARD [Rohloff and Schantz, 2010], HadoopRDF [Husain et al., 2011], EAGRE [Zhang et al., 2013] and Jena-HBase [Khadilkar et al., 2012]
High scalability and fault-tolerance
Possibly low performance since MapReduce is not suitable for
graph processing
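A rough, single-process sketch of how one join between two triple patterns could be phrased as a map and a reduce step. This is not the algorithm of any of the systems named above; the function names are hypothetical, and the sketch assumes the patterns share only the one join variable.

```python
from collections import defaultdict


def match(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if binding.get(p, t) != t:
                return None
            binding[p] = t
        elif p != t:
            return None
    return binding


def map_phase(partition, patterns, join_var):
    """Emit (join value, (pattern index, binding)) for every local match."""
    for triple in partition:
        for i, pattern in enumerate(patterns):
            b = match(pattern, triple)
            if b is not None and join_var in b:
                yield b[join_var], (i, b)


def reduce_phase(grouped, n_patterns):
    """For each join value, combine one binding per pattern."""
    results = []
    for items in grouped.values():
        partial = [{}]
        for i in range(n_patterns):
            bindings = [b for j, b in items if j == i]
            partial = [{**acc, **b} for acc in partial for b in bindings]
        results.extend(partial)
    return results


def run_job(partitions, patterns, join_var):
    grouped = defaultdict(list)   # shuffle: group mapper output by key
    for part in partitions:       # each partition would be its own mapper
        for key, value in map_phase(part, patterns, join_var):
            grouped[key].append(value)
    return reduce_phase(grouped, len(patterns))
```

A query with several join variables would need a chain of such jobs, each materializing its intermediate result, which is one reason this style can be slow for graph-shaped queries.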

Partition-based Approaches
(Offline) Partition an RDF data warehouse (graph) into several fragments that are distributed to sites
RDF data D = {D1, . . . , Dn}
Allocate each Di to a site
Partitioning alternatives
Table-based (e.g., [Husain et al., 2011])
Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013])
Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013])
(Online) SPARQL query decomposed Q = {Q1, . . . , Qk} ⇒ query graph is decomposed
Distributed execution of {Q1, . . . , Qk} over {D1, . . . , Dn}
Examples: GraphPartition [Huang et al., 2011], WARP [Hose and Schenkel, 2013], Partout [Galarraga et al., 2014], Vertex-block [Lee and Liu, 2013]
High performance
Great for parallelizing centralized RDF data
May not be possible to re-partition and re-allocate Web data (i.e., LOD)
Each approach requires a specific partitioning strategy – no generic partitioning
Query decomposition may not be easy
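To make the offline/online split concrete, here is a small sketch that uses hash partitioning on the subject. That strategy is chosen only for illustration and is not the specific scheme of any system cited above; the function names are hypothetical. With subject hashing, all triples of one subject land in the same fragment, so a query can be decomposed into star-shaped subqueries that each run within a single fragment.

```python
import zlib


def site_of(subject: str, n_sites: int) -> int:
    """Deterministically assign a subject to one of n_sites fragments."""
    return zlib.crc32(subject.encode("utf-8")) % n_sites


def partition_by_subject(triples, n_sites):
    """Offline step: D = {D1, ..., Dn}, one fragment per site."""
    fragments = [[] for _ in range(n_sites)]
    for s, p, o in triples:
        fragments[site_of(s, n_sites)].append((s, p, o))
    return fragments


def decompose_into_stars(patterns):
    """Online step: Q = {Q1, ..., Qk}, one star subquery per subject term."""
    stars = {}
    for s, p, o in patterns:
        stars.setdefault(s, []).append((s, p, o))
    return list(stars.values())
```

The results of the star subqueries still have to be joined across sites on their shared variables, which is where the cost of a poor partitioning shows up.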

Partial Query Evaluation (PQE)
RDF data warehouse is partitioned and distributed as before
RDF data D = {D1, . . . , Dn}
SPARQL query is not decomposed
Partial query evaluation – Distributed gStore [Peng et al., 2016]
f(x) ⇒ f(s, d) ⇒ f(f(s), d) ⇒ final answer
s: known inputs; d: unknown inputs; f(s): partial results
Query is the function and each Di is the known input

Distributed SPARQL Using PQE [Peng et al., 2016]
Two steps:
1. Evaluate a query at each site to find local matches
These are local partial matches
2. Assemble the partial matches to get the final result
Crossing match
Centralized assembly
Distributed assembly
[Figure: fragments D1 to D4 with a crossing match that spans fragment boundaries]
High performance due to parallelization
Do not have to deal with query decomposition
May not be possible to re-partition and re-allocate Web data (i.e., LOD)
RDF storage sites need to be modified to handle partial query processing
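The two steps can be illustrated with a much simplified sketch. It is not the algorithm of [Peng et al., 2016], which computes maximal local partial matches of the query graph; here each site simply binds every triple pattern against its own triples, and a centralized assembly step joins those bindings. All function names are hypothetical.

```python
from itertools import product


def match(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if binding.get(p, t) != t:
                return None
            binding[p] = t
        elif p != t:
            return None
    return binding


def local_partial_matches(fragment, patterns):
    """Step 1: at one site, bind each pattern using local triples only."""
    return {i: [b for t in fragment if (b := match(p, t)) is not None]
            for i, p in enumerate(patterns)}


def compatible(b1, b2):
    """Two bindings agree if every shared variable has the same value."""
    return all(b2.get(k, v) == v for k, v in b1.items())


def assemble(per_site, n_patterns):
    """Step 2 (centralized assembly): join partial matches across sites."""
    results = [{}]
    for i in range(n_patterns):
        candidates = [b for site in per_site for b in site[i]]
        results = [{**r, **b} for r, b in product(results, candidates)
                   if compatible(r, b)]
    return results
```

In the usage below, the only answer is a crossing match: neither fragment contains it alone, yet the query itself is never decomposed.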

Outline
Hybrid approaches
4 Conclusions

SPARQL endpoints can
process SPARQL queries
Non-SPARQL endpoints
require additional
components
Issues
Query decomposition
Localization (source
selection)
Result composition

No data re-partitioning/re-distribution
Consider D = D1 ∪ D2 ∪ . . . ∪ Dn; Di : SPARQL endpoint
SPARQL query decomposed Q = {Q1, . . . , Qk}
E.g.: SPLENDID [Görlitz and Staab, 2011], ANAPSID [Acosta et al., 2011]
[Figure: RDF sources (Jamendo, SWDogFood, GeoNames, LinkedMDB, DBPedia, NYTimes) accessed through a control site that holds metadata]
Metadata
Data included at the source
Supported access patterns
Statistical information
· · ·
Data integration approach
May be the only way to proceed if RDF data is already distributed with autonomous owners
Not all RDF data storage points are SPARQL endpoints

Not All RDF Storage Sites are SPARQL Endpoints
Use the mediator-wrapper paradigm
Wrappers provide SPARQL endpoint functionality
Mediators may be introduced if wrappers are thin
E.g.: DARQ [Quilitz and Leser, 2008], FedX [Schwarte et al.,
2011b]
[Figure: RDF storage sites A and B, which are not SPARQL endpoints, are accessed through wrappers and a mediator; they appear alongside the RDF sources Jamendo, SWDogFood, GeoNames, LinkedMDB, DBPedia and NYTimes, under the control site and its metadata]

Federated Query Processing
[Figure: processing pipeline: query decomposition & source selection → SPARQL queries → local evaluation at the RDF sources (Jamendo, SWDogFood, GeoNames, LinkedMDB, DBPedia, NYTimes) → SPARQL matches → join partial matches; a control site is also shown]

Query Decomposition
Each triple pattern has to be mapped to a set of RDF sources based on the values of its subject, property, and object.
SELECT ?x ?n
WHERE {
?x g:parentFeature ?k . # {GeoNames}
?k g:name "Canada" . # {GeoNames}
?y sameAs ?x . # {DBPedia, GeoNames, NYTimes, SWDogFood, LinkedMDB}
?y n:topicPage ?n . # {NYTimes}
}
The query is decomposed into three subqueries:
q1 @{GeoNames}
SELECT ?x
WHERE {
?x g:parentFeature ?k .
?k g:name "Canada" .
}
q2 @{. . .}
SELECT ?y ?x
WHERE {
?y sameAs ?x .
}
q3 @{NYTimes}
SELECT ?y ?n
WHERE {
?y n:topicPage ?n .
}
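One simple way to arrive at such a decomposition is to merge triple patterns that can only be answered by the same single source and to keep every multi-source pattern as its own subquery. The sketch below assumes the source sets are already known; the function name and data layout are hypothetical, and it does not check that merged patterns are join-connected.

```python
def decompose(annotated_patterns):
    """Group patterns into subqueries as (source set, [patterns]) pairs.

    annotated_patterns is a list of (pattern, sources) pairs, where sources
    is the set of RDF sources that may answer the pattern.
    """
    exclusive = {}    # single source -> patterns answerable only there
    subqueries = []
    for pattern, sources in annotated_patterns:
        if len(sources) == 1:
            (src,) = sources
            if src not in exclusive:
                exclusive[src] = []
                subqueries.append(({src}, exclusive[src]))
            exclusive[src].append(pattern)
        else:
            # Several sources may contribute, so the pattern stays alone
            # and its matches from all of them are unioned later.
            subqueries.append((set(sources), [pattern]))
    return subqueries
```

Sending a merged group to its single source lets that source evaluate the join locally, so fewer intermediate results travel to the control site.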

Data Localization
Metadata-based approaches
Use the information in the metadata repository to determine
which sources are relevant
DARQ [Quilitz and Leser, 2008]
QTree [Harth et al., 2010; Prasser et al., 2012]
HiBISCuS [Saleem and Ngomo, 2014]
. . .
ASK query-based approach
Asking whether or not a triple pattern has an answer at a
source
FedX [Schwarte et al., 2011a,b]
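The ASK-based idea can be sketched without any network access. In the sketch below each endpoint is simulated by an in-memory list of triples, and `ask` is a hypothetical stand-in for sending a SPARQL ASK request for one triple pattern; a real system would issue these requests over HTTP and would typically cache the answers.

```python
def ask(endpoint_triples, pattern):
    """Stand-in for an ASK request: does any triple match the pattern?

    Variables start with '?'. Repeated variables are not checked, which is
    enough for this illustration.
    """
    return any(
        all(p.startswith("?") or p == t for p, t in zip(pattern, triple))
        for triple in endpoint_triples
    )


def select_sources(endpoints, patterns):
    """Map each triple pattern to the set of endpoints that answered yes."""
    return {
        pattern: {name for name, triples in endpoints.items()
                  if ask(triples, pattern)}
        for pattern in patterns
    }
```

Compared with a metadata repository, this needs no precomputed statistics, at the price of one request per pattern and source.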

Query Processing over Federated RDF Systems
[Figure: the example query is decomposed using the metadata; the subquery over ?y sameAs ?x (SELECT ?y ?x) appears five times, next to the SELECT ?x and SELECT ?y ?n subqueries, across the sources Jamendo, SWDogFood, GeoNames, LinkedMDB, DBPedia and NYTimes]

UniProt Federation – EBI RDF Platform
Curated computational models of biological processes
Sample information for reference samples and samples for which data exist in one of the EBI’s assay databases
Curated chemical database of bioactive molecules with drug-like properties
Genome databases for vertebrates and other eukaryotic species
Gene expression data from the Gene Expression Atlas
Curated and peer-reviewed pathways

Federated Access to UniProt Collection
Get the Reactome pathways with which Q16850 is associated, then get all the other proteins in those pathways and pull out their expression from the atlas, along with the GO annotations from UniProt
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX upc: <http://purl.uniprot.org/core/>
SELECT DISTINCT ?pathwayname ?expressionValue ?golabel
WHERE {
  # Get the pathways that reference Q16850
  ?pathway rdf:type biopax3:Pathway .
  ?pathway biopax3:displayName ?pathwayname .
  ?pathway biopax3:pathwayComponent
    [?rel [biopax3:entityReference ?dbXref]] .
  ?pathway biopax3:pathwayComponent
    [?rel [biopax3:entityReference <http://purl.uniprot.org/uniprot/Q16850>]] .
  # Get the expression for those proteins
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?value rdfs:label ?expressionValue .
    ?value atlasterms:pValue ?pvalue .
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref ?dbXref .
  }
  # Get the GO functions from UniProt
  SERVICE <http://uniprot.org/sparql> {
    ?dbXref a upc:Protein ;
      upc:classifiedWith ?keyword .
    ?keyword rdfs:seeAlso ?goid .
    ?goid rdfs:label ?golabel .
  }
}

Outline
Hybrid approaches
4 Conclusions

Live Query Processing
Not all data resides at SPARQL endpoints
Freshness of access to data is important
Potentially countably infinite data sources
Live querying
On-line execution
Only rely on linked data principles
Alternatives
Traversal-based approaches
Hybrid approaches

Linked Data Model [Hartig, 2012]
Web of Linked Data
Given a finite or countably infinite set 𝒟 of Linked Documents, a Web of Linked Data is a tuple W = (D, adoc, data) where:
D ⊆ 𝒟,
adoc is a partial mapping from URIs to D, and
data is a total mapping from D to finite sets of RDF triples.
Data Links
A Web of Linked Data W = (D, adoc, data) contains a data link from document d ∈ D to document d′ ∈ D if there exists a URI u such that:
u is mentioned in an RDF triple t ∈ data(d), and
d′ = adoc(u).
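The definition translates almost directly into code. The sketch below is an illustration under assumptions: documents are identified by plain strings, `adoc` is a dictionary (so URIs without an entry model the partial mapping), and `data` maps each document to a set of triples.

```python
def data_links(D, adoc, data):
    """Return all pairs (d, d2) such that W contains a data link from d to d2."""
    links = set()
    for d in D:
        for triple in data[d]:
            for term in triple:
                target = adoc.get(term)   # None when adoc is undefined for term
                if target is not None and target in D:
                    links.add((d, target))
    return links
```

Note that the definition does not exclude a document linking to itself, which happens whenever a document mentions a URI that resolves to that same document.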

SPARQL Query Semantics in Live Querying
Full-web semantics
Scope of evaluating a SPARQL expression is all Linked Data
Query result completeness cannot be guaranteed by any
(terminating) execution
Reachability-based query semantics
Query consists of a SPARQL expression, a set of seed URIs S,
and a reachability condition c
Scope: all data along paths of data links that satisfy the
condition
Computationally feasible

Traversal Approaches
Discover relevant URIs recursively by traversing (specific) data links at query execution runtime [Hartig, 2013b; Ladwig and Tran, 2011]
Implements reachability-based query semantics
Start from a set of seed URIs
Recursively follow and discover new URIs
Important issue is selection of seed URIs
Retrieved data serves to discover new URIs and to construct result
Advantages
Easy to implement.
No data structure to maintain.
Disadvantages
Possibilities for parallelized data retrieval are limited
Repeated data retrieval introduces significant query latency.
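A minimal traversal loop might look as follows. It is a sketch under assumptions, not the algorithm of any cited system: `lookup` is a hypothetical stand-in for an HTTP GET that returns the triples of the document behind a URI (or None), `condition` plays the role of the reachability condition, and `max_docs` is an added safety bound so the loop always terminates.

```python
from collections import deque


def traverse(seeds, lookup, condition=lambda triple, uri: True, max_docs=100):
    """Collect the triples reachable from the seed URIs by following data links."""
    queue, seen, triples = deque(seeds), set(seeds), set()
    fetched = 0
    while queue and fetched < max_docs:
        uri = queue.popleft()
        doc = lookup(uri)            # one document retrieval per URI
        fetched += 1
        if doc is None:
            continue
        for triple in doc:
            triples.add(triple)      # retrieved data feeds the query result
            for term in triple:
                is_new_uri = term.startswith("http") and term not in seen
                if is_new_uri and condition(triple, term):
                    seen.add(term)   # and also yields new URIs to follow
                    queue.append(term)
    return triples
```

The query is then evaluated over the collected triples. Because each retrieval can reveal the next URIs to fetch, lookups depend on one another, which is what limits parallel retrieval.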

Index Approaches
Use pre-populated index to determine relevant URIs (and to avoid as many irrelevant ones as possible)
Different index keys possible; e.g., triple patterns [Umbrich et al., 2011]
Index entries: a set of URIs
Indexed URIs may appear multiple times (i.e., associated with multiple index keys)
Each URI in such an entry may be paired with a cardinality (utilized for source ranking)
Key: tp; Entry: {uri1, uri2, . . . , urin}
GET urii
Advantages
Data retrieval can be fully parallelized
Reduces the impact of data retrieval on query execution time
Disadvantages
Querying can only start after index construction
Depends on what has been selected for the index
Freshness may be an issue
Index maintenance
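As a rough sketch of the lookup and retrieval steps: the index below is assumed to be a dictionary from triple pattern to a `{uri: cardinality}` entry, ranking by summed cardinality is one possible choice rather than the method of the cited work, and `fetch` is a hypothetical stand-in for an HTTP GET.

```python
from concurrent.futures import ThreadPoolExecutor


def relevant_uris(index, patterns):
    """Union the index entries of all patterns; highest total cardinality first."""
    score = {}
    for tp in patterns:
        for uri, cardinality in index.get(tp, {}).items():
            score[uri] = score.get(uri, 0) + cardinality
    return sorted(score, key=lambda u: (-score[u], u))


def retrieve_all(uris, fetch, workers=4):
    """All URIs are known up front, so the GETs can run in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(uris, pool.map(fetch, uris)))
```

Since the full list of URIs comes from the index before any document is fetched, none of the lookups has to wait for another one.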

Hybrid Approach
Perform a traversal-based execution using a prioritized list of
URIs to look up [Ladwig and Tran, 2010]
Initial seed from the pre-populated index
Non-seed URIs are ranked by a function based on information
in the index
Newly discovered URIs that are not in the index are ranked
according to the number of referring documents
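The prioritized lookup list can be sketched with a heap. The sketch makes several assumptions that are not in the slides: `index_score` is a dictionary with the rank of indexed URIs, `lookup_links` is a hypothetical function returning the URIs mentioned in the document behind a URI, indexed and non-indexed ranks are compared on one scale, and `max_lookups` bounds the traversal.

```python
import heapq


def hybrid_order(seeds, index_score, lookup_links, max_lookups=50):
    """Return the URIs in the order in which they would be looked up."""
    referrers = {}   # uri -> number of referring documents seen so far
    heap = [(-index_score.get(u, 0), u) for u in seeds]
    heapq.heapify(heap)
    done, order = set(), []
    while heap and len(order) < max_lookups:
        _, uri = heapq.heappop(heap)
        if uri in done:
            continue                 # stale heap entry with an older rank
        done.add(uri)
        order.append(uri)
        for target in lookup_links(uri):
            if target in done:
                continue
            referrers[target] = referrers.get(target, 0) + 1
            # Indexed URIs use the index rank, others the referrer count.
            rank = index_score.get(target, referrers[target])
            heapq.heappush(heap, (-rank, target))
    return order
```

Re-pushing a URI whenever its referrer count grows, and skipping stale entries on pop, avoids the need for a decrease-key operation.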

Outline
Hybrid approaches
4 Conclusions

Conclusions
RDF and Linked Open Data seem to have considerable promise for Web data management
There are prototype systems that provide alternative solutions
There are commercial systems as well
See https://www.w3.org/wiki/SparqlImplementations for a list
More work needs to be done
Query semantics
Adaptive system design
Optimizations – both in data warehousing and distributed environments
Live querying requires significant thought to reduce latency

Conclusions
What I did not talk about:
Not much on general distributed/parallel processing
Not much on SPARQL semantics
Nothing about RDFS – no schema stuff
Nothing about entailment regimes > 0 ⇒ no reasoning

Thank you!
Research supported by

References I
Abadi, D. J., Marcus, A., Madden, S., and Hollenbach, K. (2009). SW-Store: a
vertically partitioned DBMS for semantic web data management. VLDB J.,
18(2):385–406.
Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. (2007). Scalable
semantic web data management using vertical partitioning. In Proc. 33rd
Int. Conf. on Very Large Data Bases, pages 411–422.
Acosta, M., Vidal, M.-E., Lampo, T., Castillo, J., and Ruckhaus, E. (2011).
ANAPSID: an adaptive query processing engine for SPARQL endpoints. In
Proc. 10th Int. Semantic Web Conf., pages 18–34.
Aluç, G., Hartig, O., Özsu, M. T., and Daudjee, K. (2014a). Diversified stress
testing of RDF data management systems. In Proc. 13th Int. Semantic Web
Conf., pages 197–212.
Aluç, G., Özsu, M. T., and Daudjee, K. (2014b). Workload matters: Why RDF
databases need a new design. Proc. VLDB Endowment, 7(10):837–840.

References II
Aluç, G., Özsu, M. T., Daudjee, K., and Hartig, O. (2013). chameleon-db: a
workload-aware robust RDF data management system. Technical Report
CS-2013-10, University of Waterloo. Available at
https://cs.uwaterloo.ca/sites/ca.computer-science/files/
uploads/files/CS-2013-10.pdf.
Bornea, M. A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P.,
Udrea, O., and Bhattacharjee, B. (2013). Building an efficient RDF store
over a relational database. In Proc. ACM SIGMOD Int. Conf. on
Management of Data, pages 121–132.
Galarraga, L., Hose, K., and Schenkel, R. (2014). Partout: a distributed engine
for efficient RDF processing. In Proc. 23rd Int. World Wide Web Conf.
(Companion Volume), pages 267–268.
Görlitz, O. and Staab, S. (2011). SPLENDID: SPARQL endpoint federation
exploiting VOID descriptions. In Proc. 2nd Int. Workshop on Consuming
Linked Data.

References III
Gurajada, S., Seufert, S., Miliaraki, I., and Theobald, M. (2014). TriAD: A
distributed shared-nothing RDF engine based on asynchronous message
passing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages
289–300.
Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K., and Umbrich, J.
(2010). Data summaries for on-demand queries over linked data. In Proc.
19th Int. World Wide Web Conf., pages 411–420. Available from:
http://doi.acm.org/10.1145/1772690.1772733.
Hartig, O. (2012). SPARQL for a web of linked data: Semantics and
computability. In Proc. 9th Extended Semantic Web Conf., pages 8–23.
Hartig, O. (2013a). An overview on execution strategies for linked data queries.
Datenbank-Spektrum, 13(2):89–99. Available from:
http://dx.doi.org/10.1007/s13222-013-0122-1.
Hartig, O. (2013b). SQUIN: a traversal based query execution system for the
web of linked data. In Proc. ACM SIGMOD Int. Conf. on Management of
Data, pages 1081–1084.

References IV
Hose, K. and Schenkel, R. (2013). WARP: Workload-aware replication and
partitioning for RDF. In Proc. Workshops of 29th Int. Conf. on Data
Engineering, pages 1–6.
Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable SPARQL querying of
large RDF graphs. Proc. VLDB Endowment, 4(11):1123–1134.
Husain, M. F., McGlothlin, J., Masud, M. M., Khan, L. R., and Thuraisingham,
B. (2011). Heuristics-based query processing for large RDF graphs using
cloud computing. IEEE Trans. Knowl. and Data Eng., 23(9):1312–1327.
Kaoudi, Z. and Manolescu, I. (2015). RDF in the clouds: A survey. VLDB J.,
24:67–91.
Khadilkar, V., Kantarcioglu, M., Thuraisingham, B. M., and Castagna, P.
(2012). Jena-HBase: A distributed, scalable and efficient RDF triple store.
In Proc. International Semantic Web Conference Posters & Demos Track.
Ladwig, G. and Tran, T. (2010). Linked data query processing strategies. In
Proc. 9th Int. Semantic Web Conf., pages 453–469.
Ladwig, G. and Tran, T. (2011). SIHJoin: Querying remote and local linked
data. In Proc. 8th Extended Semantic Web Conf., pages 139–153.

References V
Lee, K. and Liu, L. (2013). Scaling queries over big RDF graphs with semantic
hash partitioning. Proc. VLDB Endowment, 6(14):1894–1905. Available
from: http://www.vldb.org/pvldb/vol6/p1894-lee.pdf.
Neumann, T. and Weikum, G. (2008). RDF-3X: a RISC-style engine for RDF.
Proc. VLDB Endowment, 1(1):647–659.
Neumann, T. and Weikum, G. (2009). The RDF-3X engine for scalable
management of RDF data. VLDB J., 19(1):91–113.
Özsu, M. T. (2016). A survey of RDF data management systems. Front.
Comput. Sci., 10(3):418–432.
Peng, P., Zou, L., Özsu, M. T., Chen, L., and Zhao, D. (2016). Processing
SPARQL queries over distributed RDF graphs. VLDB J., 25(2):243–268.
Prasser, F., Kemper, A., and Kuhn, K. A. (2012). Efficient distributed query
processing for autonomous rdf databases. In Proc. 15th Int. Conf. on
Extending Database Technology, pages 372–383.
Quilitz, B. and Leser, U. (2008). Querying distributed RDF data sources with
SPARQL. In Proc. 5th European Semantic Web Conf., pages 524–538.

References VI
Rohloff, K. and Schantz, R. E. (2010). High-performance, massively scalable
distributed systems using the MapReduce software framework: the SHARD
triple-store. In Proc. Int. Workshop on Programming Support Innovations
for Emerging Distributed Applications. Article No. 4.
Saleem, M. and Ngomo, A. N. (2014). HiBISCuS: Hypergraph-based source
selection for SPARQL endpoint federation. In Proc. 11th Extended Semantic
Web Conf., pages 176–191. Available from:
http://dx.doi.org/10.1007/978-3-319-07443-6_13.
Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (2011a).
FedX: A federation layer for distributed query processing on linked open
data. In Proc. 8th Extended Semantic Web Conf., pages 481–486.
Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (2011b).
FedX: Optimization techniques for federated query processing on linked
data. In Proc. 10th Int. Semantic Web Conf., pages 601–616. Available
from: https://doi.org/10.1007/978-3-642-25073-6_38.
Umbrich, J., Hose, K., Karnstedt, M., Harth, A., and Polleres, A. (2011).
Comparing data summaries for processing live queries over linked data.
World Wide Web J., 14(5-6):495–544.

References VII
Weiss, C., Karras, P., and Bernstein, A. (2008). Hexastore: sextuple indexing
for semantic web data management. Proc. VLDB Endowment,
1(1):1008–1019.
Wilkinson, K. (2006). Jena property table implementation. Technical Report
HPL-2006-140, HP Laboratories Palo Alto.
Yuan, P., Liu, P., Wu, B., Jin, H., Zhang, W., and Liu, L. (2013). TripleBit: a
fast and compact system for large scale RDF data. Proc. VLDB
Endowment, 6(7):517–528. Available from:
http://www.vldb.org/pvldb/vol6/p517-yuan.pdf.
Zhang, X., Chen, L., Tong, Y., and Wang, M. (2013). EAGRE: Towards
scalable I/O efficient SPARQL query evaluation on the cloud. In Proc. 29th
Int. Conf. on Data Engineering, pages 565–576.
Zou, L., Mo, J., Chen, L., Özsu, M. T., and Zhao, D. (2011). gStore:
answering SPARQL queries via subgraph matching. Proc. VLDB
Endowment, 4(8):482–493.
Zou, L. and Özsu, M. T. (2017). Graph-based RDF data management. Data
Science and Engineering, 2(1):56–70. Available from:
https://dx.doi.org/10.1007/s41019-016-0029-6.

References VIII
Zou, L., Özsu, M. T., Chen, L., Shen, X., Huang, R., and Zhao, D. (2014).
gStore: A graph-based SPARQL query engine. VLDB J., 23(4):565–590.