Linked open data - how to juggle with more than a billion triples

How to Juggle with More than a Billion Triples?

Ansgar Scherp
Research Group on Data and Web Science
Universität Mannheim
October 2012
Image source: http://www.flickr.com/photos/pedromourapinheiro/2122754745/
Ansgar Scherp – ansgar@informatik.uni-mannheim.de

My thanks go to …
• Marianna
• Simon Schenk
• Carsten Saathoff
• Thomas Franz
• Thomas Gottron
• Steffen Staab
• Arne Peters
• Bastian Krayer
• Daniel Eißing
• Mathias Konrath
• Daniel Schmeiß
• Anton Baumesberger
• Frederik Jochum
• Alexander Kleinen
And many more …


Scenario

• Tim plans to travel
– from London
– to a customer in Cologne


Website of the German Railway

It works, why bother…?

Let's Try Different Queries

• Bottlenecks in public transportation?
• Compare the connections with flights?
• Visualize on a map?
…

→ All these queries cannot be answered,
because the data …


… locked in Silos!

– High Integration Effort
– Lack of Data Reuse
B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY

Linked Data
• Publishing and interlinking of data
• Different quality and purpose
• From different sources in the Web

World Wide Web     | Linked Data
Documents          | Data
Hyperlinks         | Typed Links
HTML               | RDF
Addresses (URIs)   | Addresses (URIs)

Example: http://www.uni-mannheim.de/

Relevance of Linked Data?


Linked Data: May '07 → Sept. '11
[Figure: LOD cloud diagram with the domains Web 2.0, Media, Publications, eGovernment, Cross-Domain, Life Sciences, Geographic]
< 31 Billion Triples (Source: http://lod-cloud.net)

Linked Data Principles

1. Identification
2. Interlinkage
3. Dereferencing
4. Description


Example: Big Lynx
Matt Briggs

Scott Miller
?
Big Lynx
Company


1. Use URIs for Identification

Matt Briggs: http://biglynx.co.uk/people/matt-briggs
Scott Miller: http://biglynx.co.uk/people/scott-miller

B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY

Example: Big Lynx
Matt Briggs

Scott Miller
Big Lynx Company

→ How to model relationships like knows?


Resource Description Framework (RDF)
• Description of resources with RDF triples

Subject      Predicate  Object
Matt Briggs  is a       Person

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://biglynx.co.uk/people/matt-briggs>
    rdf:type foaf:Person .
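
The triple above can be sketched in plain Python as a (subject, predicate, object) tuple. The helper `to_ntriples` and the tuple layout are illustrative assumptions and not part of the slides; only the URIs are taken from the example.

```python
# Namespaces used by the example triple.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# One RDF triple as a plain (subject, predicate, object) tuple of URIs.
triple = (
    "http://biglynx.co.uk/people/matt-briggs",
    RDF + "type",
    FOAF + "Person",
)


def to_ntriples(t):
    """Serialize a URI-only triple as one N-Triples line (hypothetical helper)."""
    s, p, o = t
    return f"<{s}> <{p}> <{o}> ."


line = to_ntriples(triple)
print(line)
```

Literals and blank nodes would need their own serialization rules; the sketch covers only the URI-only case shown on the slide.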

1. Use URIs also for Relations

people/matt-briggs

people/scott-miller

B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY

Example: Big Lynx
[Figure: graph with the nodes Dave Smith, London, Matt Briggs, Scott Miller, Big Lynx Company, DBpedia and Matt's private website; edge labels "lives here" and "same person"]

2. Establishing Interlinkage
• Relation links between resources
<http://biglynx.co.uk/people/dave-smith>
    foaf:based_near
    <http://dbpedia.org/resource/London> .

• Identity links between resources
<http://biglynx.co.uk/people/matt-briggs>
    owl:sameAs
    <http://www.matt-briggs.eg.uk#me> .
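
Identity links are symmetric and transitive, so a consumer may want to group all URIs that denote the same thing. The following is a minimal sketch of that grouping with a union-find structure; the function name `same_as_groups` and the third URI are assumptions made for illustration only.

```python
def same_as_groups(links):
    """Group URIs connected by identity links (symmetric and transitive)."""
    parent = {}

    def find(u):
        # Return the representative of u's group, creating it on first use.
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for u in list(parent):
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())


links = [
    ("http://biglynx.co.uk/people/matt-briggs", "http://www.matt-briggs.eg.uk#me"),
    # Hypothetical third URI, added only to show transitivity.
    ("http://www.matt-briggs.eg.uk#me", "http://example.org/id/matt"),
]
groups = same_as_groups(links)
print(groups)
```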

Example: Big Lynx
[Figure: graph with the nodes Dave Smith, London, Matt Briggs, Big Lynx Company, DBpedia and Matt's private website; the edge "lives here" is now labeled foaf:based_near and the edge "same person" is labeled owl:sameAs]

3. Dereferencing of URIs

• Looking up of web documents

• How can we "look up" things of the real world?

people/matt-briggs


Two Approaches
1. Hash URIs
– URI contains a part separated by #, e.g.,
http://biglynx.co.uk/vocab/sme#Team

2. Negotiation via "303 See Other" redirect
Request: http://biglynx.co.uk/people/matt-briggs
Response: "Look here:"
http://biglynx.co.uk/people/matt-briggs.rdf
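
The two approaches can be sketched without any network access. In this sketch the `SEE_OTHER` table stands in for the server-side redirect configuration and `document_for` is a hypothetical helper; a real client would issue an HTTP request and follow the 303 response instead.

```python
from urllib.parse import urldefrag

# Assumed server state: thing URI -> document describing it (the 303 target).
SEE_OTHER = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}


def document_for(uri):
    """Return the URL of the document a client would end up retrieving."""
    base, fragment = urldefrag(uri)
    if fragment:
        # Hash URI: the client strips the fragment before requesting.
        return base
    # 303 URI: the server answers "See Other" with the document location.
    return SEE_OTHER.get(uri, uri)


print(document_for("http://biglynx.co.uk/vocab/sme#Team"))
print(document_for("http://biglynx.co.uk/people/matt-briggs"))
```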


Example: Big Lynx
[Figure: graph with the nodes Dave Smith, London, Matt Briggs, Big Lynx Company, DBpedia and Matt's private website; edges foaf:based_near and owl:sameAs; open question: "Description of Matt?"]

4. Description of URIs
[Figure: RDF graph around biglynx:matt-briggs with the nodes foaf:Person, dp:Birmingham, biglynx:dave-smith, dp:London and the blank node _:point; edges rdf:type, foaf:based_near, foaf:knows, ex:loc, wgs84:lat "51.509", wgs84:long "-0.118"; further edges are elided with "…"]

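Formalization of Description
• Given an RDF graph $G = (V, P, E)$ with
$V = R \cup B \cup L$ and $E \subseteq (R \cup B) \times P \times V$

• $\mathrm{SimpleCBD}(n) = \bigcup_{j=0}^{\infty} I_j$ with

$I_0 = \{ (s, p, o) \mid (s, p, o) \in E \wedge s = n \}$

$I_{j+1} = \{ (o, p', o') \in E \mid \exists (s, p, o) \in I_j : o \in B \wedge (o, p', o') \notin \bigcup_{k=0}^{j} I_k \}$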
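
The definition can be read as a breadth-first expansion: start with the triples of n, then keep following objects that are blank nodes. Below is a minimal sketch of that reading, assuming B denotes the set of blank nodes; the function name, the prefixed names and the sample graph are made up for illustration.

```python
def simple_cbd(edges, blank_nodes, n):
    """Collect the triples of n plus, transitively, those of blank-node objects."""
    by_subject = {}
    for s, p, o in edges:
        by_subject.setdefault(s, []).append((s, p, o))

    result = set(by_subject.get(n, []))   # I_0: all triples with subject n
    frontier = set(result)
    while frontier:
        nxt = set()
        for _, _, o in frontier:
            if o in blank_nodes:
                for t in by_subject.get(o, []):
                    if t not in result:   # not contained in any earlier I_k
                        nxt.add(t)
        result |= nxt                     # add I_{j+1}
        frontier = nxt
    return result


# Hypothetical sample graph using prefixed names as plain strings.
edges = [
    ("ex:a", "ex:loc", "_:point"),
    ("_:point", "wgs84:lat", "51.509"),
    ("_:point", "wgs84:long", "-0.118"),
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:b", "foaf:based_near", "dp:London"),
]
cbd = simple_cbd(edges, {"_:point"}, "ex:a")
print(sorted(cbd))
```

The triple about `ex:b` is not part of the description because `ex:b` is not a blank node, which is what keeps the description bounded.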


W3C RDF / RDF Schema Vocabulary
• Set of URIs defined in the rdf: and rdfs: namespaces
• rdf: namespace: rdf:type, rdf:Property, rdf:XMLLiteral, rdf:List, rdf:first, rdf:rest, rdf:Seq, rdf:Bag, rdf:Alt, rdf:value, ...
• rdfs: namespace: rdfs:domain, rdfs:range, rdfs:Resource, rdfs:Literal, rdfs:Datatype, rdfs:Class, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:comment, rdfs:label, …

Semantic Web Layer Cake (Simplified)


Exploration of Linked Data

[Figure: LOD cloud diagram; labeled data sources: WordNet, Swoogle, GeoNames]
< 31 Billion Triples (Source: http://lod-cloud.net)

Naive Approach
• Download all data
• Store in a really big database
• Programming of queries
• Design of user interface
[Figure labels: RDFS Rules, WordNet, Swoogle, Geo, GeoNames]

Drawbacks: inflexible, monolithic, not scalable

SemaPlorer Approach
• Flexible
• Extensible
• Scalable
[Figure labels: birthplace, placeOfBirth, birthplace; RDFS Rules, Fulltext, Geo, Queries]
[Figure: WordNet + Swoogle + GeoNames and further, unlabeled sources; > 1 Billion Triples]
12 months in 2005/06 → 700 million triples

SemaPlorer – Semantic Social Media

Watch video online: http://vimeo.com/2057249

Billion Triple Challenge 2008

[JWS 2009]

Searching for Linked Data Sources

Persons that are
- Politicians and
- Actors
?

< 31 Billion Triples (Source: http://lod-cloud.net)

Idea: Index of Data Sources
SELECT ?x
FROM …
WHERE {
?x rdf:type ex:Actor .
?x rdf:type ex:Politician .
}

[Figure: the query "Politician and Actor" is posed against the index; which data sources answer it?]

The Naive Approach
1. Download the entire LOD cloud
2. Put it into a (really) large triple store
3. Process the data and extract schema
4. Provide lookup

- Big machinery
- Late in processing the data
- High effort to scale with LOD cloud


Idea
• Schema-level index
– Define families of graph patterns
– Assign instances to graph patterns
– Map graph patterns to context (source URI)
• Construction
– Stream-based for scalability
– Little loss of accuracy
• Note
– Index defined over instances
– But stores the context

Input Data
• n-Quads
<subject> <predicate> <object> <context>
• Example:
<http://www.w3.org/People/Connolly/#me>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#…
<http://xmlns.com/foaf/0.1/Person>
<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf>
[Figure: w3p:#me linked to foaf:Person, inside the context http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf]
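
Reading such input can be sketched with a small parser for the simplest case, where all four terms are IRIs in angle brackets. The regular expression and `parse_quad` are illustrative assumptions (literals and blank nodes are not handled). The predicate is truncated on the slide, so the sample line assumes rdf:type.

```python
import re

# Matches a quad whose four terms are all IRIs in angle brackets.
QUAD = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$"
)


def parse_quad(line):
    """Split one IRI-only N-Quads line into (subject, predicate, object, context)."""
    m = QUAD.match(line)
    if m is None:
        raise ValueError(f"not an IRI-only quad: {line!r}")
    return m.groups()


# Sample line; the predicate rdf:type is an assumption (truncated in the source).
line = (
    "<http://www.w3.org/People/Connolly/#me> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> ."
)
s, p, o, c = parse_quad(line)
print(s, p, o, c, sep="\n")
```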


SchemEX Approach
• Stream-based schema extraction
• While crawling the data
[Figure: architecture with the components LOD-Crawler, RDF-Dump, Parser (NxParser), Nquad-Stream, Schema-Extractor, FIFO Instance-Cache, Schema-Level Index, RDF, Triple Store, RDBMS]

Building the Index from a Stream
• Stream of n-quads (coming from an LD crawler)
… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
[Figure: FiFo cache with entries labeled C1, C2, C3]

• Linear runtime complexity w.r.t. the number of input triples
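
A minimal sketch of the windowing idea, assuming the cache holds a bounded number of instances (subjects) and evicts the oldest one first. The function name and the data layout are assumptions and not the SchemEX implementation. Each quad is handled once with constant work, which gives the linear runtime.

```python
from collections import OrderedDict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def stream_type_sets(quads, cache_size):
    """Yield (subject, types, contexts) whenever an instance leaves the FIFO cache."""
    cache = OrderedDict()  # subject -> (set of types, set of contexts)
    for s, p, o, c in quads:
        if s not in cache:
            if len(cache) >= cache_size:
                # Cache is full: evict the oldest instance and emit it.
                old, (types, ctxs) = cache.popitem(last=False)
                yield old, frozenset(types), frozenset(ctxs)
            cache[s] = (set(), set())
        types, ctxs = cache[s]
        ctxs.add(c)
        if p == RDF_TYPE:
            types.add(o)
    # End of stream: flush everything that is still cached.
    for s, (types, ctxs) in cache.items():
        yield s, frozenset(types), frozenset(ctxs)


# Hypothetical stream: instance "a" appears again after "b".
quads = [
    ("a", RDF_TYPE, "T1", "c1"),
    ("b", RDF_TYPE, "T2", "c1"),
    ("a", RDF_TYPE, "T3", "c2"),
]
small = list(stream_type_sets(quads, cache_size=1))
large = list(stream_type_sets(quads, cache_size=2))
print(small)
print(large)
```

With a cache of size 1, instance `a` is evicted before its second quad arrives and is emitted twice with partial type sets; with size 2 it is emitted once with both types. This is the kind of accuracy loss a too-small window can cause.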

Building the Schema and Index
• RDF classes: C1, C2, C3, …, Ck
– consistsOf
• Type clusters: TC1, TC2, …, TCm
– hasEQClass (via properties p1, p2)
• Equivalence classes: EQC1, EQC2, …, EQCn
– hasDataSource
• Data sources: DS1, DS2, DS3, DS4, DS5, …, DSx

Layer 1: RDF Classes
• All instances of a particular type

SELECT ?x
FROM …
WHERE {
?x rdf:type foaf:Person .
}

[Figure: class C1 = foaf:Person mapped to the data sources DS 1, DS 2, DS 3, e.g. http://dig.csail.mit.edu/2008/... and http://www.w3.org/People/Berners-Lee/card; instance timbl:card#i of type foaf:Person]
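
A minimal sketch of this first layer as a lookup table from class to data sources. The function `class_index` and the use of prefixed names as plain strings are assumptions for illustration.

```python
from collections import defaultdict


def class_index(quads):
    """Map each RDF class to the set of contexts that contain instances of it."""
    index = defaultdict(set)
    for s, p, o, c in quads:
        if p == "rdf:type":
            index[o].add(c)
    return index


quads = [
    ("timbl:card#i", "rdf:type", "foaf:Person",
     "http://www.w3.org/People/Berners-Lee/card"),
]
sources = class_index(quads)["foaf:Person"]
print(sources)
```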


Layer 2: Type Clusters
• All instances belonging to exactly the same set of types

SELECT ?x
FROM …
WHERE {
?x rdf:type foaf:Person .
?x rdf:type pim:Male .
}

[Figure: type cluster TC1 = tc4711 over the classes C1 = foaf:Person and C2 = pim:Male, mapped to the data sources DS 1, DS 2, DS 3, e.g. http://www.w3.org/People/Berners-Lee/card; instance timbl:card#i with the types foaf:Person and pim:Male]
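
A type cluster can be sketched as the frozen set of all types of an instance, used as the index key. The function name and the second, hypothetical data source are assumptions for illustration.

```python
from collections import defaultdict


def type_cluster_index(quads):
    """Map each exact set of types to the contexts of the instances that have it."""
    types = defaultdict(set)
    contexts = defaultdict(set)
    for s, p, o, c in quads:
        if p == "rdf:type":
            types[s].add(o)
            contexts[s].add(c)
    index = defaultdict(set)
    for s, ts in types.items():
        index[frozenset(ts)] |= contexts[s]
    return index


card = "http://www.w3.org/People/Berners-Lee/card"
quads = [
    ("timbl:card#i", "rdf:type", "foaf:Person", card),
    ("timbl:card#i", "rdf:type", "pim:Male", card),
    # Hypothetical instance that is only a foaf:Person.
    ("ex:someone", "rdf:type", "foaf:Person", "http://example.org/data"),
]
tc_index = type_cluster_index(quads)
print(tc_index[frozenset({"foaf:Person", "pim:Male"})])
```

The instance that is only a foaf:Person falls into a different cluster, so its source is not returned for the combined query.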

Layer 3: Equivalence Classes
• Two instances are equivalent iff:
– They are in the same TC
– They have the same properties
– The property targets are in the same TC
• Similar to 1-Bisimulation
[Figure: classes C1, C2, C3; type clusters TC1 and TC2 connected by property p; equivalence class EQC1; data sources DS 1, DS 2, DS 3]

Layer 3: Equivalence Classes
SELECT ?x
WHERE {
?x rdf:type foaf:Person .
?x rdf:type pim:Male .
?x foaf:maker ?y .
?y rdf:type foaf:PersonalProfileDocument .
}

[Figure: type cluster tc4711 (foaf:Person, pim:Male) and type cluster tc1234 (foaf:PPD); equivalence class eqc0815: tc4711 -maker- tc1234; instances timbl:card#i foaf:maker timbl:card; data source http://www.w3.org/People/Berners-Lee/card]
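
The three conditions can be sketched as a composite key: the type cluster of the instance plus, for every non-type property, the type cluster of its target. The function name and data layout are assumptions, and the whole input is held in memory here rather than streamed.

```python
from collections import defaultdict


def equivalence_classes(quads):
    """Map (own type cluster, {(property, target type cluster)}) to contexts."""
    types = defaultdict(set)
    links = defaultdict(set)
    contexts = defaultdict(set)
    for s, p, o, c in quads:
        contexts[s].add(c)
        if p == "rdf:type":
            types[s].add(o)
        else:
            links[s].add((p, o))

    def tc(node):
        # Type cluster of a node; empty for untyped nodes.
        return frozenset(types.get(node, ()))

    index = defaultdict(set)
    for s in contexts:
        key = (tc(s), frozenset((p, tc(o)) for p, o in links.get(s, ())))
        index[key] |= contexts[s]
    return index


card = "http://www.w3.org/People/Berners-Lee/card"
quads = [
    ("timbl:card#i", "rdf:type", "foaf:Person", card),
    ("timbl:card#i", "rdf:type", "pim:Male", card),
    ("timbl:card#i", "foaf:maker", "timbl:card", card),
    ("timbl:card", "rdf:type", "foaf:PersonalProfileDocument", card),
]
eq_index = equivalence_classes(quads)
key = (
    frozenset({"foaf:Person", "pim:Male"}),
    frozenset({("foaf:maker", frozenset({"foaf:PersonalProfileDocument"}))}),
)
print(eq_index[key])
```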

Computing SchemEX: TimBL Data Set
• Analysis of a smaller data set
• 11 M triples, TimBL's FOAF profile
• LDspider with ~ 2k triples / sec

• Different cache sizes: 100, 1k, 10k, 50k, 100k
• Compared SchemEX with reference schema
• Index queries on all Types, TCs, EQCs
• Good precision/recall ratio at 50k+
• Commodity hardware (4GB RAM, single CPU)
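
The slides do not spell out how precision and recall were computed. One plausible reading, taken here as an assumption, is to compare for each index query the data sources returned by the stream-based index with those returned by the reference schema. All names and sample values below are hypothetical.

```python
def precision_recall(found, reference):
    """Precision and recall of one query result against a reference result."""
    found, reference = set(found), set(reference)
    hits = len(found & reference)
    precision = hits / len(found) if found else 1.0
    recall = hits / len(reference) if reference else 1.0
    return precision, recall


# Hypothetical result sets for a single index query.
p, r = precision_recall({"ds1", "ds2", "ds3"}, {"ds1", "ds2", "ds4", "ds5"})
print(p, r)
```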

Quality of Stream-based Index Construction

+ Runtime hardly increases with window size
+ Memory consumption scales with window size

Computing SchemEX: Full BTC 2011 Data

Cache size: 50 k

Billion Triple Challenge 2011

…

[JWS 2012]

And 2012? Get the Google Feeling!


Semantic Data Management Chain
• Research topics in a greater context

SchemEX* OntoMDE SemaPlorer*

Publish Collect Aggregate Use

Kreuzverweis.com Core Ontologies

Mobile Facets
* Winner of Billion Triple Challenge 2011/2008
→ See: dws.informatik.uni-mannheim.de

Recommended Readings
• Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web:
Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483
(2011) URL: http://dx.doi.org/10.1007/s00287-011-0535-x
• Simon Schenk, Carsten Saathoff, Steffen Staab, Ansgar Scherp:
SemaPlorer - Interactive semantic exploration of data and media based on
a federated cloud infrastructure. J. Web Sem. 7(4): 298-304 (2009)
URL: http://dx.doi.org/10.1016/j.websem.2009.09.006
• Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp:
SchemEX — Efficient construction of a data catalogue by stream-based
indexing of linked data, J. of Web Semantics: Science, Services and
Agents on the World Wide Web, Available online 23 June 2012
URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716
• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global
Data Space, Morgan & Claypool Publishers, 2011
URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001

