Exchange and Consumption of Huge RDF Data

Digital Enterprise Research Institute www.deri.ie

Miguel A. Martínez-Prieto (1,2) <migumar2@infor.uva.es>
Mario Arias (1,3) <mario.arias@deri.org>
Javier D. Fernández (1,2) <jfergar@infor.uva.es>

1. Department of Computer Science, Universidad de Valladolid (Spain)
2. Department of Computer Science, Universidad de Chile (Chile)
3. Digital Enterprise Research Institute, National University of Ireland Galway

Copyright 2010 Digital Enterprise Research Institute. All rights reserved.

Sharing RDF in the Web of Data.

RDF is made available through:
• Dereferenceable URIs.
• RDF dumps.
• SPARQL Endpoints / APIs.
• Sensors.

Consumers then face:
• Parsing / Indexing.
• Reasoning.
• Dataset analysis.
• Setting up a SPARQL server.
• Vocabulary interlinking / integration.
• Browsing and Visualization.
• Exchange between servers.
• Data-intensive tasks.

Dataset Exchange Workflow

1º Publication: Convert, Serialize, Compress.
2º Exchange: Transfer.
3º Consumption: Decompress, Parse, Index.

If RDF is meant to be machine processable, why are we using plain text serialization formats?

HDT: RDF Binary Format

• Compact Data Structure for RDF.
• W3C Submission. http://www.w3.org/Submission/2011/03/
• Open Source C++/Java library.

HDT Focused on Querying (FoQ)

• Contribution of this paper:
– A complementary index that makes HDT fully queryable.
– Analysis of how HDT reduces exchange and indexing time.
– Evaluation of in-memory query performance.

Dictionary

• Maps strings to consecutive IDs {1..n}.
• Lexicographically sorted, no duplicates.
• Section compression is explained in [8].
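The mapping described above can be sketched in a few lines. This is a minimal illustration, assuming one flat dictionary for all terms; the names `build_dictionary` and `id_to_term` and the `ex:` terms are hypothetical, and the section compression from [8] is not modelled.

```python
def build_dictionary(terms):
    """Assign IDs 1..n to the distinct terms, in lexicographic order."""
    ordered = sorted(set(terms))  # sorted, no duplicates
    term_to_id = {term: i for i, term in enumerate(ordered, start=1)}
    return ordered, term_to_id


def id_to_term(ordered, term_id):
    """Reverse lookup: IDs are 1-based positions in the sorted list."""
    return ordered[term_id - 1]


# Hypothetical input terms, with one duplicate.
terms = ["ex:bob", "ex:alice", "ex:knows", "ex:alice"]
ordered, term_to_id = build_dictionary(terms)
```

Because the list is sorted, a string can also be located by binary search, so the forward map is only a convenience here.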

Triples Model

Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2

S:  1          2               3
P:  [2 3]      [1 2 4]         [3]
O:  [6] [2]    [3] [4 5] [1]   [2]

Adjacency Lists

List:      1        2           3
           [2, 3]   [1, 2, 4]   [3]

Position:  1 2 3 4 5 6
Array:     2 3 1 2 4 3
Bitmap:    1 0 1 0 0 1

• Operations:
– access(g) = Given a global position, get the value. O(1)
– findList(g) = Given a global position, get the list number. O(1)
– first(l), last(l) = Given a list, find its first and last positions. O(log log n)
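The three operations can be sketched over the example above. This is an illustrative version only: the class name `AdjacencyLists` is hypothetical, every list is assumed to be non-empty, and a plain sorted list of start positions stands in for the succinct bitmap, so the costs are not the ones quoted on the slide.

```python
from bisect import bisect_right


class AdjacencyLists:
    """Lists concatenated into one array; a 1 bit marks the start of each list."""

    def __init__(self, lists):
        self.array = [value for lst in lists for value in lst]
        self.bitmap = []
        for lst in lists:
            self.bitmap += [1] + [0] * (len(lst) - 1)
        # 1-based positions of the 1 bits (stand-in for select on the bitmap).
        self.starts = [g for g, bit in enumerate(self.bitmap, start=1) if bit]

    def access(self, g):
        """Value stored at global position g (1-based)."""
        return self.array[g - 1]

    def find_list(self, g):
        """Number of the list that contains global position g."""
        return bisect_right(self.starts, g)

    def first(self, l):
        """Global position of the first element of list l."""
        return self.starts[l - 1]

    def last(self, l):
        """Global position of the last element of list l."""
        if l < len(self.starts):
            return self.starts[l] - 1
        return len(self.array)


adj = AdjacencyLists([[2, 3], [1, 2, 4], [3]])
```

Positions and list numbers are 1-based to match the slide.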

Triples Model and Coding

Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2

S:         1 2 3
P:         2 3 1 2 4 3
O:         6 2 3 4 5 1 2

Array Y:   2 3 1 2 4 3
Bitmap Y:  1 0 1 0 0 1

Array Z:   6 2 3 4 5 1 2
Bitmap Z:  1 1 1 1 0 1 1

Searching by Subject

Example pattern: ( 2, 2, ? )
Patterns resolved: SPO, SP?, S??, S?O

Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2

S:         1 2 3
Array Y:   2 3 1 2 4 3
Bitmap Y:  1 0 1 0 0 1
Array Z:   6 2 3 4 5 1 2
Bitmap Z:  1 1 1 1 0 1 1
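A subject-bound lookup on the example arrays might look as follows. This is a sketch under stated assumptions: the helper names `list_bounds`, `search_sp` and `search_s` are hypothetical, the bitmap is scanned instead of using constant-time rank/select, and the predicate list is searched linearly although it is sorted and could be binary searched.

```python
# Example data from the slide (positions are 1-based in the comments).
Y = [2, 3, 1, 2, 4, 3]          # predicates, grouped by subject
BY = [1, 0, 1, 0, 0, 1]         # 1 = first predicate of a subject
Z = [6, 2, 3, 4, 5, 1, 2]       # objects, grouped by (subject, predicate)
BZ = [1, 1, 1, 1, 0, 1, 1]      # 1 = first object of a (subject, predicate) pair


def list_bounds(bitmap, l):
    """Return the 1-based (first, last) positions of the l-th list."""
    starts = [g for g, bit in enumerate(bitmap, start=1) if bit]
    first = starts[l - 1]
    last = starts[l] - 1 if l < len(starts) else len(bitmap)
    return first, last


def search_sp(s, p):
    """Objects matching ( s, p, ? )."""
    y_first, y_last = list_bounds(BY, s)
    for y in range(y_first, y_last + 1):
        if Y[y - 1] == p:
            # The y-th entry of Y owns the y-th list of Z.
            z_first, z_last = list_bounds(BZ, y)
            return Z[z_first - 1:z_last]
    return []


def search_s(s):
    """(predicate, object) pairs matching ( s, ?, ? )."""
    pairs = []
    y_first, y_last = list_bounds(BY, s)
    for y in range(y_first, y_last + 1):
        z_first, z_last = list_bounds(BZ, y)
        pairs += [(Y[y - 1], o) for o in Z[z_first - 1:z_last]]
    return pairs
```

For the slide's pattern ( 2, 2, ? ), subject 2 owns positions 3 to 5 of Array Y, predicate 2 sits at position 4, and the fourth list of Array Z holds the matching objects.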

Searching by Predicate

Example pattern: ( ?, 2, ? )
Patterns resolved: ?P?

Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2

S:         1 2 3
Array Y:   2 3 1 2 4 3
Bitmap Y:  1 0 1 0 0 1
Array Z:   6 2 3 4 5 1 2
Bitmap Z:  1 1 1 1 0 1 1

Wavelet Tree

• Compact sequence of integers in {0..σ}.

[Figure: example integer sequence with numbered positions, illustrating rank(3, 7) = 2 and select(6, 3) = 9.]

• access(position) = Value at position. O(log σ)
• rank(entry, position) = Number of appearances of “entry” up to “position”. O(log σ)
• select(entry, i) = Position where “entry” appears for the i-th time. O(log σ)
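The three operations can be illustrated with a small pointer-based wavelet tree. This is a teaching sketch, not the succinct layout an HDT library would use: each node keeps plain prefix-count lists instead of a compressed bitmap, the class name `WaveletTree` is hypothetical, and positions are 1-based as on the slide.

```python
from bisect import bisect_left


class WaveletTree:
    """Wavelet tree over integers; each level halves the value range."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi, self.n = lo, hi, len(seq)
        self.mid = (lo + hi) // 2
        self.left = self.right = None
        if lo == hi or not seq:
            return  # leaf (single symbol) or empty node
        # counts[0][g] / counts[1][g]: how many of the first g values go left / right.
        self.counts = ([0], [0])
        for v in seq:
            goes_right = v > self.mid
            self.counts[0].append(self.counts[0][-1] + (not goes_right))
            self.counts[1].append(self.counts[1][-1] + goes_right)
        self.left = WaveletTree([v for v in seq if v <= self.mid], lo, self.mid)
        self.right = WaveletTree([v for v in seq if v > self.mid], self.mid + 1, hi)

    def _child(self, side):
        return self.left if side == 0 else self.right

    def access(self, pos):
        """Value at position pos."""
        if not 1 <= pos <= self.n:
            raise IndexError(pos)
        if self.lo == self.hi:
            return self.lo
        side = 0 if self.counts[0][pos] > self.counts[0][pos - 1] else 1
        return self._child(side).access(self.counts[side][pos])

    def rank(self, value, pos):
        """Number of occurrences of value among the first pos positions."""
        pos = min(pos, self.n)
        if pos <= 0 or not self.lo <= value <= self.hi:
            return 0
        if self.lo == self.hi:
            return pos
        side = 0 if value <= self.mid else 1
        return self._child(side).rank(value, self.counts[side][pos])

    def select(self, value, i):
        """Position of the i-th occurrence of value, or None if there is none."""
        if i < 1 or self.n == 0 or not self.lo <= value <= self.hi:
            return None
        if self.lo == self.hi:
            return i if i <= self.n else None
        side = 0 if value <= self.mid else 1
        k = self._child(side).select(value, i)
        if k is None:
            return None
        # Map the child's position k back to the first position here whose count reaches k.
        return bisect_left(self.counts[side], k)


wt = WaveletTree([2, 3, 1, 2, 4, 3])  # Array Y from the running example
```

Each call descends one node per level, which is where the logarithmic dependence on the alphabet size comes from. On Array Y, `wt.select(2, 1)` and `wt.select(2, 2)` locate the occurrences of predicate 2 without scanning the array.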

Searching by Predicate w/ Wavelet

Example pattern: ( ?, 2, ? )
Patterns resolved: ?P?

Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2

S:          1 2 3
Wavelet Y:  2 3 1 2 4 3
Bitmap Y:   1 0 1 0 0 1
Array Z:    6 2 3 4 5 1 2
Bitmap Z:   1 1 1 1 0 1 1

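Triples: Object-Search

Example pattern: ( ?, ?, 2 )
Patterns resolved: ??O, ?PO

Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2

S:  1 2 3
P:  2 3 1 2 4 3
O:  6 2 3 4 5 1 2

OP-Index (for each object, the positions where it occurs in O):
O1     O2       O3     O4     O5     O6
[6]    [2 7]    [3]    [4]    [5]    [1]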
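Building and using such an index on the example data could look like the sketch below. The function names are hypothetical, `find_list` counts bits naively rather than using a constant-time rank, and the lists are kept in position order, which may differ from the ordering used by the actual index.

```python
from collections import defaultdict

# Example data from the slides (positions are 1-based in the comments).
Y = [2, 3, 1, 2, 4, 3]          # predicates, grouped by subject
BY = [1, 0, 1, 0, 0, 1]         # 1 = first predicate of a subject
Z = [6, 2, 3, 4, 5, 1, 2]       # objects, grouped by (subject, predicate)
BZ = [1, 1, 1, 1, 0, 1, 1]      # 1 = first object of a (subject, predicate) pair


def build_op_index(z):
    """For every object ID, collect the 1-based positions where it occurs in z."""
    index = defaultdict(list)
    for pos, obj in enumerate(z, start=1):
        index[obj].append(pos)
    return dict(index)


def find_list(bitmap, g):
    """List number containing position g = number of 1 bits up to g."""
    return sum(bitmap[:g])


def search_o(o, op_index):
    """Triples matching ( ?, ?, o )."""
    triples = []
    for z_pos in op_index.get(o, []):
        y_pos = find_list(BZ, z_pos)   # which (subject, predicate) pair owns this object
        p = Y[y_pos - 1]
        s = find_list(BY, y_pos)       # which subject owns that predicate
        triples.append((s, p, o))
    return triples


def search_po(p, o, op_index):
    """Triples matching ( ?, p, o ): filter the object's list by predicate."""
    return [t for t in search_o(o, op_index) if t[1] == p]


op_index = build_op_index(Z)
```

For the slide's pattern ( ?, ?, 2 ), the list of object 2 points at positions 2 and 7, and walking back through the two bitmaps recovers the predicate and the subject of each occurrence.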

Data Structure Summary.

• From HDT to HDT-FoQ:
– Convert Array Y to Wavelet.
– Generate OP-Index.

• Triple Patterns:

Pattern               Resolved with
SPO, SP?, S??, S?O    Original HDT
?P?                   Wavelet Tree
?PO, ??O              OP-Index
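The pattern table can be read as a small dispatch rule. The sketch below is only an illustration: the function name `structure_for` is hypothetical, `None` stands for an unbound term, and the fully unbound pattern is reported as unlisted because the slide does not cover it.

```python
def structure_for(s, p, o):
    """Name the structure that resolves a triple pattern; None marks an unbound term."""
    if s is not None:
        return "Original HDT"   # SPO, SP?, S??, S?O
    if o is not None:
        return "OP-Index"       # ?PO, ??O
    if p is not None:
        return "Wavelet Tree"   # ?P?
    return "unlisted"           # ??? does not appear in the table
```

The order of the checks matters: any pattern with a bound subject stays on the original structures, whatever else is bound.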

Evaluation Environment

Datasets:

Dataset     Triples   Size (NTriples)
LinkedMDB   6.1M      850 MB
DBLP        73M       11.1 GB
Geonames    112M      12.3 GB
DBPedia     258M      37.3 GB

Producer: Xeon @ 2.4 GHz, 96 GB RAM.
Consumer: Phenom-II @ 3.2 GHz, 8 GB RAM.

Compressors:
• GZIP
• LZMA

RDF Storage:
• Virtuoso
• RDF-3x
• Hexastore

Compression Ratio

[Figure: compression ratio (% against plain NTriples, axis 0 to 14) for DBPedia, Geonames, DBLP and LinkedMDB; series: hdt, gz, lzma, hdt.gz, hdt.lzma.]

Publication Times

Dataset     NT+GZIP    NT+LZMA    HDT        HDT+GZIP   HDT+LZMA
LinkedMDB   11.3 sec   14.7 min   1.05 min   1.09 min   1.52 min
DBLP        2.72 min   103 min    12 min     13.5 min   21.9 min
Geonames    3.28 min   244 min    25 min     26.4 min   38.9 min
DBPedia     15.9 min   466 min    56 min     60 min     121 min

[Figure: publication time per dataset, as times slower than NTriples+GZIP (axis 0 to 80); series: gz, lzma, hdt, hdt.gz, hdt.lzma.]

Publication Times (2)

[Figure: publication time per dataset, as times slower than NTriples+GZIP (axis 0 to 13), with the lzma series left out; series: gz, hdt, hdt.gz, hdt.lzma.]

Exchange & Decompression Time

[Figure: exchange and decompression time in seconds (geometric mean of all datasets, axis 0 to 300) for GZIP, LZMA, HDT+GZIP and HDT+LZMA; each bar split into Exchange and Decompress.]

*Assuming a network bandwidth of 2 MByte/s.
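The exchange component is simple arithmetic: size divided by bandwidth, then aggregated with a geometric mean. The sketch below uses the plain NTriples sizes from the evaluation table, since the compressed sizes appear only in the figure. Reading the table's units as megabytes and gigabytes with 1 GB = 1000 MB is an assumption, and the function names are hypothetical.

```python
import math

# Plain NTriples sizes from the evaluation table, in MB (1 GB taken as 1000 MB).
PLAIN_SIZE_MB = {
    "LinkedMDB": 850,
    "DBLP": 11_100,
    "Geonames": 12_300,
    "DBPedia": 37_300,
}


def exchange_seconds(size_mb, bandwidth_mb_per_s=2.0):
    """Transfer time for a file of size_mb at the given bandwidth."""
    return size_mb / bandwidth_mb_per_s


def geometric_mean(values):
    """Geometric mean, the aggregate used on the figure's axis."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Time to transfer each dataset uncompressed at 2 MByte/s.
uncompressed = {name: exchange_seconds(mb) for name, mb in PLAIN_SIZE_MB.items()}
mean_uncompressed = geometric_mean(list(uncompressed.values()))
```

Multiplying a size by a compression ratio before calling `exchange_seconds` gives the corresponding compressed transfer time.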

Overall Client Time

[Figure: overall client time in seconds (geometric mean of all datasets, axis 0 to 3600) for LZMA+Virtuoso, GZ+Virtuoso, LZMA+RDF3x, GZ+RDF3x, HDT+LZMA+FOQ and HDT+GZIP+FOQ; each bar split into Exchange, Decompress and Index.]

Dataset     LZMA+RDF3x   HDT+LZMA
LinkedMDB   2.1 min      9.21 sec
DBLP        27 min       2.02 min
Geonames    49.2 min     3.04 min
DBPedia     159 min      14.3 min

In-memory Data Store.

Index size (MB):

Dataset     Triples   Virtuoso   Hexastore   RDF3x    HDT-FoQ
LinkedMDB   6.1M      518        6976        337      68
DBLP        46M       3982       -           3252     850
Geonames    112M      9216       -           6678     1435
DBPedia     258M      -          -           15802    5260

• Smaller size = more data in memory = fewer I/O accesses!

Query Performance, Triple Patterns

[Figure: two panels, LinkedMDB and Geonames; y-axis: times HDT is faster (0 to 16) than RDF-3x and Virtuoso; x-axis: triple patterns SP?, S?O, S??, ?PO, ?P?, ??O.]

Query Performance, Two-way Joins

[Figure: two panels, LinkedMDB and Geonames; y-axis: times HDT is faster (0 to 3) than RDF-3x and Virtuoso; x-axis: join types SSbig, SSsmall, SObig, SOsmall, OObig, OOsmall.]

Conclusions

• Data is ready to be consumed 10-15x faster.
– Exchange time reduced.
– Indexing burden on server = lightweight client processing.
• Competitive query performance.
– Very fast on triple patterns.
– Joins on the same scale as existing solutions.
• This is useful to you:
– If you need a fast, compact, read-only in-memory RDF store.
– If you want to share self-queryable RDF dumps.
– If you need fast download & query.
• Addresses the volume issue of Big Data.

Future work.

• Full SPARQL support.
– UNION, OPTIONAL, multiple joins.
– Optimized query evaluation.
• Integration:
– Jena, Any23…
• Dissemination.
– Get more people to use it!
• Additional services on top of HDT.
– SPARQL Endpoint.
– Distributed Stream Processing.
– Mobile Applications.

Thanks! http://www.rdf-hdt.org
