... or how to query an RDF graph with 28 billion triples on a standard laptop
These slides correspond to my talk at the Stanford Center for Biomedical Informatics, on 25th April 2018
Democratizing Big Semantic Data management
1. Democratizing Big Semantic
Data management
Javier D. Fernández
WU Vienna, Austria
Complexity Science Hub Vienna, Austria
Privacy and Sustainable Computing Lab, Austria
STANFORD CENTER FOR BIOMEDICAL INFORMATICS
APRIL 25TH, 2018.
or how to query an RDF graph with 28 billion triples on a standard laptop
2. The Linked Open Data cloud (2018)
~10K datasets organized into 9 domains covering many varied knowledge fields.
150B statements, including entity descriptions and (inter/intra-dataset) links between them.
>500 live endpoints serving this data.
http://lod-cloud.net/
http://stats.lod2.eu/
http://sparqles.ai.wu.ac.at/
3. But what about Web-scale queries?
E.g. retrieve all entities in LOD referring to the gene WBGene00000001 (aap-1)
Solutions?
select distinct ?x {
?x dcterms:title "WBGene00000001 (aap-1)" .
}
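As a sketch of what issuing this lookup against a single public endpoint involves (the endpoint URL is just an example; `format=json` is a convention understood by Virtuoso-based endpoints such as DBpedia's, the standard alternative being an `Accept` header):

```python
from urllib.parse import urlencode

# The triple-pattern query from the slide, with the dcterms prefix spelled out.
QUERY = '''SELECT DISTINCT ?x {
  ?x <http://purl.org/dc/terms/title> "WBGene00000001 (aap-1)" .
}'''

def sparql_get_url(endpoint, query):
    """Build the HTTP GET URL that asks a SPARQL endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})

url = sparql_get_url("https://dbpedia.org/sparql", QUERY)
# urllib.request.urlopen(url) would then fetch the result bindings
# (network access required, and each endpoint only covers its own dataset).
```

Each endpoint answers only for its own dataset, which is exactly why Web-scale resolution needs something more than one-off requests.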
5. A) Federated Queries!!
1. Get a list of potential SPARQL Endpoints
datahub.io, LOV, other catalogs?
2. Query each SPARQL Endpoint
Problems?
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeout, #results)
Moreover, complex queries (joins) can be tricky due to intermediary results, delays, etc.
The Web of Data Ecosystem
http://sparqles.ai.wu.ac.at/
7. B) Follow-your-nose
1. Follow self-descriptive IRIs and links
2. Filter the results you are interested in
Problems?
You need some initial seed
DBpedia could be a good start
It’s slow (fetching many documents)
Where should I start for unbounded queries?
?x dcterms:title "WBGene00000001 (aap-1)"
8. C) Use the RDF dumps by yourself
1. Crawl the Web of Data
Probably start with datahub.io, LOV, other catalogs?
2. Download datasets
You'd better have some free space on your machine
3. Index the datasets locally
You'd better be patient and survive parsing errors
4. Query all datasets
You'd better still be alive by then
Problems?
Huge resources!
+ Messiness of the data
Rietveld, L., Beek, W., & Schlobach, S. (2015). LOD Lab: Experiments at LOD Scale. In ISWC.
9. Publication, Exchange and Consumption of large RDF datasets
Most RDF formats (N3, XML, Turtle) are text serializations, designed for human readability (not for machines)
Verbose = high costs to write/exchange/parse
A basic offline search = (decompress) + index the file + search
The problem is in the roots
(Big tree = big roots)
Steve Garry
10. 1) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with a small (3%) memory footprint
Very fast on basic queries (triple patterns), ~1.5x faster than Virtuoso, Jena, RDF3X
Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current, more optimized solutions
Challenges:
The publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently!)
A Linked Data hacker toolkit
~431M triples:
NT: 63 GB
NT + gzip: 5 GB
HDT: 6.6 GB
Slightly more than gzip, but you can query!
rdfhdt.org
12. Publication Metadata:
Publisher, Public Endpoint, …
Statistical Metadata:
Number of triples, subjects, entities, histograms…
Format Metadata:
Description of the data structure, e.g. Triples Order.
Additional Metadata:
Domain-specific.
… In RDF
Header
16. Mapping of strings to consecutive IDs {1..n}
Lexicographically sorted, no duplicates.
Prefix-based compression in each section.
Efficient ID ↔ string operations
Dictionary
17. Prefix-Based compression used in each section
1. Each string is encoded with two values
An integer representing the number of characters shared with the previous string
A sequence of characters representing the suffix not shared with the previous string
Dictionary. Plain Front Coding (PFC)
A
An
Ant
Antivirus
Antivirus Software
Best
(0,a) (1,n) (2,t) (3,ivirus) (9, Software) (0,Best)
18. Prefix-Based compression used in each section
2. The vocabulary is split into buckets, each of them storing "b" strings
The first string of each bucket (the header) is coded explicitly (i.e. the full string)
The subsequent b-1 strings (internal strings) are coded differentially
Dictionary. Plain Front Coding (PFC)
A
An
Ant
Antivirus
Antivirus Software
Best
Bucket 1 Bucket 2
a (1,n) (2,t) Antivirus (9, Software) (0,Best)
1 2 3 4 5 6
19. Prefix-Based compression used in each section
3. PFC is encoded with a byte sequence and an array of pointers (ptr) denoting the first byte of each bucket
Dictionary. Plain Front Coding (PFC)
A
An
Ant
Antivirus
Antivirus Software
Best
ptr: 1 9
Bucket 1 Bucket 2
a (1,n) (2,t) Antivirus (9, Software) (0,B)
1 2 3 4 5 6
20. Prefix-Based compression used in each section
Locate(string) performs a binary search on the headers + sequential decoding of internal strings
e.g. locate(Antivirus Software) = 5
Extract(id) finds the bucket ⌈id/b⌉ and decodes until the given position
e.g. extract(5) = Antivirus Software
Dictionary. Plain Front Coding (PFC)
A
An
Ant
Antivirus
Antivirus Software
Best
More on compressed dictionaries: Martínez-Prieto, M. A., Brisaboa, N., Cánovas, R., Claude, F., & Navarro, G. (2016). Practical Compressed String Dictionaries. Information Systems, 56, 73-108.
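The PFC operations above can be sketched in a few lines. This is a toy implementation (not the actual HDT code); bucket size b = 3 matches the running example, and IDs are 1-based as in the slides:

```python
import bisect

def common_prefix_len(a, b):
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

class PFCDictionary:
    def __init__(self, strings, b=3):
        # strings must be lexicographically sorted, without duplicates
        self.b = b
        self.buckets = []
        for i in range(0, len(strings), b):
            chunk = strings[i:i + b]
            header, diffs, prev = chunk[0], [], chunk[0]
            for s in chunk[1:]:
                l = common_prefix_len(prev, s)
                diffs.append((l, s[l:]))   # (shared chars, remaining suffix)
                prev = s
            self.buckets.append((header, diffs))

    def extract(self, id):
        """ID -> string: find bucket ceil(id/b), decode up to the position."""
        header, diffs = self.buckets[(id - 1) // self.b]
        s = header
        for l, suffix in diffs[:(id - 1) % self.b]:
            s = s[:l] + suffix
        return s

    def locate(self, string):
        """String -> ID: binary search on headers + sequential decoding."""
        headers = [h for h, _ in self.buckets]
        bi = bisect.bisect_right(headers, string) - 1
        if bi < 0:
            return None
        header, diffs = self.buckets[bi]
        if header == string:
            return bi * self.b + 1
        s = header
        for j, (l, suffix) in enumerate(diffs):
            s = s[:l] + suffix
            if s == string:
                return bi * self.b + j + 2
        return None

# The running example from the slides:
d = PFCDictionary(["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"])
# d.locate("Antivirus Software") == 5, d.extract(5) == "Antivirus Software"
```

Real implementations work over the byte sequence and the ptr array directly instead of Python tuples, but the locate/extract logic is the same.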
22. Represent and index large volumes of data
~ Theoretical minimum space while serving efficient operations:
Mostly based on 3 operations:
Access
Rank
Select
Remember… Succinct Data Structures
23. Bitmap Sequence.
Operations in constant time:
access(position) = value at that position
rank(position) = "Number of ones, up to position"
select(i) = "Position of the i-th one"
Implementation:
n + o(n) bits
Adjustable space overhead: in practice, 37.5% overhead
Bit Sequence Coding
position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
bits:     1 1 0 0 0 1 1 0 1 0  1  1  0  1  1  0
rank(7) = 4
select(5) = 9
access(14) = 1
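These three operations can be prototyped naively as follows (a clarity-first sketch: it answers in linear time, whereas real succinct bitmaps precompute o(n) bits of rank samples to answer in constant time):

```python
class Bitmap:
    """Toy 1-based bitmap supporting access/rank/select (linear time)."""

    def __init__(self, bits):
        self.bits = bits

    def access(self, pos):
        return self.bits[pos - 1]

    def rank(self, pos):
        # number of 1-bits in positions 1..pos
        return sum(self.bits[:pos])

    def select(self, i):
        # position of the i-th 1-bit
        count = 0
        for pos, b in enumerate(self.bits, start=1):
            count += b
            if b and count == i:
                return pos
        raise ValueError("fewer than i ones in the bitmap")

# The example bitmap from the slide:
B = Bitmap([1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0])
# B.rank(7) == 4, B.select(5) == 9, B.access(14) == 1
```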
24. Bitmap Triples:
Bitmap Triples Encoding
(figure: subjects, with their Predicates and Objects encoded as sequences Sp/So plus bitmaps Bp/Bo)
E.g. retrieve (2,5,?)
Find the position of the second '1'-bit in Bp (select)
Binary search on the list of predicates looking for 5
Note that such predicate 5 is in position 4 of Sp
Find the position of the fourth '1'-bit in Bo (select)
Patterns solved this way: S P O, S P ?, S ? O, S ? ?, ? ? ?
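The (2,5,?) walk above can be mimicked on toy arrays. All IDs below are made up for illustration, and for simplicity I assume Bp/Bo mark the *first* element of each subject's / each (subject,predicate) pair's run (actual HDT marks run boundaries slightly differently):

```python
def select1(bits, i):
    """Position (1-based) of the i-th 1-bit."""
    count = 0
    for pos, b in enumerate(bits, start=1):
        count += b
        if b and count == i:
            return pos
    raise ValueError("not enough 1-bits")

def retrieve_sp(Bp, Sp, Bo, So, s, p):
    """Resolve the pattern (s, p, ?) on toy Bitmap Triples arrays."""
    # 1. locate subject s's run of predicates in Sp via select on Bp
    start = select1(Bp, s) - 1                              # 0-based
    end = select1(Bp, s + 1) - 1 if sum(Bp) > s else len(Sp)
    preds = Sp[start:end]              # sorted, so a binary search would do
    if p not in preds:
        return []
    gpos = start + preds.index(p) + 1  # 1-based global position in Sp
    # 2. the gpos-th (subject,predicate) pair owns an object run in So
    ostart = select1(Bo, gpos) - 1
    oend = select1(Bo, gpos + 1) - 1 if sum(Bo) > gpos else len(So)
    return So[ostart:oend]

# Toy data: subject 1 -> preds 2,3 ; subject 2 -> preds 1,5 ; subject 3 -> pred 2
Bp = [1, 0, 1, 0, 1]
Sp = [2, 3, 1, 5, 2]
Bo = [1, 1, 1, 1, 0, 1]        # pair (2,5) is the 4th pair and owns two objects
So = [6, 3, 2, 4, 5, 1]
# retrieve_sp(Bp, Sp, Bo, So, 2, 5) == [4, 5]
```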
25. E.g. retrieve (?,4,?)
Get the location of the list: Bl.select(3)+1 = 4+1 = 5
Retrieve the position: Sl[5] = 1
Get the associated subject: rank(1) = 1st subject
Access the object as (1,4,?)
Encoded as a position list
Additional Predicate Index
(figure: predicate position list Sl = 5 6 2 3 1 4 with bitmap Bl = 0 1 1 1 1 1; solves the ? P ? pattern)
26. Encoded as a position list
Additional Object Index
(figure: object position list Sl = 2 5 6 4 3 3 1 with bitmap Bl = 0 0 1 1 1 1 1; solves the ? ? O and ? P O patterns)
27. From the exchanged HDT to the functional HDT-FoQ:
Publish and Exchange HDT
At the consumer:
On-the-fly indexes: HDT-FoQ

Process | Type of Index | Patterns
1. Index the bitsequences | Subject (SPO) | SPO, SP?, S??, S?O, ???
2. Index the position of each predicate (just a position list) | Predicate (PSO) | ?P?
3. Index the position of each object (just a position list) | Object (OPS) | ?PO, ??O
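The table reads as a simple dispatch rule over which component of the pattern is bound. A minimal sketch (index names per the slide; `None` marks an unbound component):

```python
def choose_index(s, p, o):
    """Pick which HDT-FoQ index resolves a triple pattern."""
    if s is not None:
        return "SPO"   # subject bound: native order handles SPO, SP?, S?O, S??
    if o is not None:
        return "OPS"   # ?PO and ??O go through the object index
    if p is not None:
        return "PSO"   # ?P? goes through the predicate index
    return "SPO"       # ??? scans the native SPO order
```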
29. Some numbers on size
http://dataweb.infor.uva.es/projects/hdt-mr/
José M. Giménez-García, Javier D. Fernández, and Miguel A. Martínez-Prieto. HDT-MR: A Scalable Solution for RDF Compression with HDT and MapReduce. In Proc. of the International Semantic Web Conference (ISWC), 2015.
28,362,198,927 Triples
30. Data is ready to be consumed 10-15x faster.
HDT << any other RDF format || RDF engine
Competitive query performance:
Very fast on triple patterns, ~1.5x faster (Virtuoso, RDF3x)
Integration with Jena
Joins on the same scale as existing solutions (Virtuoso, RDF3x)
Results
33. A Linked Data hacker toolkit
2) LOD Laundromat
Uses HDT as the main storage solution
Challenges:
You still need to query 650K datasets
Of course it does not contain all of LOD, but "a good approximation"
http://lodlaundromat.org/
Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., & Schlobach, S. (2014). LOD Laundromat: A Uniform Way of Publishing Other People's Dirty Data. In ISWC (pp. 213-228).
34. 3) Linked Data Fragments
Typically uses HDT as the main engine
Challenges:
Still room for optimization for complex federated queries (delays, intermediate results, …)
A Linked Data hacker toolkit
Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying Datasets on the Web with High Availability. In ISWC (pp. 180-196).
35. Get more than 650K HDT datasets from
LOD Laundromat…
37. Scalable storage
Store and serve thousands of (large) datasets (e.g. LOD Laundromat)
Archiving (reduce storage costs + foster smart consumption)
Also with deltas (see https://aic.ai.wu.ac.at/qadlod/bear.html)
Better consumer-centric publication
Compress and share ready-to-consume RDF datasets
Consumption with limited resources
smartphones, standard laptops
Fast, low-cost SPARQL Query Engine
Via HDT-Jena
Via Linked Data Fragments
Application scenarios (HDT)
38. Storage + Light API
http://lodlaundromat.org/
http://linkeddatafragments.org/
Storage + SPARQL Query Engine
https://data.world/
Advanced features
Top-k Shortest Path
Query Answering over the Web of Data
Others: Versioning, streaming,…
Application Examples
39. Application Examples
Publication in HDT (~1B triples): https://zenodo.org/record/1116889#.WuBt0C7FKpo
Uri resolver based on HDT: https://github.com/pharmbio/urisolve
91,498,351 compounds from PubChem with predicted logD (water–octanol distribution coefficient) values at 90% confidence level
40. But what about Web-scale queries?
E.g. retrieve all entities referring to the gene WBGene00000001 (aap-1)
Solutions?
select distinct ?x {
?x dcterms:title "WBGene00000001 (aap-1)" .
}
45. Query resolution at Web scale
Using LDF, Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
LOD-a-lot (some use cases)
46. Identity closure
?x owl:sameAs ?y
Graph navigations
E.g. shortest path, random walk
LOD-a-lot (some use cases)
More use cases: Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
More use cases:
47. Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Leverage the HDT indexes to support “data science”
E.g. get links across datasets, study the topology of the network, optimize
query planning
Support provenance of the triples (i.e. origin of each triple)
Currently supported only via LOD Laundromat
… implement the use cases and help the community democratize access to LOD
Roadmap
48. We are currently facing Big Linked Data challenges
Generation, publication and consumption
Archiving, evolution…
Thanks to compression/HDT, the Big Linked Data
today will be the “pocket” data tomorrow
HDT democratizes access to Big Linked Data
= Cheap, scalable consumers
low-cost access to LOD = high-impact research
Take-home messages
50. Thank you!
javier.fernandez@wu.ac.at
Kudos to all the co-authors involved in the works presented here
Incomplete list of ACKs:
Miguel A. Martínez-Prieto
Mario Arias
Pablo de la Fuente
Claudio Gutierrez
Axel Polleres
Wouter Beek
Ruben Verborgh
… And many others