Democratizing Big Semantic Data management
or how to query an RDF graph with 28 billion triples on a standard laptop
Javier D. Fernández
WU Vienna, Austria
Complexity Science Hub Vienna, Austria
Privacy and Sustainable Computing Lab, Austria
STANFORD CENTER FOR BIOMEDICAL INFORMATICS
APRIL 25TH, 2018.
The Linked Open Data cloud (2018)
 ~10K datasets organized into 9
domains which include many and
varied knowledge fields.
 150B statements, including
entity descriptions and
(inter/intra-dataset) links between
them.
 >500 live endpoints serving this
data.
http://lod-cloud.net/
http://stats.lod2.eu/
http://sparqles.ai.wu.ac.at/
But what about Web-scale queries
 E.g. retrieve all entities in LOD referring to the gene WBGene00000001 (aap-1)
 Solutions?
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?x WHERE {
  ?x dcterms:title "WBGene00000001 (aap-1)" .
}
Let’s fish in our Linked Data Eco System
The Web of Data Eco System
http://sparqles.ai.wu.ac.at/
A) Federated Queries!!
1. Get a list of potential SPARQL Endpoints
 datahub.io, LOV, other catalogs?
2. Query each SPARQL Endpoint
 Problems?
 Many SPARQL Endpoints have low availability
 SPARQL Endpoints are usually restricted (timeout,#results)
 Complex queries (joins) are also tricky because of intermediate results, delays, etc.
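The two federated steps above can be sketched with the Python standard library. This is only an illustration: the endpoint list is whatever a catalog returns, the names `build_request` and `query_endpoints` are hypothetical, and the 10-second timeout is an arbitrary choice. It assumes each endpoint accepts the query through the standard SPARQL protocol `query` parameter and can return JSON results.

```python
import json
import urllib.parse
import urllib.request

QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?x WHERE {
  ?x dcterms:title "WBGene00000001 (aap-1)" .
}
"""


def build_request(endpoint, query):
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def query_endpoints(endpoints, query, timeout=10.0):
    """Query each endpoint in turn and skip the ones that fail.

    Returns (results, failed): results maps endpoint -> list of ?x values,
    failed lists endpoints that were down, timed out or answered garbage.
    """
    results, failed = {}, []
    for endpoint in endpoints:
        try:
            request = build_request(endpoint, query)
            with urllib.request.urlopen(request, timeout=timeout) as response:
                bindings = json.load(response)["results"]["bindings"]
            results[endpoint] = [b["x"]["value"] for b in bindings]
        except (OSError, ValueError, KeyError):
            failed.append(endpoint)
    return results, failed
```

Even in this small form the listed problems are visible: every unavailable endpoint costs up to one timeout, and the results are only as complete as the endpoint list.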
The Web of Data Eco System
B) Follow-your-nose
1. Follow self-descriptive IRIs and links
2. Filter the results you are interested in
 Problems?
 You need some initial seed
 DBpedia could be a good start
 It’s slow (fetching many documents)
 Where should I start for unbounded queries?
 ?x dcterms:title "WBGene00000001 (aap-1)"
The Web of Data Eco System
C) Use the RDF dumps by yourself
1. Crawl the Web of Data
 Probably start with datahub.io, LOV, other catalogs?
2. Download datasets
 You had better have some free space on your machine
3. Index the datasets locally
 You had better be patient and survive parsing errors
4. Query all datasets
 You had better still be alive by then
 Problems?
 Huge resources!
 + Messiness of the data
The Web of Data Eco System
Rietveld, L., Beek, W., & Schlobach, S. (2015). LOD
lab: Experiments at LOD scale. In ISWC
 Publication, Exchange and Consumption of large RDF datasets
 Most RDF formats (N3, XML, Turtle) are text serializations, designed for
human readability (not for machines)
 Verbose = High costs to write/exchange/parse
 A basic offline search = (decompress)+ index the file + search
The problem is in the roots
(Big tree = big roots)
Steve Garry
1) HDT
 Highly compact serialization of RDF
 Allows fast RDF retrieval in compressed space (without prior decompression)
 Includes internal indexes to solve basic queries with small (3%) memory footprint.
 Very fast on basic queries (triple patterns), 1.5x faster than Virtuoso, Jena, RDF-3X.
 Supports full SPARQL as the compressed backend store of Jena, with efficiency on the
same scale as current, more optimized solutions
 Challenges:
 Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then
it is ready to consume efficiently!)
A Linked Data hacker toolkit
431 M triples ≈ 63 GB
NT + gzip: 5 GB
HDT: 6.6 GB (slightly more, but you can query!)
rdfhdt.org
HDT (Header-Dictionary-Triples) Overview
RDF dataset = Header + Dictionary + Triples
 Header: metadata describing the RDF dataset
 Dictionary: mapping between IDs and elements in the dataset (e.g. 1 = aa.., 2 = ab.., 3 = bu..)
 Triples: structure of the data after the ID replacement
 Publication Metadata:
 Publisher, Public Endpoint, …
 Statistical Metadata:
 Number of triples, subjects, entities, histograms…
 Format Metadata:
 Description of the data structure, e.g. Triples Order.
 Additional Metadata:
 Domain-specific.
 … In RDF
Header
Dictionary+Triples partition
[Figure: an example RDF graph, its dictionary and its ID-based triples. Terms in the example dictionary:]
<http://example.org/Vienna>
<http://example.org/Javier>
<http://example.org/Paul>
<http://example.org/Researcher>
<http://example.org/Stefan>
ex:birthPlace
ex:workPlace
foaf:mbox
foaf:name
rdf:type
“jfergar@example.org”
“jfergar@wu.ac.at”
“Vienna”@en
 Mapping of strings to consecutive IDs {1..n}
 Lexicographically sorted, no duplicates.
 Prefix-based compression in each section.
 Efficient ID ↔ String operations
Dictionary
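As a rough illustration of the ID replacement, the sketch below builds one sorted dictionary and rewrites triples as ID tuples. It is a simplification: it uses a single dictionary rather than separate sections, plain Python code-point ordering, and three made-up triples over the `ex:` terms; the function names are hypothetical.

```python
def build_dictionary(terms):
    """Assign consecutive IDs 1..n to the sorted, de-duplicated terms."""
    ordered = sorted(set(terms))
    id_of = {term: i for i, term in enumerate(ordered, start=1)}
    return ordered, id_of


def encode_triples(triples, id_of):
    """Replace every term by its ID and sort the resulting ID triples."""
    return sorted(tuple(id_of[term] for term in triple) for triple in triples)


# Hypothetical triples, only for illustration.
triples = [
    ("ex:Javier", "rdf:type", "ex:Researcher"),
    ("ex:Javier", "ex:workPlace", "ex:Vienna"),
    ("ex:Paul", "ex:birthPlace", "ex:Vienna"),
]
ordered, id_of = build_dictionary(t for triple in triples for t in triple)
encoded = encode_triples(triples, id_of)

print(encoded)                 # ID triples
print(ordered[encoded[0][0] - 1])  # ID -> string is a plain array access
```

String to ID is a dictionary lookup here; in a compressed dictionary it becomes a search over the sorted strings.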
 Prefix-Based compression used in each section
1. Each string is encoded with two values
 An integer representing the number of characters shared with the previous
string
 A sequence of characters representing the suffix that is not shared with the
previous string
Dictionary. Plain Front Coding (PFC)
Strings: A, An, Ant, Antivirus, Antivirus Software, Best
Encoded: (0,A) (1,n) (2,t) (3,ivirus) (9, Software) (0,Best)
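A minimal sketch of this differential encoding, assuming the input list is already sorted. The function names `front_encode` and `front_decode` are illustrative and not taken from any HDT library.

```python
def front_encode(strings):
    """Encode a sorted list as (shared_prefix_length, suffix) pairs."""
    encoded, previous = [], ""
    for s in strings:
        shared = 0
        limit = min(len(s), len(previous))
        while shared < limit and s[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, s[shared:]))
        previous = s
    return encoded


def front_decode(encoded):
    """Rebuild the original strings from the (shared, suffix) pairs."""
    strings, previous = [], ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        strings.append(previous)
    return strings


words = ["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"]
pairs = front_encode(words)
print(pairs)
print(front_decode(pairs) == words)
```

Decoding string i needs every string before it, which is what the bucket layout is meant to bound.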
2. The vocabulary is split into buckets, each of them storing “b” strings
 The first string of each bucket (header) is coded explicitly (i.e. full string)
 The subsequent b-1 strings (internal strings) are coded differentially
Example with b = 3 (string IDs 1 to 6):
Bucket 1: A (1,n) (2,t)
Bucket 2: Antivirus (9, Software) (0,Best)
3. PFC is encoded as a byte sequence plus an array of pointers (ptr) that
denote the first byte of each bucket
Example: ptr = 1 9
Bucket 1: A (1,n) (2,t)
Bucket 2: Antivirus (9, Software) (0,Best)
 Locate(string) performs a binary search on the bucket headers, then decodes the
internal strings sequentially
 e.g. locate(Antivirus Software) = 5
 Extract(id) finds the bucket ⌈id/b⌉ and decodes until the given position
 e.g. extract(5) = Antivirus Software
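The two operations can be sketched as follows. This keeps buckets as Python lists instead of a byte sequence with a pointer array, and the class name `BucketedPFC` and its default `b=3` are assumptions for the example.

```python
from bisect import bisect_right


def shared_prefix(a, b):
    """Length of the common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


class BucketedPFC:
    def __init__(self, strings, b=3):
        self.b, self.n = b, len(strings)
        self.headers = strings[::b]  # first string of each bucket, in full
        self.buckets = []            # internal strings as (shared, suffix)
        for start in range(0, len(strings), b):
            chunk = strings[start:start + b]
            self.buckets.append([
                (shared_prefix(prev, cur), cur[shared_prefix(prev, cur):])
                for prev, cur in zip(chunk, chunk[1:])
            ])

    def extract(self, string_id):
        """Return the string with the given 1-based ID."""
        if not 1 <= string_id <= self.n:
            raise IndexError(string_id)
        bucket, offset = divmod(string_id - 1, self.b)
        current = self.headers[bucket]
        for shared, suffix in self.buckets[bucket][:offset]:
            current = current[:shared] + suffix
        return current

    def locate(self, string):
        """Return the 1-based ID of string, or 0 if it is not stored."""
        bucket = bisect_right(self.headers, string) - 1  # binary search on headers
        if bucket < 0:
            return 0
        current, string_id = self.headers[bucket], bucket * self.b + 1
        if current == string:
            return string_id
        for shared, suffix in self.buckets[bucket]:      # sequential decoding
            current, string_id = current[:shared] + suffix, string_id + 1
            if current == string:
                return string_id
        return 0


pfc = BucketedPFC(["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"])
print(pfc.locate("Antivirus Software"), pfc.extract(5))
```

A larger b gives better compression but longer sequential decoding inside each bucket.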
More on Compressed Dictionaries: Martínez-Prieto, M. A., Brisaboa, N., Cánovas, R., Claude, F., &
Navarro, G. (2016). Practical compressed string dictionaries. Information Systems, 56, 73-108.
Bitmap Triples Encoding
 Bitmap Triples: [Figure: subjects, predicates and objects levels]
 We index the bitsequences to provide a SPO index
 Represent and index large volumes of data
 ~ Theoretical minimum space while serving efficient operations:
 Mostly based on 3 operations:
 Access
 Rank
 Select
Remember…. Succinct Data Structures
 Bitmap Sequence.
 Operations in constant time
 access(position) = Value.
 rank(position) = “Number of ones, up to position”.
 select(i) = “Position of the i-th one”.
 Implementation:
 n + o(n) bits
 Adjustable space overhead: in practice, 37.5% overhead
Bit Sequence Coding
Bitmap:   1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 0
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
access(14) = 1
rank(7) = 3
select(5) = 11
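The three operations can be sketched over the 16-bit sequence shown on this slide. This version is not succinct: it answers in constant time by storing full prefix counts and the list of one-positions, which costs far more than the n + o(n) bits of a real implementation. The class name `Bitmap` is illustrative.

```python
from itertools import accumulate


class Bitmap:
    """Bit sequence with 1-based access, rank and select."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] = number of ones in positions 1..i
        self.prefix = [0] + list(accumulate(self.bits))
        # ones[i - 1] = position of the i-th one
        self.ones = [i for i, bit in enumerate(self.bits, start=1) if bit]

    def access(self, position):
        return self.bits[position - 1]

    def rank(self, position):
        return self.prefix[position]

    def select(self, i):
        return self.ones[i - 1]


bitmap = Bitmap([1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
print(bitmap.access(14), bitmap.rank(7), bitmap.select(5))
```

rank and select are inverse in the sense that `rank(select(i)) == i` for every stored one.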
 E.g. retrieve (2,5,?)
 Find the position of the second ‘1’-bit in Bp (select)
 Binary search on the list of predicates looking for 5
 Note that predicate 5 is at position 4 of Sp
 Find the position of the fourth ‘1’-bit in Bo (select)
 Patterns resolved: SPO, SP?, S?O, S??, ???
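The lookup steps for an (S,P,?) pattern can be sketched as below. The sequences `Sp`, `Bp`, `So` and `Bo` are hypothetical values made up for this example (the figure data is not recoverable here), and the sketch assumes that a '1' bit closes the list of each parent node. `select` is a plain linear scan for readability.

```python
from bisect import bisect_left


def select(bitmap, i):
    """1-based position of the i-th 1-bit; 0 when i == 0."""
    if i == 0:
        return 0
    seen = 0
    for position, bit in enumerate(bitmap, start=1):
        seen += bit
        if seen == i:
            return position
    raise IndexError("bitmap has fewer than %d ones" % i)


def list_bounds(bitmap, k):
    """1-based (first, last) positions of the k-th list."""
    return select(bitmap, k - 1) + 1, select(bitmap, k)


def objects_of(s, p, Sp, Bp, So, Bo):
    """Resolve the pattern (s, p, ?) and return the object IDs."""
    first, last = list_bounds(Bp, s)            # predicates of subject s
    idx = bisect_left(Sp, p, first - 1, last)   # binary search inside that list
    if idx == last or Sp[idx] != p:
        return []
    o_first, o_last = list_bounds(Bo, idx + 1)  # objects of that (s, p) pair
    return So[o_first - 1:o_last]


# Hypothetical ID triples:
# (1,2,3) (1,2,4) (1,3,1) (2,1,2) (2,5,1) (2,5,3)
Sp = [2, 3, 1, 5]
Bp = [0, 1, 0, 1]
So = [3, 4, 1, 2, 1, 3]
Bo = [0, 1, 1, 1, 0, 1]

print(objects_of(2, 5, Sp, Bp, So, Bo))
```

With these values, predicate 5 of subject 2 also sits at position 4 of `Sp`, so the object list is delimited by the third and fourth '1' bits of `Bo`.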
Additional Predicate Index
 Encoded as a position list (resolves ? P ?)
 Figure data: predicates 1 to 5, Sl = 5 6 2 3 1 4, Bl = 0 1 1 1 1 1
 E.g. retrieve (?,4,?)
 Get the location of the list: Bl.select(3)+1 = 4+1 = 5
 Retrieve the position: Sl[5] = 1
 Get the associated subject: rank(1) = 1st subject
 Access the objects as (1,4,?)
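A sketch of the (?,P,?) lookup through the predicate position list. `Sl` and `Bl` are the values from this slide; the subject bitmap `Bp` is an assumption reconstructed for illustration, as is the convention that a '1' bit closes each list, under which the subject owning position pos is `rank(Bp, pos - 1) + 1`.

```python
def select(bitmap, i):
    """1-based position of the i-th 1-bit; 0 when i == 0."""
    if i == 0:
        return 0
    seen = 0
    for position, bit in enumerate(bitmap, start=1):
        seen += bit
        if seen == i:
            return position
    raise IndexError("bitmap has fewer than %d ones" % i)


def rank(bitmap, position):
    """Number of 1-bits in positions 1..position."""
    return sum(bitmap[:position])


def positions_of_predicate(p, Sl, Bl):
    """Positions in the predicate sequence Sp where predicate p occurs."""
    first = select(Bl, p - 1) + 1
    last = select(Bl, p)
    return Sl[first - 1:last]


def subjects_with_predicate(p, Sl, Bl, Bp):
    """Subject IDs that use predicate p, one per occurrence."""
    return [rank(Bp, pos - 1) + 1 for pos in positions_of_predicate(p, Sl, Bl)]


Sl = [5, 6, 2, 3, 1, 4]   # from the slide
Bl = [0, 1, 1, 1, 1, 1]   # from the slide
Bp = [1, 0, 0, 1, 1, 1]   # assumed subject bitmap, not from the slide

print(positions_of_predicate(4, Sl, Bl))
print(subjects_with_predicate(4, Sl, Bl, Bp))
```

Each returned subject can then be combined with p and resolved as an (S,P,?) pattern to reach the objects.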
Additional Object Index
 Encoded as a position list (resolves ? P O and ? ? O)
 Figure data: objects 1 to 5, Sl = 2 5 6 4 3 3 1, Bl = 0 0 1 1 1 1 1
On-the-fly indexes: HDT-FoQ
 From the exchanged HDT to the functional HDT-FoQ:
 Publish and Exchange HDT
 At the consumer:
# | Process | Type of index | Patterns
1 | Index the bitsequences | Subject (SPO) | SPO, SP?, S??, S?O, ???
2 | Index the position of each predicate (just a position list) | Predicate (PSO) | ?P?
3 | Index the position of each object (just a position list) | Object (OPS) | ?PO, ??O
HDT acknowledged as a W3C Member Submission:
 http://www.w3.org/Submission/2011/03/
Some numbers on size
http://dataweb.infor.uva.es/projects/hdt-mr/
José M. Giménez-García, Javier D. Fernández, and Miguel A. Martínez-Prieto. HDT-MR: A
Scalable Solution for RDF Compression with HDT and MapReduce. In Proc. of
International Semantic Web Conference (ISWC), 2015
28,362,198,927 Triples
 Data is ready to be consumed 10-15x faster.
 HDT << any other RDF format || RDF engine
 Competitive query performance.
 Very fast on triple patterns, 1.5x faster than Virtuoso and RDF-3X.
 Integration with Jena
 Joins on the same scale as existing solutions (Virtuoso, RDF-3X).
Results
rdfhdt.org community
https://github.com/rdfhdt,
C++ and Java tools
Only in the last two weeks…
HDT-cpp
A Linked Data hacker toolkit
2) LOD Laundromat (http://lodlaundromat.org/)
 Uses HDT as the main storage solution
 Challenges:
 You still need to query 650K datasets
 Of course it does not contain all LOD, but “a good approximation”
Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., & Schlobach, S. (2014, October). LOD laundromat: a
uniform way of publishing other people’s dirty data. In ISWC (pp. 213-228).
3) Linked Data Fragments
 Typically uses HDT as the main engine
 Challenges:
 Still room for optimization for complex federated queries (delays,
intermediate results, …)
Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014).
Querying datasets on the web with high availability. In ISWC (pp. 180-196).
Get more than 650K HDT datasets from
LOD Laundromat…
And query them online
 Scalable storage
 Store and serve thousands of (large) datasets (e.g. LOD Laundromat)
 Archiving (reduce storage costs + foster smart consumption)
 Also with deltas (see https://aic.ai.wu.ac.at/qadlod/bear.html)
 Better consumer-centric publication
 Compress and share ready-to-consume RDF datasets
 Consumption with limited resources
 smartphones, standard laptops
 Fast, low-cost SPARQL Query Engine
 Via HDT-Jena
 Via Linked Data Fragments
Application scenarios (HDT)
 Storage + Light API
 http://lodlaundromat.org/
 http://linkeddatafragments.org/
 Storage + SPARQL Query Engine
 https://data.world/
 Advanced features
 Top k Shortest Path
 Query Answering over the Web of Data
 Others: Versioning, streaming,…
Application Examples
Application Examples
Publication in HDT (~1B triples): https://zenodo.org/record/1116889#.WuBt0C7FKpo
Uri resolver based on HDT: https://github.com/pharmbio/urisolve
91,498,351 compounds from PubChem with predicted logD (water–octanol distribution coefficient) values at 90% confidence level
But what about Web-scale queries
 E.g. retrieve all entities referring to the gene WBGene00000001 (aap-1)
 Solutions?
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?x WHERE {
  ?x dcterms:title "WBGene00000001 (aap-1)" .
}
LOD-a-lot
But what about Web-scale queries
- flashback -
LOD-a-lot
[Figure: pipeline. Linked Open Data, crawled and cleaned by LOD Laundromat into Dataset 1 … Dataset 650K (N-Triples, zip), indexed and stored behind a SPARQL endpoint (metadata), then integrated into LOD-a-lot: 28B triples]
 Disk size:
 HDT: 304 GB
 HDT-FoQ (additional indexes): 133 GB
 Memory footprint (to query):
 15.7 GB of RAM (3% of the size)
 144 seconds loading time
 8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS
 LDF page resolution in milliseconds.
LOD-a-lot (some numbers)
305€
(LOD-a-lot creation took 64 h & 170GB RAM. HDT-FoQ took 8 h & 250GB RAM)
https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot
LOD-a-lot
 Query resolution at Web scale
 Using LDF, Jena
 Evaluation and Benchmarking
 No excuse :)
 RDF metrics and analytics
LOD-a-lot (some use cases)
 Identity closure
 ?x owl:sameAs ?y
 Graph navigations
 E.g. shortest path, random walk
LOD-a-lot (some use cases)
More use cases: Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
 Update LOD-a-lot regularly
 More and newer datasets from the LOD Cloud
 Leverage the HDT indexes to support “data science”
 E.g. get links across datasets, study the topology of the network, optimize
query planning
 Support provenance of the triples (i.e. origin of each triple)
 Currently supported only via LOD Laundromat
 … implement the use cases and help the community to democratize
the access to LOD
Roadmap
 We are currently facing Big Linked Data challenges
 Generation, publication and consumption
 Archiving, evolution…
 Thanks to compression/HDT, the Big Linked Data
today will be the “pocket” data tomorrow
 HDT democratizes the access to Big Linked Data
= Cheap, scalable consumers
low-cost access to LOD = high-impact research
Take-home messages
ACKs
Thank you!
javier.fernandez@wu.ac.at
Kudos to all the co-authors involved in the works presented here
Incomplete list of ACKs:
Miguel A. Martínez-Prieto
Mario Arias
Pablo de la Fuente
Claudio Gutierrez
Axel Polleres
Wouter Beek
Ruben Verborgh
… And many others

More Related Content

What's hot

Framester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubFramester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubMehwish Alam
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogsandrea huang
 
A Main Memory Index Structure to Query Linked Data
A Main Memory Index Structure to Query Linked DataA Main Memory Index Structure to Query Linked Data
A Main Memory Index Structure to Query Linked DataOlaf Hartig
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online NewsBernardo Najlis
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationBoris Glavic
 
Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information RetrievalShadi Saleh
 
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...andrea huang
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
 
Cross language information retrieval (clir)slide
Cross language information retrieval (clir)slideCross language information retrieval (clir)slide
Cross language information retrieval (clir)slideMohd Iqbal Al-farabi
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMfnothaft
 
Using NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 LiteratureUsing NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 LiteratureDatabricks
 
Ir 1 lec 7
Ir 1 lec 7Ir 1 lec 7
Ir 1 lec 7alaa223
 
Design for Scalability in ADAM
Design for Scalability in ADAMDesign for Scalability in ADAM
Design for Scalability in ADAMfnothaft
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?andrea huang
 
Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]Mariana Damova, Ph.D
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online NewsBernardo Najlis
 

What's hot (19)

2017 biological databases_part1_vupload
2017 biological databases_part1_vupload2017 biological databases_part1_vupload
2017 biological databases_part1_vupload
 
Framester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubFramester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data Hub
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs
 
A Main Memory Index Structure to Query Linked Data
A Main Memory Index Structure to Query Linked DataA Main Memory Index Structure to Query Linked Data
A Main Memory Index Structure to Query Linked Data
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online News
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database Virtualization
 
Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information Retrieval
 
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
Cross language information retrieval (clir)slide
Cross language information retrieval (clir)slideCross language information retrieval (clir)slide
Cross language information retrieval (clir)slide
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAM
 
Using NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 LiteratureUsing NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 Literature
 
Ir 1 lec 7
Ir 1 lec 7Ir 1 lec 7
Ir 1 lec 7
 
Design for Scalability in ADAM
Design for Scalability in ADAMDesign for Scalability in ADAM
Design for Scalability in ADAM
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?
 
Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online News
 

Similar to Democratizing Big Semantic Data management

FAIR Data Prototype - Interoperability and FAIRness through a novel combinati...
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati...FAIR Data Prototype - Interoperability and FAIRness through a novel combinati...
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati...Mark Wilkinson
 
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordForce11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordMark Wilkinson
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic WebIvan Herman
 
IBC FAIR Data Prototype Implementation slideshow
IBC FAIR Data Prototype Implementation   slideshowIBC FAIR Data Prototype Implementation   slideshow
IBC FAIR Data Prototype Implementation slideshowMark Wilkinson
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Ken Mwai
 
2009 0807 Lod Gmod
2009 0807 Lod Gmod2009 0807 Lod Gmod
2009 0807 Lod GmodJun Zhao
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2Daniel JACOB
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!Chris Mattmann
 
Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?Fru Louis
 
Aidan's PhD Viva
Aidan's PhD VivaAidan's PhD Viva
Aidan's PhD VivaAidan Hogan
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
A Generic Mapping-based Query Translation from SPARQL to Various Target Datab...
A Generic Mapping-based Query Translation from SPARQL to Various Target Datab...A Generic Mapping-based Query Translation from SPARQL to Various Target Datab...
A Generic Mapping-based Query Translation from SPARQL to Various Target Datab...Franck Michel
 
Optimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the webOptimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the webMahdi Atawneh
 

Similar to Democratizing Big Semantic Data management (20)

FAIR Data Prototype - Interoperability and FAIRness through a novel combinati...
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati...FAIR Data Prototype - Interoperability and FAIRness through a novel combinati...
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati...
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
 
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordForce11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, Oxford
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
IR with lucene
IR with luceneIR with lucene
IR with lucene
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
IBC FAIR Data Prototype Implementation slideshow
IBC FAIR Data Prototype Implementation   slideshowIBC FAIR Data Prototype Implementation   slideshow
IBC FAIR Data Prototype Implementation slideshow
 
Hadoop
HadoopHadoop
Hadoop
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Slides 111017220255-phpapp01
Slides 111017220255-phpapp01
 
2009 0807 Lod Gmod
2009 0807 Lod Gmod2009 0807 Lod Gmod
2009 0807 Lod Gmod
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!
 
Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?
 
Aidan's PhD Viva
Aidan's PhD VivaAidan's PhD Viva
Aidan's PhD Viva
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
The Python Programming Language and HDF5: H5Py
The Python Programming Language and HDF5: H5PyThe Python Programming Language and HDF5: H5Py
The Python Programming Language and HDF5: H5Py
 
A Generic Mapping-based Query Translation from SPARQL to Various Target Datab...
A Generic Mapping-based Query Translation from SPARQL to Various Target Datab...A Generic Mapping-based Query Translation from SPARQL to Various Target Datab...
A Generic Mapping-based Query Translation from SPARQL to Various Target Datab...
 
Optimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the webOptimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the web
 

Recently uploaded

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Democratizing Big Semantic Data management

  • 1. Democratizing Big Semantic Data management, or how to query an RDF graph with 28 billion triples on a standard laptop. Javier D. Fernández (WU Vienna, Austria; Complexity Science Hub Vienna, Austria; Privacy and Sustainable Computing Lab, Austria). Stanford Center for Biomedical Informatics, April 25th, 2018.
  • 2. The Linked Open Data cloud (2018)
      - ~10K datasets organized into 9 domains, which include many and varied knowledge fields.
      - 150B statements, including entity descriptions and (inter/intra-dataset) links between them.
      - >500 live endpoints serving this data.
      http://lod-cloud.net/ http://stats.lod2.eu/ http://sparqles.ai.wu.ac.at/
  • 3. But what about Web-scale queries?
      - E.g. retrieve all entities in LOD referring to the gene WBGene00000001 (aap-1):
        select distinct ?x { ?x dcterms:title "WBGene00000001 (aap-1)" . }
      - Solutions?
  • 4. Let’s fish in our Linked Data Eco System
  • 5. A) Federated Queries!! (The Web of Data Eco System)
      1. Get a list of potential SPARQL Endpoints: datahub.io, LOV, other catalogs?
      2. Query each SPARQL Endpoint
      - Problems?
        - Many SPARQL Endpoints have low availability (http://sparqles.ai.wu.ac.at/)
  • 6. A) Federated Queries!!
      1. Get a list of potential SPARQL Endpoints: datahub.io, LOV, other catalogs?
      2. Query each SPARQL Endpoint
      - Problems?
        - Many SPARQL Endpoints have low availability
        - SPARQL Endpoints are usually restricted (timeout, number of results)
        - Moreover, complex queries (joins) can be tricky due to intermediate results, delays, etc.
  • 7. B) Follow-your-nose
      1. Follow self-descriptive IRIs and links
      2. Filter the results you are interested in
      - Problems?
        - You need some initial seed (DBpedia could be a good start)
        - It’s slow (fetching many documents)
        - Where should I start for unbounded queries? E.g. ?x dcterms:title "WBGene00000001 (aap-1)"
  • 8. C) Use the RDF dumps by yourself
      1. Crawl the Web of Data: probably start with datahub.io, LOV, other catalogs?
      2. Download datasets: you had better have some free space on your machine
      3. Index the datasets locally: you had better be patient and survive parsing errors
      4. Query all datasets: you had better still be alive by then
      - Problems?
        - Huge resources!
        - + Messiness of the data
      Rietveld, L., Beek, W., & Schlobach, S. (2015). LOD lab: Experiments at LOD scale. In ISWC
  • 9. The problem is in the roots (Big tree = big roots)
      - Publication, Exchange and Consumption of large RDF datasets
      - Most RDF formats (N3, XML, Turtle) are text serializations, designed for human readability (not for machines)
      - Verbose = High costs to write/exchange/parse
      - A basic offline search = (decompress) + index the file + search
      Steve Garry
  • 10. A Linked Data hacker toolkit: 1) HDT (rdfhdt.org)
      - Highly compact serialization of RDF
      - Allows fast RDF retrieval in compressed space (without prior decompression)
      - Includes internal indexes to solve basic queries with a small (3%) memory footprint
      - Very fast on basic queries (triple patterns): 1.5x faster than Virtuoso, Jena, RDF3X
      - Supports full SPARQL as the compressed backend store of Jena, with an efficiency on the same scale as current, more optimized solutions
      - Challenges:
        - The publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to be consumed efficiently!)
      - Example with 431 M triples: N-Triples ~63 GB; N-Triples + gzip 5 GB; HDT 6.6 GB (slightly more, but you can query!)
  • 11. HDT (Header-Dictionary-Triples) Overview: an RDF dataset is split into three components
      - Header: metadata describing the RDF dataset
      - Dictionary: mapping between IDs and elements in the dataset (e.g. 1 = aa.., 2 = ab.., 3 = bu..)
      - Triples: structure of the data after the ID replacement
  • 12. Header (in RDF)
      - Publication Metadata: Publisher, Public Endpoint, …
      - Statistical Metadata: Number of triples, subjects, entities, histograms…
      - Format Metadata: Description of the data structure, e.g. Triples Order.
      - Additional Metadata: Domain-specific.
      - …
  • 16. Dictionary
      - Mapping of strings to correlative IDs {1..n}.
      - Lexicographically sorted, no duplicates.
      - Prefix-based compression in each section.
      - Efficient ID-to-String and String-to-ID operations.
  • 17. Dictionary. Plain Front Coding (PFC): prefix-based compression used in each section
      1. Each string is encoded with two values:
         - An integer representing the number of characters shared with the previous string
         - A sequence of characters representing the suffix that is not shared with the previous string
      Example: A, An, Ant, Antivirus, Antivirus Software, Best are encoded as (0,a) (1,n) (2,t) (3,ivirus) (9, Software) (0,Best)
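
The two-value encoding on slide 17 can be sketched in a few lines of Python. This is only an illustration of plain front coding on the slide's example strings, not the HDT implementation; the function names `front_encode` and `front_decode` are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_strings):
    """Encode each string as (chars shared with the previous string, remaining suffix)."""
    pairs, prev = [], ""
    for s in sorted_strings:
        k = shared_prefix_len(prev, s)
        pairs.append((k, s[k:]))
        prev = s
    return pairs


def front_decode(pairs):
    """Rebuild the original strings by replaying the pairs in order."""
    strings, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        strings.append(prev)
    return strings


words = ["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"]
encoded = front_encode(words)
print(encoded)
```

Decoding is strictly sequential: a string can only be rebuilt after the one before it, which is what motivates the buckets introduced on the next slide.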
  • 18. Dictionary. Plain Front Coding (PFC): prefix-based compression used in each section
      2. The vocabulary is split into buckets, each of them storing “b” strings
         - The first string of each bucket (header) is coded explicitly (i.e. full string)
         - The subsequent b-1 strings (internal strings) are coded differentially
      Example (IDs 1 to 6): Bucket 1 = a (1,n) (2,t); Bucket 2 = Antivirus (9, Software) (0,Best)
  • 19. Dictionary. Plain Front Coding (PFC): prefix-based compression used in each section
      3. PFC is encoded with a byte sequence and an array of pointers (ptr) to denote the first byte of each bucket
      Example (IDs 1 to 6): ptr = 1, 9; Bucket 1 = a (1,n) (2,t); Bucket 2 = Antivirus (9, Software) (0,Best)
  • 20. Dictionary. Plain Front Coding (PFC): prefix-based compression used in each section
      - locate(string) performs a binary search in the headers + sequential decoding of internal strings, e.g. locate(Antivirus Software) = 5
      - extract(id) finds the bucket id/b and decodes until the given position, e.g. extract(5) = Antivirus Software
      Example (IDs 1 to 6): ptr = 1, 9; Bucket 1 = a (1,n) (2,t); Bucket 2 = Antivirus (9, Software) (0,Best)
      More on Compressed Dictionaries: Martínez-Prieto, M. A., Brisaboa, N., Cánovas, R., Claude, F., & Navarro, G. (2016). Practical compressed string dictionaries. Information Systems, 56, 73-108.
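
The bucketed dictionary with locate and extract can be sketched as follows. This is a simplified illustration under stated assumptions: buckets are kept as Python lists instead of a byte sequence with a pointer array, IDs are 1-based so the bucket of an ID is (id - 1) // b, and the class name `PFCDictionary` is hypothetical.

```python
from bisect import bisect_right


class PFCDictionary:
    """Front-coded dictionary split into buckets of b strings (in-memory sketch)."""

    def __init__(self, sorted_strings, b):
        self.b = b
        self.n = len(sorted_strings)
        self.headers = []   # first string of each bucket, stored in full
        self.internal = []  # per bucket: (shared length, suffix) of the other strings
        for start in range(0, self.n, b):
            chunk = sorted_strings[start:start + b]
            self.headers.append(chunk[0])
            pairs, prev = [], chunk[0]
            for s in chunk[1:]:
                k = 0
                while k < min(len(prev), len(s)) and prev[k] == s[k]:
                    k += 1
                pairs.append((k, s[k:]))
                prev = s
            self.internal.append(pairs)

    def _decode(self, bucket):
        """Yield the strings of one bucket in order, decoding sequentially."""
        current = self.headers[bucket]
        yield current
        for k, suffix in self.internal[bucket]:
            current = current[:k] + suffix
            yield current

    def locate(self, s):
        """Return the 1-based ID of s, or 0 if s is not in the dictionary."""
        bucket = bisect_right(self.headers, s) - 1  # binary search on headers
        if bucket < 0:
            return 0
        for offset, candidate in enumerate(self._decode(bucket)):
            if candidate == s:
                return bucket * self.b + offset + 1
        return 0

    def extract(self, string_id):
        """Return the string with the given 1-based ID."""
        if not 1 <= string_id <= self.n:
            raise IndexError("ID out of range")
        bucket, offset = divmod(string_id - 1, self.b)
        for pos, candidate in enumerate(self._decode(bucket)):
            if pos == offset:
                return candidate


words = ["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"]
d = PFCDictionary(words, b=3)
print(d.locate("Antivirus Software"), d.extract(5))
```

Larger buckets compress better because fewer headers are stored in full, at the price of more sequential decoding per lookup.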
  • 21. Bitmap Triples Encoding (figure: Bitmap Triples with subjects, predicates and objects)
      - We index the bitsequences to provide an SPO index
  • 22. Remember… Succinct Data Structures
      - Represent and index large volumes of data
      - ~ Theoretical minimum space while serving efficient operations
      - Mostly based on 3 operations: Access, Rank, Select
  • 23. Bit Sequence Coding
      - Bitmap Sequence. Operations in constant time:
        - access(position) = value at that position.
        - rank(position) = number of ones up to position.
        - select(i) = position of the i-th one.
      - Implementation: n + o(n) bits. Adjustable space overhead: in practice, 37.5% overhead.
      Example (positions 1 to 16): 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0; rank(7) = 4, select(5) = 9, access(14) = 1
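
The three operations can be illustrated with a small Python class. This is a sketch for intuition only: it stores a full prefix-count array, so it is not succinct (real implementations use n + o(n) bits), and select uses a binary search instead of running in constant time. The bit sequence is an assumption chosen to reproduce the stated results rank(7) = 4, select(5) = 9 and access(14) = 1; the class name `BitSequence` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


class BitSequence:
    """Bitmap with 1-based access, rank and select (non-succinct sketch)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] = number of ones in positions 1..i
        self.prefix = [0] + list(accumulate(self.bits))

    def access(self, position):
        """Value of the bit at the given 1-based position."""
        return self.bits[position - 1]

    def rank(self, position):
        """Number of ones in positions 1..position."""
        return self.prefix[position]

    def select(self, i):
        """Position of the i-th one."""
        if not 1 <= i <= self.prefix[-1]:
            raise ValueError("fewer than i ones in the sequence")
        # first position whose prefix count reaches i
        return bisect_left(self.prefix, i)


bs = BitSequence([1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0])
print(bs.rank(7), bs.select(5), bs.access(14))
```

Rank and select are inverses in the sense that rank(select(i)) = i, which is what lets the triples component jump between a list and the element that owns it.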
  • 24. Bitmap Triples Encoding (figure: Bitmap Triples with subjects, predicates and objects)
      - Supported patterns: SPO, SP?, S?O, S??, ???
      - E.g. retrieve (2,5,?):
        - Find the position of the second ‘1’-bit in Bp (select)
        - Binary search on the list of predicates looking for 5
        - Note that such predicate 5 is in position 4 of Sp
        - Find the position of the fourth ‘1’-bit in Bo (select)
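
The (2,5,?) walk-through can be sketched in Python. Everything here is illustrative: the triples are hypothetical IDs (chosen so that predicate 5 of subject 2 sits at position 4 of Sp), a 1-bit is assumed to mark the last element of each list, subject IDs are assumed to be contiguous from 1, and `select1` scans linearly where a real bitmap would answer in constant time.

```python
from bisect import bisect_left


def select1(bits, i):
    """1-based position of the i-th 1-bit; 0 when i == 0."""
    if i == 0:
        return 0
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        seen += bit
        if seen == i:
            return pos
    raise IndexError("fewer than i ones")


def build_bitmap_triples(triples):
    """Build (Sp, Bp, So, Bo) from ID triples sorted by subject, predicate, object."""
    Sp, Bp, So, Bo = [], [], [], []
    prev_s = prev_p = None
    for s, p, o in sorted(set(triples)):
        if (s, p) != (prev_s, prev_p):
            if Bo:
                Bo[-1] = 1          # close the previous object list
            if s != prev_s and Bp:
                Bp[-1] = 1          # close the previous subject's predicate list
            Sp.append(p)
            Bp.append(0)
            prev_s, prev_p = s, p
        So.append(o)
        Bo.append(0)
    if Bp:
        Bp[-1] = 1
        Bo[-1] = 1
    return Sp, Bp, So, Bo


def objects_of(bt, s, p):
    """Resolve the pattern (s, p, ?) and return the matching object IDs."""
    Sp, Bp, So, Bo = bt
    lo, hi = select1(Bp, s - 1), select1(Bp, s)  # predicate list of subject s
    k = bisect_left(Sp, p, lo, hi)               # binary search for p
    if k == hi or Sp[k] != p:
        return []
    pair = k + 1                                 # 1-based position of (s, p) in Sp
    return So[select1(Bo, pair - 1):select1(Bo, pair)]


# Hypothetical ID triples, for illustration only.
triples = [(1, 4, 3), (2, 2, 1), (2, 3, 2), (2, 5, 4), (2, 5, 6), (3, 1, 3), (4, 1, 5)]
bt = build_bitmap_triples(triples)
print(bt)
print(objects_of(bt, 2, 5))
```

Because the triples are sorted by subject, then predicate, then object, every pattern with a bound subject reduces to select operations plus one binary search over a short list.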
  • 25. Additional Predicate Index (pattern ?P?), encoded as a position list
      - E.g. retrieve (?,4,?):
        - Get the location of the list: Bl.select(3)+1 = 4+1 = 5
        - Retrieve the position: Sl[5] = 1
        - Get the associated subject: rank(1) = 1st subject
        - Access the object as (1,4,?)
      Figure (predicates 1 to 5): Sl = 5 6 2 3 1 4; Bl = 0 1 1 1 1 1
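
A predicate index of this kind can be sketched as an inverted list of positions in Sp. The sketch below is illustrative: Sp and Bp are assumed values (a 1-bit in Bp is assumed to mark the last predicate of each subject, so the subject owning a position is the rank up to the previous position plus one), predicate IDs are assumed to be contiguous from 1, and the helper names are hypothetical. With this Sp, the resulting lists reproduce the lookup in the example: Bl.select(3)+1 = 5 and Sl[5] = 1.

```python
def select1(bits, i):
    """1-based position of the i-th 1-bit; 0 when i == 0."""
    if i == 0:
        return 0
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        seen += bit
        if seen == i:
            return pos
    raise IndexError("fewer than i ones")


def rank1(bits, pos):
    """Number of 1-bits in positions 1..pos."""
    return sum(bits[:pos])


def build_predicate_index(Sp):
    """Return (Sl, Bl): positions of each predicate in Sp, with 1 closing each list."""
    by_pred = {}
    for pos, p in enumerate(Sp, start=1):
        by_pred.setdefault(p, []).append(pos)
    Sl, Bl = [], []
    for p in sorted(by_pred):  # assumes IDs 1..n all occur in Sp
        positions = by_pred[p]
        Sl.extend(positions)
        Bl.extend([0] * (len(positions) - 1) + [1])
    return Sl, Bl


def subjects_with_predicate(Sp, Bp, Sl, Bl, p):
    """Resolve (?, p, ?) down to (subject, position in Sp) pairs."""
    start, end = select1(Bl, p - 1), select1(Bl, p)
    return [(rank1(Bp, pos - 1) + 1, pos) for pos in Sl[start:end]]


Sp = [4, 2, 3, 5, 1, 1]   # assumed predicate sequence
Bp = [1, 0, 0, 1, 1, 1]   # assumed bitmap: 1 closes each subject's list
Sl, Bl = build_predicate_index(Sp)
print(Sl, Bl)
print(subjects_with_predicate(Sp, Bp, Sl, Bl, 4))
```

Each (subject, position) pair then gives direct access to the corresponding object list, as in the last step of the example.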
  • 26. Additional Object Index (patterns ??O and ?PO), encoded as a position list
      Figure (predicates 1 to 5): Sl = 5 6 2 3 1 4; Bl = 0 1 1 1 1 1
      Figure (objects 1 to 5): Sl = 2 5 6 4 3 3 1; Bl = 0 0 1 1 1 1 1
  • 27. On-the-fly indexes: HDT-FoQ
      - From the exchanged HDT to the functional HDT-FoQ:
        - Publish and Exchange HDT
        - At the consumer:
      | Step | Process | Type of Index | Patterns |
      | 1 | Index the bitsequences | Subject (SPO) | SPO, SP?, S??, S?O, ??? |
      | 2 | Index the position of each predicate (just a position list) | Predicate (PSO) | ?P? |
      | 3 | Index the position of each object (just a position list) | Object (OPS) | ?PO, ??O |
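
The pattern-to-index assignment in the table can be written as a small dispatch function. This is a sketch of the mapping only; the function name `choose_index` and the use of `None` for an unbound term are assumptions for illustration.

```python
def choose_index(s, p, o):
    """Pick the HDT-FoQ index for a triple pattern; None means the term is unbound."""
    if s is not None or (p is None and o is None):
        return "SPO"   # SPO, SP?, S??, S?O and the full scan ???
    if o is not None:
        return "OPS"   # ?PO and ??O
    return "PSO"       # ?P?


patterns = {
    "SPO": (1, 2, 3), "SP?": (1, 2, None), "S??": (1, None, None),
    "S?O": (1, None, 3), "???": (None, None, None),
    "?P?": (None, 2, None), "?PO": (None, 2, 3), "??O": (None, None, 3),
}
for name, (s, p, o) in patterns.items():
    print(name, choose_index(s, p, o))
```

All eight triple patterns are covered by the three indexes.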
  • 29. Some numbers on size: 28,362,198,927 triples. http://dataweb.infor.uva.es/projects/hdt-mr/ José M. Giménez-García, Javier D. Fernández, and Miguel A. Martínez-Prieto. HDT-MR: A Scalable Solution for RDF Compression with HDT and MapReduce. In Proc. of International Semantic Web Conference (ISWC), 2015
  • 30. Results
      - Data is ready to be consumed 10-15x faster.
      - HDT << any other RDF format || RDF engine
      - Competitive query performance:
        - Very fast on triple patterns: 1.5x faster (Virtuoso, RDF3x).
        - Integration with Jena
        - Joins on the same scale as existing solutions (Virtuoso, RDF3x).
  • 32. https://github.com/rdfhdt, C++ and Java tools Only in the last two weeks… HDT-cpp
  • 33. A Linked Data hacker toolkit: 2) LOD Laundromat (http://lodlaundromat.org/), uses HDT as the main storage solution
      - Challenges:
        - You still need to query 650K datasets
        - Of course it does not contain all LOD, but “a good approximation”
      Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., & Schlobach, S. (2014, October). LOD laundromat: a uniform way of publishing other people’s dirty data. In ISWC (pp. 213-228).
  • 34. A Linked Data hacker toolkit: 3) Linked Data Fragments, typically uses HDT as the main engine
      - Challenges:
        - Still room for optimization for complex federated queries (delays, intermediate results, …)
      Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying datasets on the web with high availability. In ISWC (pp. 180-196).
  • 35. Get more than 650K HDT datasets from LOD Laundromat…
  • 36. And query them online
  • 37. Application scenarios (HDT)
      - Scalable storage
        - Store and serve thousands of (large) datasets (e.g. LOD Laundromat)
        - Archiving (reduce storage costs + foster smart consumption)
          - Also with deltas (see https://aic.ai.wu.ac.at/qadlod/bear.html)
      - Better consumer-centric publication
        - Compress and share ready-to-consume RDF datasets
      - Consumption with limited resources
        - Smartphones, standard laptops
      - Fast, low-cost SPARQL Query Engine
        - Via HDT-Jena
        - Via Linked Data Fragments
  • 38. Application Examples
      - Storage + Light API
        - http://lodlaundromat.org/
        - http://linkeddatafragments.org/
      - Storage + SPARQL Query Engine
        - https://data.world/
      - Advanced features
        - Top k Shortest Path
        - Query Answering over the Web of Data
      - Others: Versioning, streaming, …
  • 39. Application Examples
      - Publication in HDT (~1B triples): https://zenodo.org/record/1116889#.WuBt0C7FKpo
        - 91,498,351 compounds from PubChem with predicted logD (water–octanol distribution coefficient) values at 90% confidence level
      - URI resolver based on HDT: https://github.com/pharmbio/urisolve
  • 40. But what about Web-scale queries?
      - E.g. retrieve all entities referring to the gene WBGene00000001 (aap-1):
        select distinct ?x { ?x dcterms:title "WBGene00000001 (aap-1)" . }
      - Solutions?
  • 41. But what about Web-scale queries (flashback): LOD-a-lot
  • 42. LOD-a-lot (pipeline figure): LOD Laundromat crawls and cleans Linked Open Data into datasets 1 to 650K (N-Triples, zip), then indexes and stores them, with a SPARQL endpoint for metadata; LOD-a-lot integrates them into 28B triples.
  • 43. LOD-a-lot (some numbers)
      - Disk size:
        - HDT: 304 GB
        - HDT-FoQ (additional indexes): 133 GB
      - Memory footprint (to query):
        - 15.7 GB of RAM (3% of the size)
        - 144 seconds loading time
        - 8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS
      - LDF page resolution in milliseconds.
      - 305€
      - (LOD-a-lot creation took 64 h & 170 GB RAM. HDT-FoQ took 8 h & 250 GB RAM)
  • 45. LOD-a-lot (some use cases)
      - Query resolution at Web scale
        - Using LDF, Jena
      - Evaluation and Benchmarking
        - No excuse
      - RDF metrics and analytics (subjects, predicates, objects)
  • 46. LOD-a-lot (some use cases)
      - Identity closure
        - ?x owl:sameAs ?y
      - Graph navigations
        - E.g. shortest path, random walk
      More use cases: Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
  • 47. Roadmap
      - Update LOD-a-lot regularly
        - More and newer datasets from the LOD Cloud
      - Leverage the HDT indexes to support “data science”
        - E.g. get links across datasets, study the topology of the network, optimize query planning
      - Support provenance of the triples (i.e. origin of each triple)
        - Currently supported only via LOD Laundromat
      - … implement the use cases and help the community to democratize the access to LOD
  • 48. Take-home messages
      - We are currently facing Big Linked Data challenges
        - Generation, publication and consumption
        - Archiving, evolution…
      - Thanks to compression/HDT, the Big Linked Data of today will be the “pocket” data of tomorrow
      - HDT democratizes the access to Big Linked Data = cheap, scalable consumers
      - Low-cost access to LOD = high-impact research
  • 50. Thank you! javier.fernandez@wu.ac.at
      Kudos to all the co-authors involved in the works presented here. Incomplete list of ACKs: Miguel A. Martínez-Prieto, Mario Arias, Pablo de la Fuente, Claudio Gutierrez, Axel Polleres, Wouter Beek, Ruben Verborgh … and many others