Democratizing Big Semantic Data management


  1. Democratizing Big Semantic Data management. Javier D. Fernández, WU Vienna, Austria; Complexity Science Hub Vienna, Austria; Privacy and Sustainable Computing Lab, Austria. Stanford Center for Biomedical Informatics, April 25th, 2018. Or: how to query an RDF graph with 28 billion triples on a standard laptop.
  2. The Linked Open Data cloud (2018). ~10K datasets organized into 9 domains, covering many and varied knowledge fields. 150B statements, including entity descriptions and (inter-/intra-dataset) links between them. >500 live endpoints serving this data. http://lod-cloud.net/ http://stats.lod2.eu/ http://sparqles.ai.wu.ac.at/
  3. But what about Web-scale queries? E.g. retrieve all entities in LOD referring to the gene WBGene00000001 (aap-1). Solutions? Query: select distinct ?x { ?x dcterms:title "WBGene00000001 (aap-1)" . }
  4. Let's fish in our Linked Data ecosystem.
  5. A) Federated queries!! 1. Get a list of potential SPARQL endpoints (datahub.io, LOV, other catalogs?). 2. Query each SPARQL endpoint. Problems? Many SPARQL endpoints have low availability (http://sparqles.ai.wu.ac.at/). [The Web of Data ecosystem]
  6. A) Federated queries!! 1. Get a list of potential SPARQL endpoints (datahub.io, LOV, other catalogs?). 2. Query each SPARQL endpoint. Problems? Many SPARQL endpoints have low availability. SPARQL endpoints are usually restricted (timeouts, #results). Moreover, it can be tricky with complex queries (joins) due to intermediate results, delays, etc. [The Web of Data ecosystem]
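The availability problem is easy to reproduce. Below is a minimal sketch of option A in Python, assuming the SPARQLWrapper library and two well-known public endpoints as stand-ins for a catalog; the endpoint list, timeout value, and error handling are illustrative, not part of the talk.

```python
# Hedged sketch: fire the same triple-pattern query at several public
# endpoints; timeouts and failures are the norm, as the slide notes.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT DISTINCT ?x
WHERE { ?x <http://purl.org/dc/terms/title> "WBGene00000001 (aap-1)" . }
"""

endpoints = [  # illustrative stand-ins for a real endpoint catalog
    "https://dbpedia.org/sparql",
    "https://query.wikidata.org/sparql",
]

for url in endpoints:
    sparql = SPARQLWrapper(url)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    sparql.setTimeout(30)  # many endpoints are slow or unavailable
    try:
        results = sparql.query().convert()
        for binding in results["results"]["bindings"]:
            print(url, binding["x"]["value"])
    except Exception as exc:  # low availability in practice
        print(f"{url} failed: {exc}")
```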
  7. B) Follow-your-nose. 1. Follow self-descriptive IRIs and links. 2. Filter the results you are interested in. Problems? You need some initial seed (DBpedia could be a good start). It's slow (fetching many documents). And where should I start for unbounded queries such as ?x dcterms:title "WBGene00000001 (aap-1)"? [The Web of Data ecosystem]
  8. C) Use the RDF dumps by yourself. 1. Crawl the Web of Data (probably starting with datahub.io, LOV, other catalogs?). 2. Download the datasets (you'd better have some free space on your machine). 3. Index the datasets locally (you'd better be patient and survive parsing errors). 4. Query all datasets (you'd better still be alive by then). Problems? Huge resources! Plus the messiness of the data. [The Web of Data ecosystem] Rietveld, L., Beek, W., & Schlobach, S. (2015). LOD Lab: Experiments at LOD Scale. In ISWC.
  9. The problem is in the roots (big tree = big roots). Publication, exchange and consumption of large RDF datasets: most RDF formats (N3, XML, Turtle) are text serializations designed for human readability (not for machines). Verbose = high costs to write/exchange/parse. A basic offline search = (decompress +) index the file + search. (Photo: Steve Garry)
  10. A Linked Data hacker toolkit: 1) HDT. Highly compact serialization of RDF. Allows fast RDF retrieval in compressed space (without prior decompression). Includes internal indexes to solve basic queries with a small (3%) memory footprint. Very fast on basic queries (triple patterns), 1.5x faster than Virtuoso, Jena, RDF-3X. Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current, more optimized solutions. Challenges: the publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently!). Example: 431M triples take ~63 GB as N-Triples, 5 GB gzipped, and 6.6 GB as HDT (slightly more than gzip, but you can query!). rdfhdt.org
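For consumers, the payoff is that an HDT file can be queried directly. A minimal sketch, assuming the community Python bindings (package "hdt", a.k.a. pyHDT) and a placeholder file name; note that the HDT dictionary stores literals with their surrounding quotes.

```python
# Hedged sketch: open an HDT file and resolve a triple pattern
# without decompressing it first. "dataset.hdt" is a placeholder.
from hdt import HDTDocument

doc = HDTDocument("dataset.hdt")  # loads the file plus its side index

# Triple-pattern search: empty strings act as wildcards; literals are
# matched with their quotes, as stored in the HDT dictionary.
triples, cardinality = doc.search_triples(
    "", "http://purl.org/dc/terms/title", '"WBGene00000001 (aap-1)"')
print("estimated matches:", cardinality)
for s, p, o in triples:
    print(s, p, o)
```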
  11. HDT (Header-Dictionary-Triples) overview. Header: metadata describing the RDF dataset. Dictionary: mapping between IDs and the elements in the dataset (e.g. aa.. = 1, ab.. = 2, bu.. = 3). Triples: structure of the data after the ID replacement.
  12. Header (in RDF). Publication metadata: publisher, public endpoint, … Statistical metadata: number of triples, subjects, entities, histograms… Format metadata: description of the data structure, e.g. triples order. Additional metadata: domain-specific. …
  13. Header (example).
  14. Dictionary + Triples partition.
  15. Dictionary + Triples partition (example). Terms such as <http://example.org/Vienna>, <http://example.org/Javier>, <http://example.org/Paul>, <http://example.org/Researcher>, <http://example.org/Stefan>, ex:birthPlace, ex:workPlace, foaf:mbox, foaf:name, rdf:type, "jfergar@example.org", "jfergar@wu.ac.at" and "Vienna"@en are mapped to dictionary IDs 1-13; the triples are then stored as tuples of those IDs.
  16. Dictionary. Mapping of strings to correlative IDs {1..n}. Lexicographically sorted, no duplicates. Prefix-based compression in each section. Efficient ID-to-string and string-to-ID operations.
  17. Dictionary: Plain Front Coding (PFC). Prefix-based compression used in each section. 1. Each string is encoded with two values: an integer representing the number of characters shared with the previous string, and a sequence of characters representing the suffix that is not shared with the previous string. Example: A, An, Ant, Antivirus, Antivirus Software, Best becomes (0,A) (1,n) (2,t) (3,ivirus) (9, Software) (0,Best).
  18. Dictionary: Plain Front Coding (PFC). 2. The vocabulary is split into buckets, each of them storing b strings. The first string of each bucket (the header) is coded explicitly (i.e. the full string); the subsequent b-1 strings (internal strings) are coded differentially. Example (b=3): bucket 1 = A (1,n) (2,t); bucket 2 = Antivirus (9, Software) (0,Best).
  19. Dictionary: Plain Front Coding (PFC). 3. The PFC is encoded as a byte sequence plus an array of pointers (ptr) denoting the first byte of each bucket. Example: ptr = [1, 9]; bytes = A (1,n) (2,t) | Antivirus (9, Software) (0,Best).
  20. Dictionary: Plain Front Coding (PFC). locate(string) performs a binary search on the bucket headers plus sequential decoding of the internal strings, e.g. locate(Antivirus Software) = 5. extract(id) finds the bucket id/b and decodes until the given position, e.g. extract(5) = Antivirus Software. More on compressed dictionaries: Martínez-Prieto, M. A., Brisaboa, N., Cánovas, R., Claude, F., & Navarro, G. (2016). Practical Compressed String Dictionaries. Information Systems, 56, 73-108.
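A minimal, illustrative PFC sketch tying slides 17-20 together (bucket size b, locate and extract as described); real HDT packs the buckets into byte sequences, whereas this sketch keeps Python objects for clarity.

```python
# Illustrative Plain Front Coding: bucket headers stored in full,
# internal strings as (shared-prefix length, remaining suffix).
from bisect import bisect_right

class PFC:
    def __init__(self, sorted_strings, b=3):
        self.b = b
        self.buckets = []  # each bucket: [header, (lcp, suffix), ...]
        for i in range(0, len(sorted_strings), b):
            chunk = sorted_strings[i:i + b]
            bucket, prev = [chunk[0]], chunk[0]
            for s in chunk[1:]:
                lcp = 0
                while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
                    lcp += 1
                bucket.append((lcp, s[lcp:]))
                prev = s
            self.buckets.append(bucket)
        self.headers = [bkt[0] for bkt in self.buckets]

    def extract(self, idx):
        """1-based ID -> string: find the bucket, decode up to idx."""
        bkt = self.buckets[(idx - 1) // self.b]
        s = bkt[0]
        for lcp, suffix in bkt[1:(idx - 1) % self.b + 1]:
            s = s[:lcp] + suffix
        return s

    def locate(self, query):
        """string -> 1-based ID (or None): binary search on headers,
        then sequential decoding of internal strings."""
        j = bisect_right(self.headers, query) - 1
        if j < 0:
            return None
        bkt = self.buckets[j]
        s = bkt[0]
        if s == query:
            return j * self.b + 1
        for k, (lcp, suffix) in enumerate(bkt[1:], start=2):
            s = s[:lcp] + suffix
            if s == query:
                return j * self.b + k
        return None

d = PFC(["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"])
print(d.locate("Antivirus Software"))  # -> 5
print(d.extract(5))                    # -> Antivirus Software
```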
  21. Bitmap Triples encoding. Subjects point to adjacency lists of predicates and objects, each list paired with a bitsequence; we index the bitsequences to provide an SPO index.
  22. Remember… succinct data structures. Represent and index large volumes of data in roughly the theoretical minimum space while serving efficient operations. Mostly based on 3 operations: access, rank, select.
  23. Bit sequence coding. Operations in constant time: access(position) = value at that position; rank(position) = number of ones up to that position; select(i) = position where the i-th one occurs. Implementation: n + o(n) bits, with adjustable space overhead (in practice, 37.5% overhead). Example bitmap (positions 1-16): 1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 0; access(14) = 1, rank(9) = 4, select(4) = 9.
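A toy rank/select sketch, assuming a sampled rank directory as the o(n) extra space; practical implementations pack bits into machine words and use popcount, so this is illustrative only.

```python
# Illustrative bit sequence with 1-based access/rank/select, matching
# the slide's operations; rank uses a sampled directory (the o(n) part).
class BitSequence:
    SAMPLE = 8  # sampling rate for the rank directory

    def __init__(self, bits):
        self.bits = bits
        # rank_samples[k] = number of 1s in positions 1 .. k*SAMPLE
        self.rank_samples = [0]
        ones = 0
        for i, b in enumerate(bits, start=1):
            ones += b
            if i % self.SAMPLE == 0:
                self.rank_samples.append(ones)

    def access(self, pos):  # value at position pos
        return self.bits[pos - 1]

    def rank(self, pos):    # number of 1s in positions 1..pos
        k = pos // self.SAMPLE
        ones = self.rank_samples[k]
        for i in range(k * self.SAMPLE, pos):
            ones += self.bits[i]
        return ones

    def select(self, i):    # position of the i-th 1 (naive scan here)
        ones = 0
        for pos, b in enumerate(self.bits, start=1):
            ones += b
            if b and ones == i:
                return pos
        raise ValueError("fewer than i ones")

bs = BitSequence([1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
print(bs.access(14), bs.rank(9), bs.select(4))  # -> 1 4 9
```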
  24. Bitmap Triples encoding. Supported via the subject index: S P O, S P ?, S ? O, S ? ?, ? ? ?. E.g. retrieve (2,5,?): find the position of the second 1-bit in Bp (select); binary search the corresponding list of predicates for 5; note that predicate 5 is at position 4 of Sp; find the position of the fourth 1-bit in Bo (select) to locate its object list.
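The (2,5,?) walk can be played through on toy data. A sketch assuming end-of-list bitmaps (a 1 marks the last element of each adjacency list) and naive select; the ID values below are made up for illustration.

```python
# Illustrative Bitmap Triples lookup for (s, p, ?): Sp/Bp hold the
# predicate lists per subject, So/Bo the object lists per predicate
# occurrence; a 1-bit marks the END of each adjacency list.
def select1(bits, i):
    """1-based position of the i-th 1-bit (i == 0 -> position 0)."""
    ones = 0
    for pos, b in enumerate(bits, start=1):
        ones += b
        if b and ones == i:
            return pos
    return 0

def objects(s, p, Sp, Bp, So, Bo):
    # predicate list of subject s occupies Sp[lo:hi] (0-based, hi excl.)
    lo, hi = select1(Bp, s - 1), select1(Bp, s)
    for k in range(lo, hi):      # binary search in real HDT; linear here
        if Sp[k] == p:
            occ = k + 1          # (k+1)-th predicate occurrence overall
            olo, ohi = select1(Bo, occ - 1), select1(Bo, occ)
            return So[olo:ohi]
    return []

# Toy data: subject 1 -> predicates [2, 7]; subject 2 -> predicates [4, 5]
Sp = [2, 7, 4, 5]; Bp = [0, 1, 0, 1]
# one object list per predicate occurrence above
So = [3, 6, 1, 8, 9, 2]; Bo = [0, 1, 1, 0, 1, 1]
print(objects(2, 5, Sp, Bp, So, Bo))  # -> [2]
```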
  25. Additional predicate index (?P?), encoded as a position list (Bl, Sl). E.g. retrieve (?,4,?): get the location of the list, Bl.select(3)+1 = 4+1 = 5; retrieve the position Sl[5] = 1; get the associated subject, rank(1) = 1st subject; access the object as (1,4,?).
  26. Additional object index (??O, ?PO), likewise encoded as a position list (Bl, Sl) over the object occurrences.
  27. On-the-fly indexes: HDT-FoQ. From the exchanged HDT to the functional HDT-FoQ: publish and exchange HDT; at the consumer, build three indexes on the fly: (1) index the bitsequences -> subject index (SPO) -> patterns SPO, SP?, S??, S?O, ???; (2) index the position of each predicate (just a position list) -> predicate index (PSO) -> pattern ?P?; (3) index the position of each object (just a position list) -> object index (OPS) -> patterns ?PO, ??O.
  28. HDT acknowledged as a W3C Member Submission: http://www.w3.org/Submission/2011/03/
  29. Some numbers on size: 28,362,198,927 triples. http://dataweb.infor.uva.es/projects/hdt-mr/ José M. Giménez-García, Javier D. Fernández, and Miguel A. Martínez-Prieto. HDT-MR: A Scalable Solution for RDF Compression with HDT and MapReduce. In Proc. of the International Semantic Web Conference (ISWC), 2015.
  30. Results. Data is ready to be consumed 10-15x faster. HDT << any other RDF format || RDF engine. Competitive query performance: very fast on triple patterns, 1.5x faster than Virtuoso and RDF-3X. Integration with Jena: joins on the same scale as existing solutions (Virtuoso, RDF-3X).
  31. The rdfhdt.org community.
  32. HDT-cpp: https://github.com/rdfhdt, C++ and Java tools. Only in the last two weeks… (GitHub activity screenshot)
  33. A Linked Data hacker toolkit: 2) LOD Laundromat (http://lodlaundromat.org/), which uses HDT as the main storage solution. Challenges: you still need to query 650K datasets; of course it does not contain all of LOD, but "a good approximation". Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., & Schlobach, S. (2014). LOD Laundromat: A Uniform Way of Publishing Other People's Dirty Data. In ISWC (pp. 213-228).
  34. A Linked Data hacker toolkit: 3) Linked Data Fragments, which typically uses HDT as the main engine. Challenges: still room for optimization for complex federated queries (delays, intermediate results, …). Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., … & Van de Walle, R. (2014). Querying Datasets on the Web with High Availability. In ISWC (pp. 180-196).
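A Triple Pattern Fragments interface is plain HTTP, so a fragment can be fetched with any client. A hedged sketch using Python's requests against the public DBpedia TPF interface (the URL and its availability may have changed since the talk); the subject/predicate/object query parameters follow the TPF specification.

```python
# Hedged sketch: fetch one Triple Pattern Fragment page over HTTP.
import requests

FRAGMENT = "http://fragments.dbpedia.org/2015/en"  # may have moved
params = {"predicate": "http://dbpedia.org/ontology/birthPlace",
          "object": "http://dbpedia.org/resource/Vienna"}
resp = requests.get(FRAGMENT, params=params,
                    headers={"Accept": "text/turtle"})
print(resp.status_code)
print(resp.text[:500])  # first page: matching triples + count + paging links
```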
  35. Get more than 650K HDT datasets from LOD Laundromat…
  36. …and query them online.
  37. Application scenarios (HDT). Scalable storage: store and serve thousands of (large) datasets (e.g. LOD Laundromat); archiving (reduce storage costs + foster smart consumption), also with deltas (see https://aic.ai.wu.ac.at/qadlod/bear.html). Better consumer-centric publication: compress and share ready-to-consume RDF datasets. Consumption with limited resources: smartphones, standard laptops. Fast, low-cost SPARQL query engine: via HDT-Jena, via Linked Data Fragments.
  38. Application examples. Storage + light API: http://lodlaundromat.org/, http://linkeddatafragments.org/. Storage + SPARQL query engine: https://data.world/. Advanced features: top-k shortest paths, query answering over the Web of Data. Others: versioning, streaming, …
  39. Application examples. 91,498,351 compounds from PubChem with predicted logD (water-octanol distribution coefficient) values at 90% confidence level. Publication in HDT (~1B triples): https://zenodo.org/record/1116889#.WuBt0C7FKpo. URI resolver based on HDT: https://github.com/pharmbio/urisolve
  40. But what about Web-scale queries? E.g. retrieve all entities referring to the gene WBGene00000001 (aap-1). Solutions? select distinct ?x { ?x dcterms:title "WBGene00000001 (aap-1)" . }
  41. But what about Web-scale queries (flashback)? LOD-a-lot.
  42. Pipeline: Linked Open Data -(crawl)-> LOD Laundromat -(clean)-> datasets 1…650K as N-Triples (zip) -(index & store)-> SPARQL endpoint (metadata) -(integrate)-> LOD-a-lot, 28B triples.
  43. LOD-a-lot (some numbers). Disk size: HDT 304 GB; HDT-FoQ (additional indexes) 133 GB. Memory footprint (to query): 15.7 GB of RAM (3% of the size); 144 seconds loading time on 8 cores (2.6 GHz), 32 GB RAM, SATA HDD, Ubuntu 14.04.5 LTS. LDF page resolution in milliseconds. All for 305€ (LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ took 8 h & 250 GB RAM).
  44. LOD-a-lot: https://datahub.io/dataset/lod-a-lot http://purl.org/HDT/lod-a-lot
  45. LOD-a-lot (some use cases). Query resolution at Web scale, using LDF, Jena. Evaluation and benchmarking: no excuse! RDF metrics and analytics, e.g. distributions over subjects, predicates and objects.
  46. LOD-a-lot (some use cases). Identity closure: ?x owl:sameAs ?y. Graph navigations, e.g. shortest path, random walk. More use cases: Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
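The identity-closure use case reduces to a breadth-first traversal of owl:sameAs links over a single HDT file. A sketch assuming the "hdt" Python bindings and a placeholder file name for the LOD-a-lot HDT; the seed IRI is illustrative.

```python
# Hedged sketch: owl:sameAs closure over one HDT file via BFS,
# following links in both directions. "lod-a-lot.hdt" is a placeholder.
from hdt import HDTDocument

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
doc = HDTDocument("lod-a-lot.hdt")

def same_as_closure(seed):
    """All terms reachable from seed through owl:sameAs links."""
    seen, frontier = {seed}, [seed]
    while frontier:
        term = frontier.pop()
        out, _ = doc.search_triples(term, SAME_AS, "")   # outgoing links
        inc, _ = doc.search_triples("", SAME_AS, term)   # incoming links
        for s, p, o in list(out) + list(inc):
            other = o if s == term else s
            if other not in seen:
                seen.add(other)
                frontier.append(other)
    return seen

print(same_as_closure("http://dbpedia.org/resource/Vienna"))
```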
  47. Roadmap. Update LOD-a-lot regularly: more and newer datasets from the LOD Cloud. Leverage the HDT indexes to support "data science", e.g. get links across datasets, study the topology of the network, optimize query planning. Support provenance of the triples (i.e. the origin of each triple), currently supported only via LOD Laundromat. … and implement the use cases and help the community democratize access to LOD.
  48. Take-home messages. We are currently facing Big Linked Data challenges: generation, publication and consumption; archiving, evolution… Thanks to compression/HDT, the Big Linked Data of today will be the "pocket" data of tomorrow. HDT democratizes access to Big Linked Data = cheap, scalable consumers; low-cost access to LOD = high-impact research.
  49. ACKs.
  50. Thank you! javier.fernandez@wu.ac.at. Kudos to all the co-authors involved in the works presented here. Incomplete list of ACKs: Miguel A. Martínez-Prieto, Mario Arias, Pablo de la Fuente, Claudio Gutierrez, Axel Polleres, Wouter Beek, Ruben Verborgh… and many others.
