Exchange and Consumption of Huge RDF Data

  1. Exchange and Consumption of Huge RDF Data
     Digital Enterprise Research Institute, www.deri.ie
     Miguel A. Martínez-Prieto (1,2) <migumar2@infor.uva.es>
     Mario Arias (1,3) <mario.arias@deri.org>
     Javier D. Fernández (1,2) <jfergar@infor.uva.es>
     (1) Department of Computer Science, Universidad de Valladolid (Spain)
     (2) Department of Computer Science, Universidad de Chile (Chile)
     (3) Digital Enterprise Research Institute, National University of Ireland Galway
     Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
  2. Sharing RDF in the Web of Data.
     How RDF is shared: dereferenceable URIs, RDF dump, SPARQL Endpoints / APIs.
     Diagram labels: Parsing / Indexing, Reasoning.
     • Dataset analysis.
     • Set up a SPARQL server.
     • Vocabulary interlinking / integration.
     • Browsing and visualization.
     • Exchange between servers.
     • Data-intensive tasks.
  3. Dataset Exchange Workflow
     1º Publication: Convert, Serialize, Compress.
     2º Exchange: Transfer.
     3º Consumption: Decompress, Parse, Index.
     If RDF is meant to be machine processable, why are we using plain-text serialization formats?
  4. HDT: RDF Binary Format
     • Compact data structure for RDF.
     • W3C Submission. http://www.w3.org/Submission/2011/03/
     • Open-source C++/Java library.
  5. HDT Focused on Querying (FoQ)
     • Contributions of this paper:
       - A complementary index that makes HDT fully queryable.
       - Analysis of how HDT reduces exchange and indexing time.
       - Evaluation of in-memory query performance.
  6. Dictionary
     • Maps strings to correlative IDs {1..n}.
     • Lexicographically sorted, no duplicates.
     • Section compression is explained in [8].
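A minimal Python sketch of the dictionary idea on this slide, assuming plain in-memory lists: distinct terms are sorted lexicographically and receive consecutive IDs 1..n. The function names (`build_dictionary`, `string_to_id`, `id_to_string`) and the sample terms are illustrative assumptions, not the HDT library API, and the section compression from [8] is not modeled.

```python
from bisect import bisect_left

def build_dictionary(terms):
    """Return the sorted, duplicate-free term list; ID i maps to entry i-1."""
    return sorted(set(terms))

def string_to_id(dictionary, term):
    """Binary-search the sorted list. IDs start at 1; 0 means 'not found'."""
    i = bisect_left(dictionary, term)
    if i < len(dictionary) and dictionary[i] == term:
        return i + 1
    return 0

def id_to_string(dictionary, term_id):
    """Inverse mapping: a direct array lookup."""
    return dictionary[term_id - 1]

# Hypothetical sample terms, with one duplicate on purpose.
terms = ["ex:bob", "ex:alice", "foaf:knows", "ex:alice"]
d = build_dictionary(terms)
print(string_to_id(d, "ex:bob"), id_to_string(d, 3))
```

Because the list is sorted, string-to-ID lookup is a binary search and ID-to-string lookup is an index access; no second hash table is needed.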
  7. Triples Model
     Triples (S P O as ID digits): 126, 132, 213, 224, 225, 241, 332
     S:  1           2                3
     P: [2   3]     [1   2     4]    [3]
     O: [6] [2]     [3] [4 5] [1]    [2]
  8. Adjacency Lists
     Lists 1, 2, 3: [2, 3] [1, 2, 4] [3]
     Position: 1 2 3 4 5 6
     Array:    2 3 1 2 4 3
     Bitmap:   1 0 1 0 0 1
     • Operations:
       - access(g): given a global position, get the value. O(1)
       - findList(g): given a global position, get the list number. O(1)
       - first(l), last(l): given a list, find its first and last positions. O(log log n)
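The array-plus-bitmap encoding above can be sketched in Python as follows. This is an illustration under stated assumptions: the class name `AdjacencyList` is made up, a 1-bit marks the first element of each list (as in the slide's example), and plain lists with binary search stand in for succinct rank/select structures, so the complexity bounds quoted on the slide do not apply to this code.

```python
from bisect import bisect_right

class AdjacencyList:
    """Lists stored as one flat array plus a bitmap marking list starts.

    Positions and list numbers are 1-based, as on the slide.
    """

    def __init__(self, array, bitmap):
        self.array = array
        self.bitmap = bitmap
        # 1-based positions of the 1-bits, i.e. where each list begins.
        self.starts = [i + 1 for i, b in enumerate(bitmap) if b == 1]

    def access(self, g):
        """Value stored at global position g."""
        return self.array[g - 1]

    def find_list(self, g):
        """Number of list starts at or before g (a rank on the bitmap)."""
        return bisect_right(self.starts, g)

    def first(self, l):
        """Global position of the first element of list l."""
        return self.starts[l - 1]

    def last(self, l):
        """Global position of the last element of list l."""
        if l < len(self.starts):
            return self.starts[l] - 1
        return len(self.array)

y = AdjacencyList([2, 3, 1, 2, 4, 3], [1, 0, 1, 0, 0, 1])
print(y.access(4), y.find_list(4), y.first(2), y.last(2))
```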
  9. Triples Model and Coding
     Triples (S P O as ID digits): 126, 132, 213, 224, 225, 241, 332
     S: 1 2 3
     P: 2 3 1 2 4 3
     O: 6 2 3 4 5 1 2
     Array Y:  2 3 1 2 4 3
     Bitmap Y: 1 0 1 0 0 1
     Array Z:  6 2 3 4 5 1 2
     Bitmap Z: 1 1 1 1 0 1 1
  10. Searching by Subject
     Example pattern: ( 2, 2, ? )
     Patterns covered: SPO, SP?, S??, S?O.
     Triples, Array Y / Bitmap Y and Array Z / Bitmap Z as on slide 9.
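A subject-bound lookup over these arrays can be sketched as below. The helper names (`list_starts`, `list_range`, `search_subject`) are assumptions made for illustration; the arrays are the ones from slide 9, with a 1-bit marking the first element of each list. The idea is that list number s of Y holds the predicates of subject s, and the list of Z with the same number as a Y position holds the objects of that (subject, predicate) pair.

```python
def list_starts(bitmap):
    """1-based positions of the 1-bits (the first element of each list)."""
    return [i + 1 for i, b in enumerate(bitmap) if b == 1]

def list_range(starts, total, l):
    """1-based inclusive (first, last) positions of list number l."""
    first = starts[l - 1]
    last = starts[l] - 1 if l < len(starts) else total
    return first, last

ARRAY_Y, BITMAP_Y = [2, 3, 1, 2, 4, 3], [1, 0, 1, 0, 0, 1]
ARRAY_Z, BITMAP_Z = [6, 2, 3, 4, 5, 1, 2], [1, 1, 1, 1, 0, 1, 1]
STARTS_Y, STARTS_Z = list_starts(BITMAP_Y), list_starts(BITMAP_Z)

def search_subject(s, p=None):
    """Resolve (s, ?, ?) or, when p is given, (s, p, ?)."""
    results = []
    y_first, y_last = list_range(STARTS_Y, len(ARRAY_Y), s)
    for y_pos in range(y_first, y_last + 1):
        pred = ARRAY_Y[y_pos - 1]
        if p is not None and pred != p:
            continue
        # The y_pos-th list of Z holds the objects of this (s, pred) pair.
        z_first, z_last = list_range(STARTS_Z, len(ARRAY_Z), y_pos)
        for z_pos in range(z_first, z_last + 1):
            results.append((s, pred, ARRAY_Z[z_pos - 1]))
    return results

print(search_subject(2, 2))
```

For the slide's pattern ( 2, 2, ? ) this walks only the predicate list of subject 2, then the single object list attached to predicate 2. A production implementation would binary-search the sorted predicate list instead of scanning it.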
  11. Searching by Predicate
     Example pattern: ( ?, 2, ? )
     Pattern covered: ?P?.
     Triples, Array Y / Bitmap Y and Array Z / Bitmap Z as on slide 9.
  12. Wavelet Tree
     • Compact sequence of integers {0,σ}.
     Example sequence (positions 1 to 16): 3 6 3 6 1 2 1 3 6 2 5 2 4 1 4 2
     rank(3, 7) = 2
     select(6, 3) = 9
     • Operations, each O(log σ):
       - access(position): value at position.
       - rank(entry, position): number of appearances of "entry" up to "position".
       - select(entry, i): position where "entry" appears for the i-th time.
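To make the three operations concrete, here is a small pointer-based wavelet tree in Python. It is a teaching sketch, not the compact structure used by HDT: the class name `WaveletTree` is an assumption, bitmaps are plain lists scanned linearly, and the test sequence is one reading of the slide's example that agrees with both stated results, rank(3, 7) = 2 and select(6, 3) = 9.

```python
class WaveletTree:
    """Wavelet tree over a non-empty sequence of integers in [lo, hi].

    Positions are 1-based to match the slide. Each node splits its value
    range in half and stores one bit per element: 0 = lower half, 1 = upper.
    """

    def __init__(self, seq, lo=None, hi=None):
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.n = len(seq)
        self.left = self.right = None
        self.bits = []
        if self.lo == self.hi or not seq:
            return  # leaf, or an empty inner node
        mid = (self.lo + self.hi) // 2
        self.bits = [0 if v <= mid else 1 for v in seq]
        self.left = WaveletTree([v for v in seq if v <= mid], self.lo, mid)
        self.right = WaveletTree([v for v in seq if v > mid], mid + 1, self.hi)

    def access(self, pos):
        """Value at position pos."""
        node = self
        while node.lo != node.hi:
            bit = node.bits[pos - 1]
            # Equal bits up to pos give the position inside the child.
            pos = node.bits[:pos].count(bit)
            node = node.left if bit == 0 else node.right
        return node.lo

    def rank(self, value, pos):
        """Occurrences of value in positions 1..pos (value within [lo, hi])."""
        node = self
        while node.lo != node.hi and pos > 0:
            mid = (node.lo + node.hi) // 2
            bit = 0 if value <= mid else 1
            pos = node.bits[:pos].count(bit)
            node = node.left if bit == 0 else node.right
        return pos

    def select(self, value, i):
        """Position of the i-th occurrence of value, or 0 if there is none."""
        if self.lo == self.hi:
            return i if 1 <= i <= self.n else 0
        mid = (self.lo + self.hi) // 2
        bit = 0 if value <= mid else 1
        child = self.left if bit == 0 else self.right
        pos_in_child = child.select(value, i) if child is not None else 0
        if pos_in_child == 0:
            return 0
        # Map back up: locate the pos_in_child-th occurrence of `bit` here.
        seen = 0
        for idx, b in enumerate(self.bits, start=1):
            if b == bit:
                seen += 1
                if seen == pos_in_child:
                    return idx
        return 0

SEQ = [3, 6, 3, 6, 1, 2, 1, 3, 6, 2, 5, 2, 4, 1, 4, 2]
wt = WaveletTree(SEQ)
print(wt.rank(3, 7), wt.select(6, 3), wt.access(9))
```

Each operation visits one node per tree level, which is where the O(log σ) bound comes from once the per-node bitmap counting is replaced by constant-time rank/select.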
  13. Searching by Predicate w/ Wavelet
     Example pattern: ( ?, 2, ? )
     Pattern covered: ?P?.
     Same structures as on slide 9, with Array Y replaced by Wavelet Y (2 3 1 2 4 3).
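The predicate-bound search can be sketched like this. The function `occurrences` is a naive stand-in (an assumption for brevity) for repeated `select(p, i)` calls on the wavelet tree over Y; the other names are likewise illustrative. Each occurrence of the predicate in Y yields its subject through a rank on Bitmap Y and its objects through the matching list of Z.

```python
from bisect import bisect_right

ARRAY_Y, BITMAP_Y = [2, 3, 1, 2, 4, 3], [1, 0, 1, 0, 0, 1]
ARRAY_Z, BITMAP_Z = [6, 2, 3, 4, 5, 1, 2], [1, 1, 1, 1, 0, 1, 1]

def list_starts(bitmap):
    """1-based positions of the 1-bits (the first element of each list)."""
    return [i + 1 for i, b in enumerate(bitmap) if b == 1]

STARTS_Y, STARTS_Z = list_starts(BITMAP_Y), list_starts(BITMAP_Z)

def occurrences(seq, value):
    """Stand-in for wavelet-tree select: every 1-based position of value."""
    return [i + 1 for i, v in enumerate(seq) if v == value]

def search_predicate(p):
    """Resolve (?, p, ?)."""
    results = []
    for y_pos in occurrences(ARRAY_Y, p):
        # findList on Bitmap Y: the list number is the subject ID.
        s = bisect_right(STARTS_Y, y_pos)
        # The y_pos-th list of Z holds the objects of this (s, p) pair.
        z_first = STARTS_Z[y_pos - 1]
        z_last = STARTS_Z[y_pos] - 1 if y_pos < len(STARTS_Z) else len(ARRAY_Z)
        for z_pos in range(z_first, z_last + 1):
            results.append((s, p, ARRAY_Z[z_pos - 1]))
    return results

print(search_predicate(2))
```

Without select, finding every occurrence of a predicate means scanning all of Y, which is what the naive stand-in does; the wavelet tree lets the search jump directly from one occurrence to the next.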
  14. Triples: Object-Search
     Example pattern: ( ?, ?, 2 )
     Patterns covered: ??O, ?PO.
     Triples (S P O as ID digits): 126, 132, 213, 224, 225, 241, 332
     S: 1 2 3
     P: 2 3 1 2 4 3
     O: 6 2 3 4 5 1 2
     OP-Index: O1 [6]  O2 [2 7]  O3 [3]  O4 [4]  O5 [5]  O6 [1]
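One way to read the OP-Index lists on this slide is as the positions in Array Z where each object occurs; that reading reproduces the slide's values exactly and is the assumption used in the sketch below. The function names are illustrative, and the ?PO case simply filters by predicate, whereas a real index could order each list by predicate to search it faster.

```python
from bisect import bisect_right

ARRAY_Y, BITMAP_Y = [2, 3, 1, 2, 4, 3], [1, 0, 1, 0, 0, 1]
ARRAY_Z, BITMAP_Z = [6, 2, 3, 4, 5, 1, 2], [1, 1, 1, 1, 0, 1, 1]

def list_starts(bitmap):
    """1-based positions of the 1-bits (the first element of each list)."""
    return [i + 1 for i, b in enumerate(bitmap) if b == 1]

STARTS_Y, STARTS_Z = list_starts(BITMAP_Y), list_starts(BITMAP_Z)

def build_op_index(array_z):
    """Map each object ID to the 1-based positions where it occurs in Z."""
    index = {}
    for z_pos, obj in enumerate(array_z, start=1):
        index.setdefault(obj, []).append(z_pos)
    return index

OP_INDEX = build_op_index(ARRAY_Z)

def search_object(o, p=None):
    """Resolve (?, ?, o) or, when p is given, (?, p, o)."""
    results = []
    for z_pos in OP_INDEX.get(o, []):
        # Walk up the tree: Z position -> Y position -> subject.
        y_pos = bisect_right(STARTS_Z, z_pos)
        s = bisect_right(STARTS_Y, y_pos)
        pred = ARRAY_Y[y_pos - 1]
        if p is None or pred == p:
            results.append((s, pred, o))
    return results

print(search_object(2))
```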
  15. Data Structure Summary.
     • From HDT to HDT-FoQ:
       - Convert Array Y to a wavelet tree.
       - Generate the OP-Index.
     • Triple patterns and the structure that resolves them:
       | Triple patterns    | Structure    |
       |--------------------|--------------|
       | SPO, SP?, S??, S?O | Original HDT |
       | ?P?                | Wavelet Tree |
       | ?PO, ??O           | OP-Index     |
  16. Evaluation Environment
     Datasets:
     | Dataset   | Triples | Size (NTriples) |
     |-----------|---------|-----------------|
     | LinkedMDB | 6.1M    | 850 MB          |
     | DBLP      | 73M     | 11.1 GB         |
     | Geonames  | 112M    | 12.3 GB         |
     | DBPedia   | 258M    | 37.3 GB         |
     Producer: Xeon @ 2.4 GHz, 96 GB RAM.
     Consumer: Phenom-II @ 3.2 GHz, 8 GB RAM.
     Compressors: GZIP, LZMA.
     RDF storage: Virtuoso, RDF-3x, Hexastore.
  17. Compression Ratio
     [Bar chart: compression ratio (% against plain NTriples) for DBPedia, Geonames, DBLP and LinkedMDB; series: hdt, gz, lzma, hdt.gz, hdt.lzma; axis from 0 to 14.]
  18. Publication Times
     | Dataset   | NT+GZIP  | NT+LZMA  | HDT      | HDT+GZIP | HDT+LZMA |
     |-----------|----------|----------|----------|----------|----------|
     | LinkedMDB | 11.3 sec | 14.7 min | 1.05 min | 1.09 min | 1.52 min |
     | DBLP      | 2.72 min | 103 min  | 12 min   | 13.5 min | 21.9 min |
     | Geonames  | 3.28 min | 244 min  | 25 min   | 26.4 min | 38.9 min |
     | DBPedia   | 15.9 min | 466 min  | 56 min   | 60 min   | 121 min  |
     [Bar chart: times slower than NTriples+GZIP per dataset; series: gz, lzma, hdt, hdt.gz, hdt.lzma; axis from 0 to 80.]
  19. Publication Times (2)
     Same table as slide 18.
     [Bar chart: times slower than NTriples+GZIP per dataset, without the lzma series; series: gz, hdt, hdt.gz, hdt.lzma; axis from 0 to 13.]
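The "times slower than NTriples+GZIP" figures plotted in these two charts can be recomputed from the publication-times table. The sketch below does that arithmetic; the dictionary layout and the function name `slowdown` are assumptions, and the values are the table entries converted to seconds.

```python
MIN = 60.0  # seconds per minute

# Publication times from the table, in seconds.
TIMES = {
    "LinkedMDB": {"NT+GZIP": 11.3, "NT+LZMA": 14.7 * MIN, "HDT": 1.05 * MIN,
                  "HDT+GZIP": 1.09 * MIN, "HDT+LZMA": 1.52 * MIN},
    "DBLP": {"NT+GZIP": 2.72 * MIN, "NT+LZMA": 103 * MIN, "HDT": 12 * MIN,
             "HDT+GZIP": 13.5 * MIN, "HDT+LZMA": 21.9 * MIN},
    "Geonames": {"NT+GZIP": 3.28 * MIN, "NT+LZMA": 244 * MIN, "HDT": 25 * MIN,
                 "HDT+GZIP": 26.4 * MIN, "HDT+LZMA": 38.9 * MIN},
    "DBPedia": {"NT+GZIP": 15.9 * MIN, "NT+LZMA": 466 * MIN, "HDT": 56 * MIN,
                "HDT+GZIP": 60 * MIN, "HDT+LZMA": 121 * MIN},
}

def slowdown(dataset, method):
    """How many times slower `method` is than NT+GZIP for `dataset`."""
    row = TIMES[dataset]
    return row[method] / row["NT+GZIP"]

for dataset, row in TIMES.items():
    print(dataset, {m: round(slowdown(dataset, m), 1) for m in row})
```

Note that the LinkedMDB baseline is given in seconds while every other entry is in minutes, so the conversion to a common unit matters before dividing.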
  20. Exchange & Decompression Time
     [Stacked bar chart: exchange and decompression time in seconds (geometric mean of all datasets) for GZIP, LZMA, HDT+GZIP and HDT+LZMA; axis from 0 to 300.]
     * Assuming a network bandwidth of 2 MByte/s.
  21. Overall Client Time
     | Dataset   | LZMA+RDF3x | HDT+LZMA |
     |-----------|------------|----------|
     | linkedMDB | 2.1 min    | 9.21 sec |
     | dblp      | 27 min     | 2.02 min |
     | geonames  | 49.2 min   | 3.04 min |
     | dbpedia   | 159 min    | 14.3 min |
     [Stacked bar chart: exchange, decompress and index time in seconds (geometric mean of all datasets) for LZMA+Virtuoso, GZ+Virtuoso, LZMA+RDF3x, GZ+RDF3x, HDT+LZMA+FOQ and HDT+GZIP+FOQ; axis from 0 to 3600.]
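As a worked check of the client-side numbers, the sketch below divides the two time columns of this slide and takes the geometric mean of the ratios. It assumes the first column is LZMA+RDF3x and the second is HDT+LZMA, which is how the flattened table reads; the variable names are illustrative.

```python
import math

# (LZMA+RDF3x, HDT+LZMA) overall client times in seconds, read from the table.
CLIENT_TIMES = {
    "linkedMDB": (2.1 * 60, 9.21),
    "dblp": (27 * 60, 2.02 * 60),
    "geonames": (49.2 * 60, 3.04 * 60),
    "dbpedia": (159 * 60, 14.3 * 60),
}

# Speedup of the HDT pipeline over the RDF3x pipeline, per dataset.
SPEEDUPS = {name: base / hdt for name, (base, hdt) in CLIENT_TIMES.items()}

# Geometric mean: the n-th root of the product of the n ratios.
GEO_MEAN = math.prod(SPEEDUPS.values()) ** (1 / len(SPEEDUPS))

for name, ratio in SPEEDUPS.items():
    print(f"{name}: {ratio:.1f}x")
print(f"geometric mean: {GEO_MEAN:.1f}x")
```

Under this reading the per-dataset ratios fall between roughly 11x and 16x, in line with the "10-15x faster" figure in the conclusions.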
  22. In-memory Data Store.
     Index size (MB):
     | Dataset   | Triples | Virtuoso | Hexastore | RDF3x | HDT-FoQ |
     |-----------|---------|----------|-----------|-------|---------|
     | LinkedMDB | 6.1M    | 518      | 6976      | 337   | 68      |
     | DBLP      | 46M     | 3982     | -         | 3252  | 850     |
     | Geonames  | 112M    | 9216     | -         | 6678  | 1435    |
     | DBPedia   | 258M    | -        | -         | 15802 | 5260    |
     • Smaller size = more data in memory = less I/O access!
  23. Query Performance, Triple Patterns
     [Two bar charts (LinkedMDB, Geonames): times HDT is faster than RDF-3x and Virtuoso for the triple patterns SP?, S?O, S??, ?PO, ?P?, ??O; axis from 0 to 16.]
  24. Query Performance, Two-way Joins
     [Two bar charts (LinkedMDB, Geonames): times HDT is faster than RDF-3x and Virtuoso for the join types SSbig, SSsmall, SObig, SOsmall, OObig, OOsmall; axis from 0 to 3.]
  25. Conclusions
     • Data is ready to be consumed 10-15x faster.
       - Exchange time reduced.
       - Indexing burden on server = lightweight client processing.
     • Competitive query performance.
       - Very fast on triple patterns.
       - Joins on the same scale as existing solutions.
     • This is useful to you:
       - If you need a fast, compact, read-only in-memory RDF store.
       - If you want to share self-queryable RDF dumps.
       - If you need fast download & query.
     • Addresses the volume issue of Big Data.
  26. Future work.
     • Full SPARQL support.
       - UNION, OPTIONAL, multiple joins.
       - Optimized query evaluation.
     • Integration:
       - Jena, Any23…
     • Diffusion.
       - Get more people to use it!
     • Additional services on top of HDT.
       - SPARQL Endpoint.
       - Distributed stream processing.
       - Mobile applications.
  27. Thanks! http://www.rdf-hdt.org

Editor's Notes

  1. Importance of exchange.
     - The Web is for exchanging data. Data flows between nodes.
     - We are in the "Big Data era". We need speed, from the network to the application layers.
     - Role of providers / consumers.
     - Consumption =~ querying.
     How data is shared:
     - Dereferenceable URIs.
     - SPARQL Endpoints.
     - Big datasets: RDF dump (similar to XML, PDF).
     Examples where RDF dumps are important:
     - Set up a mirror.
     - Overloaded SPARQL server.
     - Data analysis.
     - Vocabulary integration.
     - Download instead of crawl.
     - Visualization.
     Opens new applications:
     - Processing intensive.
     - Cooperating applications.
  2. Triples are sorted component by component. We represent them in a tree:
     - Each level represents S, P, O.
     - Each path / leaf node represents one triple.
     How we encode the tree for (1) space and (2) traversal:
     - Level by level. S is implicit; P and O are arrays.
     - Relations are shown with brackets and encoded with a bitmap.
  3. CPUs are fast; memory and bandwidth are precious. Variable-length. Compression. Compact in-memory representations.
  9. Datasets. Servers. Data stores. Compiler. Compressors: GZIP, LZMA.
  10. From NTRIPLES to XXX. From a data store it could be faster (already sorted).
  11. Includes dictionary! Great for mobile.