Compact Representation of Large RDF Data Sets for Publishing and Exchange

ISWC 2010 presentation.


  1. 1. Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez<br />Compact Representation of Large RDF Data Sets for Publishing and Exchange<br />
  2. 2. The Motivation<br /><ul><li>LARGE RDF data sets
  3. 3. Syntaxes oriented mainly to represent documents
  4. 4. RDF/XML, N3, Turtle, JSON, etc.
  5. 5. Document-centric vs. data-centric view
  6. 6. Redundancy
  7. 7. No structure (chunks)
  8. 8. Lack of metadata
  9. 9. Sequentiality of the information
  10. 10. Use?
  11. 11. Examples:
  12. 12. Billion Triple 2010 (~3200M triples, 318 gzipped chunks, ~27GB)
  13. 13. Uniprot (~845M triples, 12 gzipped chunks, ~23GB)</li></ul>Pag 2<br />Image: renjithkrishnan / FreeDigitalPhotos.net<br />
  14. 14. Real World example: Billion Triple 2010<br />Where is the metadata?<br />Who published this?<br />Do I have all the data?<br />[Figure: 318 gzipped RDF chunks; stages labeled PUBLICATION, EXCHANGE and basic operations]<br />Pag 3<br />
  15. 15. Needs<br />The aims of the format are: <br /><ul><li>Clean publication
  16. 16. Metadata
  17. 17. Compactness
  18. 18. Efficient exchange
  19. 19. RDF compression
  20. 20. Basic data operations </li></ul>Pag 4<br />Image: jscreationzs / FreeDigitalPhotos.net<br />
  21. 21. HDT Overview<br />HDT<br /><ul><li>Logical decomposition of RDF,
  22. 22. Philosophy of publication and exchange,
  23. 23. Compact RDF representation
  24. 24. based on 3 main components: Header, Dictionary and Triples</li></ul>Pag 5<br />
  25. 25. HDT Overview<br />Pag 6<br />
  26. 26. Header<br />Metadata information about the RDF collection<br /><ul><li>Wh-questions (what, who, where, how, etc.)
  27. 27. Source and provider information
  28. 28. Publication data
  29. 29. Data set statistics
  30. 30. Other information
  31. 31. Information required to retrieve and process the represented data
  32. 32. Location/s, format/s, encoding/s, etc.</li></ul>Pag 7<br />
  33. 33. Header use<br />[Figure: Header; 318 RDF chunks converted to HDT; Dictionary & Triples]<br />Pag 8<br />
  34. 34. Header in Practice<br />http://purl.org/HDT/hdt#<br />SWP<br />SCOVO, SDMX, hdt<br />Void, DublinCore, etc<br />hdt<br />Pag 9<br />
  35. 35. DBPedia example<br />Pag 10<br />
  36. 36. DBPedia Header<br />Pag 11<br />
  37. 37. DBPedia Header<br />Pag 12<br />
  38. 38. (Basic) HDT statistics<br /><ul><li>out-degree, deg−(s)
  39. 39. the number of triples of G in which s occurs as subject
  40. 40. maximum deg−(G) and mean deg−(G) over the graph (the mean is written with an overline)
  41. 41. partial out-degree, deg−−(s, p)
  42. 42. the number of triples of G in which s occurs as subject and p as predicate
  43. 43. maximum deg−−(G) and mean deg−−(G)
  44. 44. labeled out-degree, degL−(s)
  45. 45. the number of different predicates (labels) of G with which s is related as a subject in a triple of G
  46. 46. maximum degL−(G) and mean degL−(G)
  47. 47. subject-object ratio, αs−o
  48. 48. the proportion of common subjects and objects in the graph G
  49. 49. αs−o = |SG ∩ OG| / |SG ∪ OG|
  50. 50. Symmetrically, in-degrees: deg+(o), deg+(G), etc.</li></ul>Pag 13<br />
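The out-degree definitions above can be checked on a toy graph. The following is a minimal sketch, not the authors' code: the function name `hdt_statistics`, the dictionary keys and the four-triple graph are all hypothetical, and the graph-level values are taken to be the maximum and the mean over all subjects, which is an assumption about what deg−(G) denotes.

```python
from collections import defaultdict

def hdt_statistics(triples):
    """Compute the basic out-degree statistics for a set of (s, p, o) triples."""
    out_degree = defaultdict(int)    # deg-(s): triples with s as subject
    partial_out = defaultdict(int)   # deg--(s, p): triples with subject s and predicate p
    labels = defaultdict(set)        # distinct predicates seen for each subject
    subjects, objects = set(), set()
    for s, p, o in set(triples):     # an RDF graph is a set, so drop duplicates
        out_degree[s] += 1
        partial_out[(s, p)] += 1
        labels[s].add(p)
        subjects.add(s)
        objects.add(o)
    labeled_out = {s: len(ps) for s, ps in labels.items()}  # degL-(s)
    return {
        "out_degree": dict(out_degree),
        "partial_out_degree": dict(partial_out),
        "labeled_out_degree": labeled_out,
        # Assumption: graph-level statistics are the max and mean over subjects.
        "max_out_degree": max(out_degree.values()),
        "mean_out_degree": sum(out_degree.values()) / len(out_degree),
        # Subject-object ratio: |S ∩ O| / |S ∪ O|
        "subject_object_ratio": len(subjects & objects) / len(subjects | objects),
    }

# Hypothetical four-triple graph, not taken from the slides.
example = [("a", "p", "b"), ("a", "p", "c"), ("a", "q", "b"), ("b", "p", "c")]
stats = hdt_statistics(example)
print(stats)
```

In this toy graph only `b` occurs both as subject and as object, so the subject-object ratio is 1/3. The in-degree statistics follow by swapping the roles of subject and object.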
  51. 51. DBPedia example<br />out-degree(page1) = 4<br />partial out-degree(page1,#label) = 2<br />labeled out-degree(page1) = 3<br />in-degree(page3) = 2<br />partial in-degree(page3,#broader) = 2<br />labeled in-degree(page3) = 1<br />out-degree(page2) = 2<br />labeled out-degree(page2) = 2<br />Pag 14<br />
  52. 52. HDT<br />Pag 15<br />
  53. 53. Dictionary<br /><ul><li>In general terms, a data dictionary is a centralized repository of information about data.
  54. 54. Currently, in RDF formats:
  55. 55. namespaces and prefixes
  56. 56. Currently, in Triple Stores:
  57. 57. assigns a unique ID to each element in the data set</li></ul>Header<br />Dictionary<br />Pag 16<br />
  58. 58. Dictionary in Practice<br /><ul><li>Subset distinction:
  59. 59. (1) Common subject-objects
  60. 60. (2) Non-common subjects
  61. 61. (3) Non-common objects
  62. 62. (4) Predicates
  63. 63. List of strings matching the mapping of the four subsets, in order from (1) to (4).
  64. 64. A reserved character is appended to the end of each string and each vocabulary to delimit their size.</li></ul>Pag 17<br />
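The four-subset organization can be sketched as follows. This is an illustration under stated assumptions rather than the HDT serialization itself: the names `build_dictionary` and `SEP`, the choice of `"\x00"` as the reserved character, the lexicographic ordering inside each subset, and the ID ranges (shared terms first, predicates numbered separately) are all hypothetical.

```python
SEP = "\x00"  # assumption: the slides do not say which reserved character is used

def build_dictionary(triples):
    """Split terms into the four subsets and assign integer IDs starting at 1."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}

    shared = sorted(subjects & objects)        # (1) common subject-objects
    only_subjects = sorted(subjects - objects) # (2) non-common subjects
    only_objects = sorted(objects - subjects)  # (3) non-common objects
    preds = sorted(predicates)                 # (4) predicates

    # Assumption: shared terms take the lowest IDs in both the subject and
    # the object ID space, so a shared term has the same ID in either role.
    subject_ids = {t: i for i, t in enumerate(shared + only_subjects, start=1)}
    object_ids = {t: i for i, t in enumerate(shared + only_objects, start=1)}
    predicate_ids = {t: i for i, t in enumerate(preds, start=1)}

    def section(terms):
        # Each string ends with SEP, and one extra SEP closes the vocabulary.
        return "".join(term + SEP for term in terms) + SEP

    serialized = "".join(
        section(part) for part in (shared, only_subjects, only_objects, preds)
    )
    return subject_ids, predicate_ids, object_ids, serialized

# Hypothetical graph, not taken from the slides.
example = [("a", "p", "b"), ("a", "p", "c"), ("a", "q", "b"), ("b", "p", "c")]
subject_ids, predicate_ids, object_ids, serialized = build_dictionary(example)
print(subject_ids, predicate_ids, object_ids, repr(serialized))
```

Giving the shared terms the same ID in both roles is what lets a subject-object join compare integers directly instead of strings.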
  65. 65. Dictionary in Practice<br />Pag 18<br />
  66. 66. Dictionary in Practice. Header configuration<br />Pag 19<br />
  67. 67. HDT<br />Pag 20<br />
  68. 68. Triples<br /><ul><li>Contains the structure of the data after the ID replacement.</li></ul>Pag 21<br />
  69. 69. Compact Triples<br />Plain ID triples (subject predicate object):<br />1 2 6 .<br />1 3 2 .<br />2 1 3 .<br />2 2 4 .<br />2 2 5 .<br />2 4 1 .<br />3 3 2 .<br />Subject Grouping (subject 1, subject 2, subject 3):<br />1 2 6; 3 2 .<br />2 1 3; 2 4, 5; 4 1 .<br />3 3 2 .<br />Adjacency Lists Splitting:<br />Predicates: 2 3 0 1 2 4 0 3 0<br />Objects: 6 0 2 0 3 0 4 5 0 1 0 2 0<br />Pag 22<br />
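The two steps on this slide (subject grouping, then adjacency list splitting) can be sketched in a few lines. The function name `compact_triples` is hypothetical; the sketch assumes subject IDs are consecutive from 1, so that the subject is implied by the position of its list, and that 0 is the list separator, as the streams on the slide suggest.

```python
from itertools import groupby

def compact_triples(id_triples):
    """Turn (s, p, o) ID triples into the Predicates and Objects streams."""
    predicates, objects = [], []
    ordered = sorted(set(id_triples))  # subject, then predicate, then object order
    for _, by_subject in groupby(ordered, key=lambda t: t[0]):
        for predicate, by_predicate in groupby(by_subject, key=lambda t: t[1]):
            predicates.append(predicate)
            objects.extend(o for _, _, o in by_predicate)
            objects.append(0)          # 0 closes the object list of this (s, p) pair
        predicates.append(0)           # 0 closes the predicate list of this subject
    return predicates, objects

# The seven ID triples shown on the slide.
slide_triples = [
    (1, 2, 6), (1, 3, 2),
    (2, 1, 3), (2, 2, 4), (2, 2, 5), (2, 4, 1),
    (3, 3, 2),
]
predicate_stream, object_stream = compact_triples(slide_triples)
print(predicate_stream)
print(object_stream)
```

Subject IDs never appear in the output: the i-th predicate list belongs to subject i, which is where the space saving of this layout comes from.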
  70. 70. Bitmap Triples<br />Bitsequence-based reorganization of Compact Triples:<br />Compact Triples, Predicates: 2 3 0 1 2 4 0 3 0<br />Compact Triples, Objects: 6 0 2 0 3 0 4 5 0 1 0 2 0<br />Bitmap Triples, Predicates: Sp = 2 3 1 2 4 3, Bp = 0 0 1 0 0 0 1 0 1<br />Bitmap Triples, Objects: So = 6 2 3 4 5 1 2, Bo = 0 1 0 1 0 1 0 0 1 0 1 0 1<br />Pag 23<br />
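One reading of the bitsequence-based reorganization on this slide is that each Compact Triples stream is split into a sequence S holding the non-zero IDs and a bitmap B with one bit per stream position, set to 1 where a 0 separator stood. The sketch below follows that reading; it is an interpretation of the figure, and the function name `split_stream` is hypothetical.

```python
def split_stream(stream):
    """Split a 0-separated ID stream into (S, B).

    S keeps the IDs in order; B has one bit per position of the original
    stream, with 1 marking a separator (assumed layout, read off the slide).
    """
    values = [v for v in stream if v != 0]
    bits = [1 if v == 0 else 0 for v in stream]
    return values, bits

# Compact Triples streams from the slides.
predicate_stream = [2, 3, 0, 1, 2, 4, 0, 3, 0]
object_stream = [6, 0, 2, 0, 3, 0, 4, 5, 0, 1, 0, 2, 0]

Sp, Bp = split_stream(predicate_stream)
So, Bo = split_stream(object_stream)
print(Sp, Bp)
print(So, Bo)
```

Separating the IDs from the list boundaries means the two parts can be encoded differently, which is what the later compression slide relies on when it lists one technique for S and another for B.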
  71. 71. HDT Operations Over Bitmap Triples<br /><ul><li>The Bitmap Triples representation allows an on-demand loading strategy
  72. 72. that takes advantage of the structure indexed in Bp and Bo,
  73. 73. accessible by fast rank/select operations.
  74. 74. Given a sequence S of length n drawn from an alphabet Σ = {0,1}:
  75. 75. rank_a(S,i): counts the occurrences of a symbol a ∈ {0,1} in S[1,i].
  76. 76. select_a(S,i): finds the i-th occurrence of symbol a ∈ {0,1} in S.</li></ul>Pag 24<br />
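The two primitives can be written directly from their definitions. This is a naive linear-time sketch for illustration only; practical implementations use succinct bitmap structures that answer both queries much faster. Positions are 1-based to match S[1,i], and returning 0 when the i-th occurrence does not exist is a convention chosen here, not one stated on the slides.

```python
def rank(bits, symbol, i):
    """Count the occurrences of `symbol` in bits[1..i] (1-based, inclusive)."""
    return sum(1 for b in bits[:i] if b == symbol)

def select(bits, symbol, i):
    """Return the 1-based position of the i-th occurrence of `symbol`.

    Returns 0 if there are fewer than i occurrences (convention assumed here).
    """
    seen = 0
    for position, b in enumerate(bits, start=1):
        if b == symbol:
            seen += 1
            if seen == i:
                return position
    return 0

# Nine-bit example sequence.
B = [0, 0, 1, 0, 0, 0, 1, 0, 1]
print(rank(B, 1, 7), select(B, 1, 2))
print(rank(B, 0, 3), select(B, 0, 3))
```

The two operations are inverses in the sense that `rank(B, a, select(B, a, i)) == i` whenever the i-th occurrence exists.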
  77. 77. HDT Operations Over Bitmap Triples<br />Algorithm 1. Check&Find operation for a triple (s,p,o).<br />The distribution of lists ensures an average cost in O(degL−(G) + deg−−(G))<br />Pag 25<br />
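Algorithm 1 itself is not reproduced in this transcript, so the following is only a sketch of how such a lookup could work, not the authors' algorithm. It assumes the layout read off the Bitmap Triples slide (a 1 bit in B for every list separator of the original stream) and uses hypothetical names `select1`, `list_bounds`, `find_objects` and `check`. The predicate list of subject s is scanned first, then the object list of the matching predicate, which is consistent with the stated average cost in terms of the labeled and partial out-degrees.

```python
def select1(bits, i):
    """1-based position of the i-th 1 bit; select1(bits, 0) is defined as 0."""
    if i == 0:
        return 0
    seen = 0
    for position, b in enumerate(bits, start=1):
        seen += b
        if b == 1 and seen == i:
            return position
    raise ValueError("fewer than i separators in the bitmap")

def list_bounds(bits, j):
    """Half-open range [lo, hi) in S occupied by the j-th list (j is 1-based)."""
    return select1(bits, j - 1) - (j - 1), select1(bits, j) - j

def find_objects(s, p, Sp, Bp, So, Bo):
    """Objects of pattern (s, p, ?o); an empty list if the pair does not occur."""
    if not 1 <= s <= sum(Bp):              # unknown subject ID
        return []
    lo, hi = list_bounds(Bp, s)            # predicate list of subject s
    for k in range(lo, hi):                # about degL-(s) steps
        if Sp[k] == p:
            olo, ohi = list_bounds(Bo, k + 1)  # (k+1)-th object list overall
            return So[olo:ohi]
    return []

def check(s, p, o, Sp, Bp, So, Bo):
    """True if the triple (s, p, o) is present; scans about deg--(s, p) objects."""
    return o in find_objects(s, p, Sp, Bp, So, Bo)

# Bitmap Triples for the seven example triples of the Compact Triples slide.
Sp, Bp = [2, 3, 1, 2, 4, 3], [0, 0, 1, 0, 0, 0, 1, 0, 1]
So, Bo = [6, 2, 3, 4, 5, 1, 2], [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
print(check(2, 2, 5, Sp, Bp, So, Bo), find_objects(2, 2, Sp, Bp, So, Bo))
```

Only the two lists touched by the query are read, which is what makes an on-demand loading strategy possible: the rest of S and B never has to be decoded.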
  78. 78. Triples in Practice. Header configuration<br />Pag 26<br />
  79. 79. HDT Bitmap Triples Compression (for exchange)<br />HDT-Plain<br />HDT-Compress<br />Text compression:<br />gzip, bz2, PPM<br />Specific compression:<br />Huffman (S), RRR (B)<br />Pag 27<br />
  80. 80. HDT Bitmap Triples Compression Results<br />Pag 28<br />
  81. 81. HDT Bitmap Triples Compression Results<br />Uniprot<br />Pag 29<br />
  82. 82. HDT And SPARQL<br /><ul><li>SPARQL can make use of some interesting features in HDT:
  83. 83. Subject-object JOIN resolution can profit from the common naming in the dictionary, as the shared elements are quickly located in the top IDs.
  84. 84. Algorithm 1 can answer basic SPARQL ASK queries for the patterns (s,p,o), (s,?p,?o) and (s,p,?o).
  85. 85. Algorithm 1 can answer basic SPARQL CONSTRUCT queries for the simple WHERE patterns (s,p,o), (s,?p,?o) and (s,p,?o). The result is an RDF HDT graph.</li></ul>Note: The S-P-O Adjacency List order is assumed. Algorithm 1 and the patterns it can answer vary for the alternative representations S-O-P AL, P-S-O, P-O-S, O-P-S AL and O-S-P AL.<br />Pag 30<br />
  86. 86. Conclusions<br /><ul><li>RDF publication and exchange at large scale are seriously compromised by the scalability drawbacks of current RDF formats
  87. 87. lack of structure, metadata information and native operations over the data
  88. 88. HDT addresses these problems (producer, consumer)
  89. 89. Header, Dictionary, and Triples
  90. 90. Triples practical implementation
  91. 91. Compact (HDT-Plain)
  92. 92. Compress (HDT-Compress, outperforms universal compressors)
  93. 93. Check&Find (indexed access)</li></ul>Pag 31<br />
  94. 94. Future Work<br /><ul><li>Optimize prototype
  95. 95. Open-source (soon at http://hdt.dcc.uchile.cl/)
  96. 96. RDF native storage
  97. 97. Dynamic structures on secondary memory
  98. 98. Solve SPARQL joins
  99. 99. Multi-Index (size tradeoff)
  100. 100. Extensions
  101. 101. N-Quads
  102. 102. SPARQL endpoints</li></ul>Pag 32<br />
  103. 103. W3C Member Submission<br />Pag 33<br />Image: renjithkrishnan / FreeDigitalPhotos.net<br />
  104. 104. Thanks for your attention.<br />http://hdt.dcc.uchile.cl/<br />rdfhdt@gmail.com<br />Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez<br />
