Short and Long-Tail RDF Analytics for
Massive Webs of Data
Marcin Wylot, Jigé Pont, Mariusz Wiśniewski,
and Philippe Cudré...
Motivation

● increasingly large semantic/LoD data sets
● increasingly complex queries
○ real time analytic queries
■ like...
3 recipes to speed-up

○ collocation
○ collocation
○ collocation
Why collocation??
Because by collocating data together we
can reduce IO operations, which are
one of the biggest bottlenec...
Outline
● architecture
● main idea
● data structures
● basic operations (inserts, queries)
● evaluation & results
● future...
System Architecture
Main Idea - Hybrid Storage
Main Idea - data structures
Declarative Templates
Template Matching
Molecule Clusters
● extremely compact sub-graphs
● precomputed joins
List of Literals
● extremely compact list of sorted values
Hash Table
lexicographic tree
to encode URIs

template based
indexing

extremely compact lists of
homologous nodes
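
To illustrate the lexicographic tree used to encode URIs, here is a minimal Python sketch of a character trie that maps each distinct URI to a compact integer key. The class name `UriTrie`, its methods, the example URIs, and the sequential key assignment are all assumptions made for this example; the slides do not describe the actual encoding scheme.

```python
class UriTrie:
    """Toy lexicographic tree assigning an integer key to each distinct URI."""

    def __init__(self):
        self._root = {}       # nested dicts: one level per character
        self._next_key = 0    # hypothetical sequential key allocator

    def encode(self, uri):
        """Return the key for `uri`, inserting the URI if it is new."""
        node = self._root
        for ch in uri:
            node = node.setdefault(ch, {})
        # The empty string marks "a URI ends here"; it can never collide
        # with a child label, because child labels are single characters.
        if "" not in node:
            node[""] = self._next_key
            self._next_key += 1
        return node[""]

    def lookup(self, uri):
        """Return the key for `uri`, or None if it was never inserted."""
        node = self._root
        for ch in uri:
            node = node.get(ch)
            if node is None:
                return None
        return node.get("")


trie = UriTrie()
key_a = trie.encode("http://example.org/Student0")   # hypothetical URI
key_b = trie.encode("http://example.org/Student1")   # shares a long prefix
```

URIs that share a prefix share the corresponding path in the tree, which is one reason such a structure suits URI-heavy data.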
Basic operations - inserts
n-pass algorithm
Basic operations - queries - triple patterns
?x type Student.
?x takesCourse Course0.

?x type Student.
?x takesCourse Cou...
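
The slide transcript notes that these triple patterns are resolved by an intersection of sorted lists. A minimal Python sketch of that operation follows, assuming each triple pattern yields an ascending, duplicate-free list of integer subject keys. The function name and the key values are hypothetical.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of keys in O(len(a) + len(b))."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1      # a[i] cannot appear later in b
        else:
            j += 1      # b[j] cannot appear later in a
    return result


# Hypothetical sorted key lists, one per triple pattern.
students = [2, 5, 7, 11, 13]       # ?x type Student
takes_course0 = [5, 6, 11, 20]     # ?x takesCourse Course0
takes_course1 = [1, 5, 11, 13]     # ?x takesCourse Course1

matches = intersect_sorted(
    intersect_sorted(students, takes_course0), takes_course1
)
```

Because every list is already sorted, each intersection is a single linear merge and needs no additional index.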
Basic operations - queries - molecule queries

?a name 'Student1'.
?a ?b ?c.
?c ?d ?e.
Basic operations - queries
aggregates and analytics
?x type Student.
?x age ?y
filter (?y < 21)
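
A filter such as `?y < 21` can be answered from a sorted list of literals with a binary search rather than a full scan. The Python sketch below assumes the list stores `(value, subject key)` pairs ordered by value; that layout, the sample data, and the function name are assumptions for illustration only.

```python
from bisect import bisect_left

# Hypothetical list of literals for the `age` attribute, sorted by value.
ages = [(18, 4), (19, 9), (20, 2), (21, 7), (23, 5), (35, 1)]
students = {2, 4, 5, 7}    # hypothetical keys matching ?x type Student


def subjects_below(sorted_literals, bound):
    """Return subject keys whose literal value is strictly below `bound`."""
    # (bound,) sorts before every (bound, key) tuple, so the cut excludes
    # all entries whose value equals `bound`.
    cut = bisect_left(sorted_literals, (bound,))
    return [key for _, key in sorted_literals[:cut]]


young_students = [k for k in subjects_below(ages, 21) if k in students]
```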
Performance Evaluation
We used the Lehigh University Benchmark.
We generated two datasets, for 10 and 100 Universities.
● ...
Results - LUBM - 10 Universities
Results - LUBM - 100 Universities
Results - analytic 10 Universities
Results - analytic 100 Universities
Future work
● open source
○ cleaning code
○ extending code
● parallelising operations
○ multi-core architecture
○ cloud
● ...
Conclusions
● advanced data collocation
○ molecules, RDF sub-graphs
○ lists of literals, compact sorted list of values
○ h...
Thank you for
your attention
Update Manager - lazy updates
Transitivity

● Inheritance Manager
○ typeX subClassOf

● Query
○ ?z type typeY
■ ?z type typeY
■ ?z type typeX

● subClas...
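
The query rewriting on this slide, where `?z type typeY` is expanded into patterns for `typeY` and `typeX`, can be sketched as below. The sketch assumes the hierarchy states that `typeX` is a subclass of `typeY`, which the slide text does not spell out in full. The function name and the pair-based hierarchy representation are hypothetical.

```python
def expand_type(target, sub_class_of):
    """Return `target` plus every class that is transitively a subclass of it."""
    expanded = [target]
    seen = {target}
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for child, parent in sub_class_of:
            if parent == current and child not in seen:
                seen.add(child)
                expanded.append(child)
                frontier.append(child)
    return expanded


# Assumed hierarchy: typeX subClassOf typeY.
hierarchy = [("typeX", "typeY")]
patterns = [f"?z type {t}" for t in expand_type("typeY", hierarchy)]
```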
Serialising Molecules

#TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE
#TEMPLATES - the number of templates in the molecu...
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data

dipLODocus[RDF] is a new system for RDF data processing supporting both simple transactional queries and complex analytics efficiently. dipLODocus[RDF] is based on a novel hybrid storage model considering RDF data both from a graph perspective (by storing RDF subgraphs or RDF molecules) and from a "vertical" analytics perspective (by storing compact lists of literal values for a given attribute).

http://diuf.unifr.ch/main/xi/diplodocus/

  1. Short and Long-Tail RDF Analytics for Massive Webs of Data. Marcin Wylot, Jigé Pont, Mariusz Wiśniewski, and Philippe Cudré-Mauroux. eXascale Infolab, University of Fribourg, Switzerland. International Semantic Web Conference, 26th October 2011, Bonn, Germany.
  2. Motivation ● increasingly large semantic/LoD data sets ● increasingly complex queries ○ real-time analytic queries ■ such as “return the professor who supervises the most students”. Urgent need for a more efficient and scalable solution for RDF data management.
  3. 3 recipes to speed-up
  4. 3 recipes to speed-up ○ collocation
  5. 3 recipes to speed-up ○ collocation ○ collocation
  6. 3 recipes to speed-up ○ collocation ○ collocation ○ collocation
  7. Why collocation?? Because by collocating data together we can reduce IO operations, which are one of the biggest bottlenecks in database systems.
  8. Outline ● architecture ● main idea ● data structures ● basic operations (inserts, queries) ● evaluation & results ● future work
  9. System Architecture
  10. Main Idea - Hybrid Storage
  11. Main Idea - data structures
  12. Declarative Templates
  13. Template Matching
  14. Molecule Clusters ● extremely compact sub-graphs ● precomputed joins
  15. List of Literals ● extremely compact list of sorted values
  16. Hash Table: lexicographic tree to encode URIs; template-based indexing; extremely compact lists of homologous nodes
  17. Basic operations - inserts: n-pass algorithm
  18. Basic operations - queries - triple patterns: ?x type Student. ?x takesCourse Course0. ?x type Student. ?x takesCourse Course0. ?x takesCourse Course1. => intersection of sorted lists
  19. Basic operations - queries - molecule queries: ?a name 'Student1'. ?a ?b ?c. ?c ?d ?e.
  20. Basic operations - queries - aggregates and analytics: ?x type Student. ?x age ?y filter (?y < 21)
  21. Performance Evaluation: We used the Lehigh University Benchmark (LUBM). We generated two datasets, for 10 and 100 universities. ● 1 272 814 distinct triples and 315 003 distinct strings ● 13 876 209 distinct triples and 3 301 868 distinct strings. We compared the execution times of the 14 LUBM queries and 3 analytic queries inspired by BowlognaBench. ● return the professor who supervises the most students ● return a big molecule containing everything around Student0 within scope 2 ● return the names of all graduate students
  22. Results - LUBM - 10 Universities
  23. Results - LUBM - 100 Universities
  24. Results - analytic 10 Universities
  25. Results - analytic 100 Universities
  26. Future work ● open source ○ cleaning code ○ extending code ● parallelising operations ○ multi-core architecture ○ cloud ● automated database design
  27. Conclusions ● advanced data collocation ○ molecules, RDF sub-graphs ○ lists of literals, compact sorted list of values ○ hash table indexed by templates ● slower inserts and updates ○ compact ordered structures ○ data redundancy ● 30 times faster on LUBM queries ● 350 times faster on analytic queries
  28. Thank you for your attention
  29. Update Manager - lazy updates
  30. Transitivity ● Inheritance Manager ○ typeX subClassOf typeY ● Query ○ ?z type typeY ■ ?z type typeY ■ ?z type typeX ● subClassOf ● subPropertyOf
  31. Serialising Molecules: size = #TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE, where #TEMPLATES is the number of templates in the molecule; TEMPLATE_SIZE is the size of a key in bytes; #TRIPLES is the number of triples in the molecule; KEY_SIZE is the size of a key in bytes, for example 8 in our case (Intel 64, Linux)
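
As a worked instance of the serialisation formula, the following Python sketch computes the size of a molecule in bytes. The function name and the example counts (3 templates, 12 triples) are hypothetical. The 8-byte key size comes from the slide, and using 8 bytes for `TEMPLATE_SIZE` as well is an assumption.

```python
def molecule_size(n_templates, n_triples, template_size, key_size=8):
    """Serialised size in bytes: #TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE."""
    return n_templates * template_size + n_triples * key_size


# Hypothetical molecule: 3 templates, 12 triples, 8-byte templates and keys.
size = molecule_size(n_templates=3, n_triples=12, template_size=8)
```

With these example values the molecule occupies 3 * 8 + 12 * 8 = 120 bytes.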