dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
dipLODocus[RDF] is a new system for RDF data processing supporting both simple transactional queries and complex analytics efficiently. dipLODocus[RDF] is based on a novel hybrid storage model considering RDF data both from a graph perspective (by storing RDF subgraphs or RDF molecules) and from a "vertical" analytics perspective (by storing compact lists of literal values for a given attribute).

http://diuf.unifr.ch/main/xi/diplodocus/

Transcript

  • 1. Short and Long-Tail RDF Analytics for Massive Webs of Data Marcin Wylot, Jigé Pont, Mariusz Wiśniewski, and Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland International Semantic Web Conference 26th October 2011, Bonn, Germany
  • 2. Motivation ● increasingly large semantic/LoD data sets ● increasingly complex queries ○ real-time analytic queries ■ e.g., “returning the professor who supervises the most students” ⇒ an urgent need for a more efficient and scalable solution for RDF data management
  • 3. 3 recipes to speed-up
  • 4. 3 recipes to speed-up ○collocation
  • 5. 3 recipes to speed-up ○collocation ○collocation
  • 6. 3 recipes to speed-up ○collocation ○collocation ○collocation
  • 7. Why collocation? Because collocating related data reduces I/O operations, which are one of the biggest bottlenecks in database systems.
  • 8. Outline ● architecture ● main idea ● data structures ● basic operations (inserts, queries) ● evaluation & results ● future work
  • 9. System Architecture
  • 10. Main Idea - Hybrid Storage
  • 11. Main Idea - data structures
  • 12. Declarative Templates
  • 13. Template Matching
  • 14. Molecule Clusters ● extremely compact sub-graphs ● precomputed joins
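The molecule idea on this slide can be sketched as follows: all triples around a root entity are stored together, so a star-shaped lookup is a single cluster access rather than a join. The entity and predicate names below are illustrative, not taken from the system.

```python
# Hypothetical sketch of a molecule cluster: the subgraph around a root
# entity is stored in one place, so joins over it are precomputed.
molecules = {
    "Student1": {                      # root node of the molecule
        "type": ["Student"],
        "takesCourse": ["Course0", "Course1"],
        "advisor": ["Professor3"],
    },
}

def star_query(root, predicate):
    """Answer '<root> <predicate> ?o' from a single molecule lookup."""
    return molecules.get(root, {}).get(predicate, [])

print(star_query("Student1", "takesCourse"))  # ['Course0', 'Course1']
```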
  • 15. List of Literals ● extremely compact list of sorted values
  • 16. Hash Table ● lexicographic tree to encode URIs ● template-based indexing ● extremely compact lists of homologous nodes
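The URI encoding mentioned above can be approximated with a simple dictionary mapping each URI to a compact integer key. This sketch omits the prefix sharing that an actual lexicographic tree provides; the class and URIs are illustrative.

```python
# Minimal sketch of dictionary-encoding URIs into compact integer keys,
# in the spirit of the lexicographic tree (a real implementation would
# additionally store common URI prefixes only once).
class UriDictionary:
    def __init__(self):
        self._to_id = {}   # URI -> integer key
        self._to_uri = []  # integer key -> URI

    def encode(self, uri):
        if uri not in self._to_id:
            self._to_id[uri] = len(self._to_uri)
            self._to_uri.append(uri)
        return self._to_id[uri]

    def decode(self, key):
        return self._to_uri[key]

d = UriDictionary()
k = d.encode("http://example.org/Student1")
assert d.decode(k) == "http://example.org/Student1"
```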
  • 17. Basic operations - inserts n-pass algorithm
  • 18. Basic operations - queries - triple patterns ○ ?x type Student. ?x takesCourse Course0. ○ ?x type Student. ?x takesCourse Course0. ?x takesCourse Course1. ⇒ intersection of sorted lists
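The "intersection of sorted lists" step can be sketched directly: each (predicate, object) pair keeps a sorted list of entity IDs, and a conjunctive query is a linear-time merge of those lists. The ID lists below are illustrative, not actual LUBM data.

```python
# Sketch: answer conjunctive triple patterns by intersecting the sorted
# entity-ID lists kept per (predicate, object) pair.
def intersect_sorted(a, b):
    """Linear-time intersection of two sorted lists (merge step)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

students      = [1, 2, 5, 8, 9]  # IDs matching: ?x type Student
takes_course0 = [2, 3, 5, 9]     # IDs matching: ?x takesCourse Course0
takes_course1 = [1, 2, 9]        # IDs matching: ?x takesCourse Course1

result = intersect_sorted(intersect_sorted(students, takes_course0),
                          takes_course1)
print(result)  # [2, 9]
```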
  • 19. Basic operations - queries - molecule queries ?a name 'Student1'. ?a ?b ?c. ?c ?d ?e.
  • 20. Basic operations - queries aggregates and analytics ?x type Student. ?x age ?y filter (?y < 21)
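Because literal values are kept as compact sorted lists, a range filter like `(?y < 21)` reduces to a binary search plus one contiguous slice. The ages below are illustrative.

```python
# Sketch of the analytic path: a range filter over a sorted literal list
# is a binary search followed by a contiguous slice.
import bisect

ages = [17, 18, 18, 19, 20, 21, 22, 25]  # sorted literal list for 'age'
cut = bisect.bisect_left(ages, 21)       # first index with age >= 21
under_21 = ages[:cut]
print(len(under_21))  # 5 matching students
```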
  • 21. Performance Evaluation We used the Lehigh University Benchmark (LUBM). We generated two datasets, for 10 and 100 universities. ● 1 272 814 distinct triples and 315 003 distinct strings ● 13 876 209 distinct triples and 3 301 868 distinct strings We compared the runtime execution for 14 LUBM queries and 3 analytic queries inspired by BowlognaBench. ● returning the professor who supervises the most students ● returning the big molecule containing everything around Student0 within scope 2 ● returning the names of all graduate students
  • 22. Results - LUBM - 10 Universities
  • 23. Results - LUBM - 100 Universities
  • 24. Results - analytic 10 Universities
  • 25. Results - analytic 100 Universities
  • 26. Future work ● open source ○ cleaning code ○ extending code ● parallelising operations ○ multi-core architecture ○ cloud ● automated database design
  • 27. Conclusions ● advanced data collocation ○ molecules, RDF sub-graphs ○ lists of literals, compact sorted list of values ○ hash table indexed by templates ● slower inserts and updates ○ compact ordered structures ○ data redundancy ● 30 times faster on LUBM queries ● 350 times faster on analytic queries
  • 28. Thank you for your attention
  • 29. Update Manager - lazy updates
  • 30. Transitivity ● Inheritance Manager ○ typeX subClassOf typeY ● Query ○ ?z type typeY is expanded to ■ ?z type typeY ■ ?z type typeX ● handles subClassOf ● handles subPropertyOf
  • 31. Serialising Molecules #TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE #TEMPLATES - the number of templates in the molecule TEMPLATE_SIZE - the size of a template identifier in bytes #TRIPLES - the number of triples in the molecule KEY_SIZE - the size of a key in bytes, for example 8 in our case (Intel 64, Linux)
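A worked instance of the size formula above, with hypothetical counts (the sizes of 8 bytes follow the slide's Intel 64 example; the template and triple counts are illustrative):

```python
# Worked example of the molecule serialisation size formula:
#   #TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE
TEMPLATE_SIZE = 8  # bytes per template identifier (Intel 64, Linux)
KEY_SIZE = 8       # bytes per key (Intel 64, Linux)

n_templates = 4    # templates in the molecule (hypothetical)
n_triples = 20     # triples in the molecule (hypothetical)

size = n_templates * TEMPLATE_SIZE + n_triples * KEY_SIZE
print(size)  # 4*8 + 20*8 = 192 bytes
```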