Sparksee Technology overview

Implementation details of Sparksee's graph database: learn how bitmaps store graph information and how this results in a lightweight, high-performance solution.

Published in: Data & Analytics
Transcript

  • 1. Sparksee Graph Database: Technology overview. April 2014. Sparsity Technologies, Powering Extreme Data (sparsity-technologies.com)
  • 2. Index: Graph Databases; Introduction to Sparksee; Sparksee Internals; Performance analysis; High scalability; HPC-SGAB Benchmark
  • 3. Index: Graph Databases
  • 4. Graphs are everywhere:
      - Increasing number of huge networks such as the Web, social networks, biological systems, GPS…
      - Very large graphs
      - Interest in analyzing the interrelations between the entities in these networks
  • 5. Classical graph representation:
      - Adjacency matrix: very large NxN sparse matrix; no labels, no multigraph, no attributes
      - Adjacency list: no labels, no attributes, still space-consuming
  • 6. Classical graph storage:
      - Relational database: pre-fixed schema or a very large table for nodes and edges; not suitable for path traversals and graph exploration
      - XML: XML data is stored in the form of trees; much work has been done on finding exact or approximate patterns (subtrees); not designed for complex graph queries
      - RDF: widely adopted standard for manipulating graph-like data; broad support from large vendors; SPARQL has become a de facto standard
  • 7. New approaches to graph analysis:
      - Complex analysis computations on very large distributed graphs: Map-reduce (Pegasus); vertex-centric computation model (Pregel)
      - Graph databases: database functionalities to store and query graph-like data. Graph storage in the file system of a single computer node with a buffer pool (Neo4j, HypergraphDB, OrientDB, InfiniteGraph); multiple servers accessible through a load balancer (Neo4j HA)
  • 8. Requirements for graph databases:
      - Data and schema represented as a graph
      - Data operations based on graph operations
      - Graph-based integrity restrictions
      - Multigraphs
      - Attributes attached to both vertices and edges
      - Graph queries combining edge traversals with attribute accesses
      - Diversity of workloads
      - Efficient secondary memory management
  • 9. Index: Introduction to Sparksee
  • 10. Sparksee IS a high-performance, out-of-core graph database management system FOR large-scale labeled and attributed multigraphs, BASED ON vertical partitioning and collections of object identifiers stored as bitmaps
  • 11. Sparksee: Characteristics
      - Graph split into small structures: move only the significant parts to main memory (caching)
      - Object identifiers (oids) instead of complex objects: reduce memory requirements
      - Specific structures to improve traversals: index the edges and the neighbors of each node
      - Attribute indices: improve queries based on value filters
      - Implemented in C++: different APIs (Java, .NET, etc.) through wrappers
  • 12. Sparksee: Capabilities
      - Efficiency: very compact representation using bitmaps; highly compressible data structures
      - Capacity: more than 100 billion vertices and edges in a single multicore computer
      - Performance: subsecond response in recommendation queries
      - Scalability: high throughput for concurrent queries
      - Consistency: partial transactional support with recovery
      - Multiplatform: Linux, Windows, MacOSX, Mobile
  • 13. Logical graph model:
      - Labeled: a label (type) for each vertex and edge
      - Directed: edges can have a fixed direction, from tail to head
      - Attributed: variable list of attributes for each vertex and edge
      - Multigraph: multiple edges between two vertices
  • 14. Sparksee: Architecture [diagram not included in transcript]. Labels in the diagram: C++ App, Java App, .NET App, Python App, Mobile App; SparkseeJava, SparkseeNet, SparkseePython; SWIG; SparkseeCpp (Graph Algorithms); DEXCORE; GDB; Graph; Data Structures; Buffer Pool; Platform
  • 15. Index: Sparksee internals
  • 16. Graph representation. We define a graph G = (V, E, L, T, H, A1, …, Ap) as:
      LABELS      L  = {(o, l) | o ∈ (V ∪ E) ∧ l ∈ string}
      TAILS       T  = {(e, t) | e ∈ E ∧ t ∈ V}
      HEADS       H  = {(e, h) | e ∈ E ∧ h ∈ V}
      ATTRIBUTES  Ai = {(o, c) | o ∈ (V ∪ E) ∧ c ∈ {int, string, ...}}
    With this representation:
      - the graph is split into multiple lists of pairs
      - the first element of each pair is always a vertex or an edge
  • 17. Graph representation: example pair lists for L, T, H, A_id, A_title, A_nlc, A_filename and A_tag (truncated in this transcript; slide 18 lists the same pairs in full)
  • 18. Value sets. A value set groups all pairs of the original set that share the same value into a single pair between that value and the set of objects with that value.
      L (labels)
        pairs:     (v1, ARTICLE), (v2, ARTICLE), (v3, ARTICLE), (v4, ARTICLE), (v5, IMAGE), (v6, IMAGE), (e1, BABEL), (e2, BABEL), (e3, REF), (e4, REF), (e5, CONTAINS), (e6, CONTAINS), (e7, CONTAINS)
        value set: (ARTICLE, {v1, v2, v3, v4}), (BABEL, {e1, e2}), (CONTAINS, {e5, e6, e7}), (IMAGE, {v5, v6}), (REF, {e3, e4})
      T (tails)
        pairs:     (e1, v1), (e2, v2), (e3, v4), (e4, v4), (e5, v3), (e6, v3), (e7, v4)
        value set: (v1, {e1}), (v2, {e2}), (v3, {e5, e6}), (v4, {e3, e4, e7})
      H (heads)
        pairs:     (e1, v3), (e2, v3), (e3, v3), (e4, v3), (e5, v5), (e6, v6), (e7, v6)
        value set: (v3, {e1, e2, e3, e4}), (v5, {e5}), (v6, {e6, e7})
      A_id
        pairs:     (v1, 1), (v2, 2), (v3, 3), (v4, 4), (v5, 1), (v6, 2)
        value set: (1, {v1, v5}), (2, {v2, v6}), (3, {v3}), (4, {v4})
      A_title
        pairs:     (v1, Europa), (v2, Europe), (v3, Europe), (v4, Barcelona)
        value set: (Barcelona, {v4}), (Europa, {v1}), (Europe, {v2, v3})
      A_nlc
        pairs:     (v1, ca), (v2, fr), (v3, en), (v4, en), (e1, en), (e2, en)
        value set: (ca, {v1}), (en, {v3, v4, e1, e2}), (fr, {v2})
      A_filename
        pairs:     (v5, europe.png), (v6, bcn.jpg)
        value set: (bcn.jpg, {v6}), (europe.png, {v5})
      A_tag
        pairs:     (e4, continent)
        value set: (continent, {e4})
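The grouping step on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Sparksee code: the function name `to_value_set` is hypothetical, and plain Python sets stand in for the bitmaps introduced on the next slides.

```python
from collections import defaultdict

def to_value_set(pairs):
    """Group (object, value) pairs into {value: set of objects with that value}."""
    value_set = defaultdict(set)
    for obj, value in pairs:
        value_set[value].add(obj)
    return dict(value_set)

# Pair list T (TAILS) from the slide: (edge, tail vertex).
TAILS = [("e1", "v1"), ("e2", "v2"), ("e3", "v4"), ("e4", "v4"),
         ("e5", "v3"), ("e6", "v3"), ("e7", "v4")]

tails_value_set = to_value_set(TAILS)
for vertex in sorted(tails_value_set):
    print(vertex, sorted(tails_value_set[vertex]))
```

Applied to T, the result matches the value set shown on the slide: each tail vertex is paired with the set of edges that leave it.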
  • 19. Bitmap representation:
      - Each vertex and edge is identified by a unique and immutable oid (object identifier)
      - Each vertex or edge set is stored in a bitmap structure: each position in a bitmap corresponds to the oid of an object; reduced amount of space (compression techniques); very efficient binary logic operations
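A minimal sketch of the bitmap idea, assuming small integer oids and using an arbitrary-length Python integer as an uncompressed bitmap. The oid numbering and the helper names `to_bitmap` and `from_bitmap` are illustrative assumptions, not taken from the slides or from the Sparksee API.

```python
def to_bitmap(oids):
    """Set bit number `oid` for every oid in the collection."""
    bitmap = 0
    for oid in oids:
        bitmap |= 1 << oid
    return bitmap

def from_bitmap(bitmap):
    """Return the oids whose bits are set, in ascending order."""
    oids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            oids.append(position)
        bitmap >>= 1
        position += 1
    return oids

# Hypothetical oids: objects 1-4 are articles, objects 3, 4, 7, 8 are tagged 'en'.
articles = to_bitmap([1, 2, 3, 4])
english = to_bitmap([3, 4, 7, 8])

# Set intersection becomes a single bitwise AND over the two bitmaps.
english_articles = articles & english
print(from_bitmap(english_articles))
```

Union and difference map to `|` and `& ~` in the same way, which is why set-heavy queries reduce to cheap binary logic operations.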
  • 20. Value set representation. A value set is represented as two maps:
      - One maps each distinct value to a vertex or edge set
      - The other maps each vertex or edge oid to a value
  • 21. Example of a bitmap-based representation [figure not included in transcript]
  • 22. Integrity rules [figure not included in transcript]
  • 23. Value set operations:
      - domain: returns the set of distinct values
      - objects: returns the set of vertices or edges associated with a value
      - lookup: returns the set of values associated with a set of objects
      - insert: adds a vertex or edge to the collection of objects of a value
      - remove: removes a vertex or edge from the collection of objects of a value
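The five operations can be sketched on top of the two-map representation as follows. This is a hedged illustration: the class name `ValueSet`, the method signatures, and the use of Python dicts and sets (in place of the trees and bitmaps the slides describe) are all assumptions made for the example.

```python
class ValueSet:
    """Two maps: value -> set of objects, and object -> value."""

    def __init__(self):
        self._objects_by_value = {}
        self._value_by_object = {}

    def insert(self, obj, value):
        """Associate obj with value, replacing any previous value of obj."""
        self.remove(obj)
        self._value_by_object[obj] = value
        self._objects_by_value.setdefault(value, set()).add(obj)

    def remove(self, obj):
        """Drop obj from the collection of its current value, if it has one."""
        if obj not in self._value_by_object:
            return
        value = self._value_by_object.pop(obj)
        members = self._objects_by_value[value]
        members.discard(obj)
        if not members:
            del self._objects_by_value[value]

    def domain(self):
        """Set of distinct values."""
        return set(self._objects_by_value)

    def objects(self, value):
        """Set of vertices or edges associated with a value."""
        return set(self._objects_by_value.get(value, ()))

    def lookup(self, objs):
        """Set of values associated with a set of objects."""
        return {self._value_by_object[o] for o in objs
                if o in self._value_by_object}

# The TITLE attribute from the running example.
title = ValueSet()
for vertex, name in [("v1", "Europa"), ("v2", "Europe"),
                     ("v3", "Europe"), ("v4", "Barcelona")]:
    title.insert(vertex, name)

print(sorted(title.domain()))
print(sorted(title.objects("Europe")))
```

Keeping both directions lets `objects` and `lookup` each run without scanning the other map, at the cost of updating two structures on every insert and remove.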
  • 24. Graph query examples:
      - Number of articles:
          |objects(LABELS, 'ARTICLE')|
      - Out-degree of the English article 'Europe':
          |objects(TAILS, objects(TITLE, 'Europe') ∩ objects(NLC, 'en') ∩ objects(LABELS, 'ARTICLE'))|
      - Articles with references to the image with filename 'bcn.jpg':
          {lookup(TAILS, x) | x ∈ objects(HEADS, objects(FILENAME, 'bcn.jpg') ∩ objects(LABELS, 'IMAGE'))}
      - Count of the articles in each language:
          {(x, y) | x ∈ domain(NLC) ∧ y = |objects(NLC, x) ∩ objects(LABELS, 'ARTICLE')|}
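The four queries can be worked through on the example data from slide 18. The sketch below is an assumption-laden illustration: value sets are plain dicts of Python sets, and the helper names `objects_any` and `tail_of` are hypothetical stand-ins for `objects` applied to a set of values and for `lookup` on TAILS.

```python
# Value sets from the running example (value -> set of objects).
LABELS = {"ARTICLE": {"v1", "v2", "v3", "v4"}, "IMAGE": {"v5", "v6"},
          "BABEL": {"e1", "e2"}, "REF": {"e3", "e4"},
          "CONTAINS": {"e5", "e6", "e7"}}
TAILS = {"v1": {"e1"}, "v2": {"e2"}, "v3": {"e5", "e6"},
         "v4": {"e3", "e4", "e7"}}
HEADS = {"v3": {"e1", "e2", "e3", "e4"}, "v5": {"e5"}, "v6": {"e6", "e7"}}
TITLE = {"Barcelona": {"v4"}, "Europa": {"v1"}, "Europe": {"v2", "v3"}}
NLC = {"ca": {"v1"}, "en": {"v3", "v4", "e1", "e2"}, "fr": {"v2"}}
FILENAME = {"bcn.jpg": {"v6"}, "europe.png": {"v5"}}

def objects_any(value_set, values):
    """Union of the object sets of every value in `values`."""
    result = set()
    for value in values:
        result |= value_set.get(value, set())
    return result

# Inverse direction of TAILS (edge -> tail vertex), playing the role of lookup.
tail_of = {edge: vertex for vertex, edges in TAILS.items() for edge in edges}

# Number of articles.
article_count = len(LABELS["ARTICLE"])

# Out-degree of the English article 'Europe'.
europe_en = TITLE["Europe"] & NLC["en"] & LABELS["ARTICLE"]
out_degree = len(objects_any(TAILS, europe_en))

# Articles with references to the image with filename 'bcn.jpg'.
image = FILENAME["bcn.jpg"] & LABELS["IMAGE"]
referencing = {tail_of[edge] for edge in objects_any(HEADS, image)}

# Count of the articles in each language.
per_language = {lang: len(objs & LABELS["ARTICLE"]) for lang, objs in NLC.items()}

print(article_count, out_degree, sorted(referencing), per_language)
```

Every step is either a set intersection or a map access, which is the access pattern the bitmap and value set structures are built for.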
  • 25. Implementation details:
      - Bitmaps are compressed by grouping the bits into clusters of 32 consecutive bits (up to 137 billion objects per graph)
      - Locality is improved by generating consecutive oids for each distinct vertex or edge label
      - A sorted tree structure of bitmap clusters speeds up the insert, remove, and binary logic operations
      - Maps are implemented using B+ trees
      - The tail, head and attribute value sets have been split into specific value sets for each label
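A rough sketch of the clustering idea, under stated assumptions: only non-empty 32-bit clusters are stored, keyed by cluster index, and a Python dict replaces the sorted tree mentioned on the slide. The class name `ClusteredBitmap` is hypothetical. The slide does not explain the 137 billion limit; one reading consistent with it is 2^32 cluster indices times 32 bits per cluster, which gives 2^37 = 137,438,953,472 positions.

```python
CLUSTER_BITS = 32

class ClusteredBitmap:
    """Sparse bitmap that stores only the non-empty 32-bit clusters."""

    def __init__(self, oids=()):
        self.clusters = {}  # cluster index -> 32-bit word
        for oid in oids:
            self.add(oid)

    def add(self, oid):
        index, offset = divmod(oid, CLUSTER_BITS)
        self.clusters[index] = self.clusters.get(index, 0) | (1 << offset)

    def __contains__(self, oid):
        index, offset = divmod(oid, CLUSTER_BITS)
        return bool((self.clusters.get(index, 0) >> offset) & 1)

    def intersect(self, other):
        """Bitwise AND, visiting only the clusters present in self."""
        result = ClusteredBitmap()
        for index, word in self.clusters.items():
            common = word & other.clusters.get(index, 0)
            if common:
                result.clusters[index] = common
        return result

    def oids(self):
        return [index * CLUSTER_BITS + offset
                for index in sorted(self.clusters)
                for offset in range(CLUSTER_BITS)
                if (self.clusters[index] >> offset) & 1]

# Assumed derivation of the limit: 2**32 cluster indices of 32 bits each.
MAX_OBJECTS = 2**32 * CLUSTER_BITS

a = ClusteredBitmap([3, 40, 1_000_000])
b = ClusteredBitmap([5, 40, 1_000_000])
print(a.intersect(b).oids(), len(a.clusters), MAX_OBJECTS)
```

A set with one member near oid one million costs a single stored cluster here, rather than a million-bit array, which is the point of skipping empty clusters.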
  • 26. Index: Performance analysis
  • 27. Queries:
      - Q1: Find the article with the largest outdegree and traverse its shortest path tree
      - Q2: Recommend articles related to the most popular one
      - Q3: Find new images for articles from translations in other languages
      - Q4: Find, for each different language, the number of articles and images referenced
      - Q5: For each article with images, materialize the count of images
      - Q6: Remove all articles without images
    Query classification table (Q1 to Q6 against the categories below; the column positions of the + marks are lost in this transcript):
      - k-hops and path traversals: + +
      - graph pattern matching: +
      - aggregations and edge connectivity: +
      - graph transformation: + +
  • 28. Performance out-of-core. Wikipedia benchmark, out-of-core, 1 GB buffer pool.
                         MonetDB     MySQL       Neo4j (*)   Sparksee
      Graph size (GB)    12.00       15.72       42.00       16.98
      Load (h)           Error       1.36        8.99        2.89
      Q1 (s)             4,801.6     > 12 h      > 12 h      120.5
      Q2 (s)             3,788.4     13,841.6    > 12 h      205.4
      Q3 (s)             458.9       33.0        481.0       10.8
      Q4 (s)             279.3       45.0        > 12 h      144.9
      Q5 (s)             267.4       930.3       > 12 h      140.9
      Q6 (s)             Error       10,707.0    > 12 h      25,791.6
      (*) Java VM with 45 GB
  • 29. Query statistics:
      Query   results       edge trav.     edge trav./sec   mem (MB)     bitmaps
      Q1      624,525       236,387,207    1,987,616.30     832.19       42.97%
      Q2      5             261,735,954    1,270,747.94     2,974.50     48.59%
      Q3      51,780        1,536,698      143,885.58       320.81       48.00%
      Q4      254           4,987,879      33,984.32        245.13       77.67%
      Q5      2,401,597     5,934,724      42,072.39        319.00       80.64%
      Q6      52,380,949    281,433,106    37,434.27        11,583.88    67.76%
  • 30. Bitmap memory usage (MB; "-" marks an empty cell):
                        Size        Q1          Q2          Q3          Q4          Q5        Q6
      LABELS            13.56       11.60       11.60       11.60       11.60       11.60     1.51
      TAILS             1,272.32    1,030.90    857.09      229.67      164.79      164.79    90.18
      HEADS             633.98      -           506.98      -           -           -         47.09
      Attr. ID          122.77      -           -           -           -           -         0.85
      Attr. TITLE       835.92      -           -           -           -           -         10.87
      Attr. NLC         3,618.49    -           -           791.29      833.64      -         617.15
      Attr. FILENAME    769.79      -           -           -           -           -         -
      Attr. TAG         31.94       -           -           -           -           -         2.29
      TOTAL             7,298.77    1,042.50    1,375.67    1,032.56    1,010.03    176.39    769.94
  • 31. Analysis of bitmap usage [figure not included in transcript]
  • 32. Bitmap size distribution [figure not included in transcript]
  • 33. Out-of-core stress test [figure not included in transcript]
  • 34. RMAT/Query 1 scalability test [figure not included in transcript]. Figure note: 2^28 is out-of-core (2 billion edges)
  • 35. SNA Benchmark: Q1, Q6, Q9 and Q12 [figure not included in transcript]
  • 36. Index: High scalability
  • 37. High scalability test: mirror servers in Amazon Elastic with a load balancer [figure not included in transcript]
  • 38. Index: HPC-SGAB Benchmark
  • 39. Definition:
      - HPC-SGAB: Bader et al. 2009. Measured in TEPS: traversed edges per second
      - Graph: synthetic (R-MAT); power law distribution; average of 8 edges/node
      - Operations:
          Kernel 1: load graph and create indexes
          Kernel 2: find the edge(s) with maximum weight
          Kernel 3: k-hops
          Kernel 4: betweenness centrality (Brandes algorithm)
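To make the TEPS metric concrete, here is a small sketch of a k-hops traversal that counts the edges it follows. It is a plain breadth-first search over an adjacency dict, written for illustration only; it is not the benchmark's reference kernel, and the function names and the toy graph are assumptions.

```python
def k_hops(adjacency, source, k):
    """Return the vertices reachable within k hops and the number of edges traversed."""
    visited = {source}
    frontier = [source]
    traversed = 0
    for _ in range(k):
        next_frontier = []
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                traversed += 1  # every followed edge counts, even to a visited vertex
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited - {source}, traversed

def teps(traversed_edges, seconds):
    """Traversed edges per second."""
    return traversed_edges / seconds

# Toy directed graph: vertex -> list of head vertices.
graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [], 5: []}
reached, traversed = k_hops(graph, 0, 2)
print(sorted(reached), traversed)
```

In a real run the elapsed time would come from a wall-clock timer around the kernel, and `teps(traversed, elapsed)` would give the reported figure.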
  • 40. Experimental setup:
      - Systems tested: Sparksee (formerly DEX), Neo4j, HypergraphDB, Jena (RDF)
      - Platform: single computer with 2 quad-core Xeon E5410; 11 GB RAM; LFF 2.25 TB disk; single threaded
      - Default benchmark configuration
  • 41. Summary of results [figure not included in transcript]
  • 42. Kernel 1: Load time [figure not included in transcript]
  • 43. Kernel 4: Betweenness centrality [figure not included in transcript]
  • 44. Bibliography:
      - R. Angles, A. Prat, D. Dominguez, J.L. Larriba. Benchmarking database systems for social network applications (GRADES 2013)
      - N. Martínez, V. Muntés, S. Gómez, M.A. Águila, D. Dominguez, J.L. Larriba. Efficient Graph Management Based On Bitmap Indices (IDEAS 2012)
      - N. Martínez, S. Gómez, F. Escalé. DEX: a High-Performance Graph Database Management System (GDM 2011)
      - D. Dominguez, P. Urbón, A. Giménez, S. Gómez, N. Martínez, J.L. Larriba. Survey of Graph Database Performance on the HPC Scalable Graph Analysis Benchmark (IWDG 2010)
      - N. Martínez, V. Muntés, S. Gómez, J. Nin, M.A. Sánchez, J. Larriba. Dex: High-performance Exploration on Large Graphs for Information Retrieval (CIKM 2007)
  • 45. Thanks! Q&A
