Avoiding Deadlocks: Lessons Learned with Zephyr Health Using Neo4j and MongoDB - Mahesh Chaudhari @ GraphConnect SF 2013

3,662 views
3,632 views

Published on

Z-Platform is the new innovative powerful and complex platform to ingest data of any kind and store the data in the form of JSON documents in MongoDB and represent a sparse representation of the same in Neo4j graph database. Mahesh discusses how he tackled deadlocks and improved the performance of the system significantly. The test environment included small graphs (ranging up to 10000 relationships to very large graphs (ranging up to 39 million relationships). The average performance of the system is 3741 relationships per minute.

Published in: Technology, Health & Medicine
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
3,662
On SlideShare
0
From Embeds
0
Number of Embeds
1,077
Actions
Shares
0
Downloads
32
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide
  • 11 Properties on each node and 11 properties on each edge.
  • Avoiding Deadlocks: Lessons Learned with Zephyr Health Using Neo4j and MongoDB - Mahesh Chaudhari @ GraphConnect SF 2013

    1. 1. Avoiding Deadlocks in Neo4j on Z-Platform - Mahesh Chaudhari, Cesar Arevalo & Brian Roy
    2. 2. Outline • • • • • • Introduction to the Z-Platform Problems caused by Deadlocks Locks and Deadlocks in Neo4j Avoidance using Bipartite graphs Performance Conclusion 2
    3. 3. Full Entity Profiles Z-Platform Sparse Representatio n of Profiles MongoDB Database Json Documents Nodes & Edges Neo4j Graph Database Z-Platform Source Datasets 3
    4. 4. Deadlocks in Z-Platform • Creating relationships is one of the most time consuming processes • Log analysis reveals deadlocks among batch transactions and retry-mechanism takes time • Dependent on how nodes and relationships are grouped together • Batch size is dependent on the size of the JSON block sent to the server • Time required to build relationships and resolve deadlocks is in the order of seconds 4
    5. 5. Locks in Neo4j • Create a Node n1  Write Lock on Node n1 • Update a Node n1  Write Lock on Node n1 but read available on Node n1 • Create a Relationship r1 between nodes n1 and n2  Write Locks on relationship r1, n1 and n2 5
    6. 6. Deadlocks across processes A B P1 C D • Processes: P1 and P2 • Nodes: A, B, C, D • Relationships: R1, R2, R3, R4 P2 R1 A No Deadlocks B R4 A C B R2 No Deadlocks C R1 R1 A P1 P1 A B P1 R3 A R3 D P2 A D P2 Possibility of Deadlock D R2 C Deadlock 6 P2 D
    7. 7. Deadlocks across Transactions • Transactions are also like separate processes but in a single thread or multiple threads • Deadlocks occur across transactions – Two concurrent transactions need write locks on the same node n1 – In two concurrent transactions, T1 has write lock on node n1 and waiting on write lock on node n2 whereas T2 has write lock on n2 and is waiting for write lock on node n1 – Transactions of varying sizes 7
    8. 8. Concurrent Transactions Deadlocks A B T1 A B T1 C D T2 A D T2 No Deadlocks Possibility of Deadlock 8
    9. 9. Sequential Asynchronous Transactions Deadlocks A E B . . n edges . F A T1 E B A . . n edges . F D T1 E C No Deadlocks D T2 A T1 D T2 A F . . n edges . B No Deadlock Deadlock 9 T2
    10. 10. Deadlocks Detection and Avoidance • Deadlocks Detection – Only possible at run-time – Recovery from deadlock is either to abort or retry • Deadlocks Avoidance – Reorder the operations to lower or eliminate the likelihood of deadlocks • Graph Clustering Algorithms: Most of them require knowledge of entire graph Clustering Relationships  Bipartite Graphs 10
    11. 11. Bipartite Graphs • Given a Graph G with Vertices V and Edges E, then graph G is a bipartite graph such that vertices V can be partitioned into two independent sets V1 and V2. V1 A V2 A E D C C D E B B 11
    12. 12. Creating Bipartite Graphs • Use two colors to color each node such that no two adjacent nodes have the same color. 1 2 A V1 V2 A E D C C D E B B 12
    13. 13. Non-Bipartite Graphs 1 2 V1 A E V2 A D C C D E B B 13
    14. 14. Algorithm to generate Graph V1 V2 A D C E B • Create all the nodes • Create batches of relationships among the same colored nodes • Create batches of relationships across the two colors 14
    15. 15. Algorithm in Z-Platform • Batch of relationships R = {r1, r2, r3….. rn} : – each r is a triplet {src, dest, props} where src and dest are nodes and props is a set of key-value pairs • Color the nodes based on each relationship with two colors • Mark the conflicting edges where both the src and dest nodes are of the same color • Batch these relationships together in a single batch • Start grouping the remaining edges such that no two batches have any node in common 15
    16. 16. Performance – Test Setup • • • • • JDK 1.7 Neo4j Java Binding Rest API Neo4j Enterprise Server 1.9 Batch size (configurable) : 2000 Test Program that generates random nodes (max 1000) and relationships (max 10,000) • Huge file that contains 10,226 nodes and 39,564,960 relationships (5 GB) 16
    17. 17. Performance – Creating Nodes Time in seconds for Nodes 1.8 1.6 1.4 1.2 1 Time in Secs 0.8 0.6 0.4 0.2 0 1 2 3 4 5 6 • 10,226 Nodes: 5.07 seconds • Average Time for 2000 Nodes: 0.99 seconds ~ 1 second • Each Node has 11 properties 17
    18. 18. Performance – Creating Relationships Time in Seconds for relationships 1.4 1.2 1 0.8 Time in Secs 0.6 0.4 0.2 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 • 1,74,000 Relationships created in 47.16 seconds • Average Time for 2000 relationships: 0.54 seconds • Number of relationships per second: 3,689 18
    19. 19. Performance – Creating 39 Million Relationships • 39,564,960 Relationships in : 10,573.56 seconds (2 hrs 56 mins 13 seconds) • Average Time for 2000 relationships: 0.53 seconds • Number of relationships per second: 3,741 19
    20. 20. Graph Visualized in the Neo4j 2.0 20
    21. 21. Future Work • Test performance over the network using Amazon EC2 servers to mimic real world setup • Single threaded application  multi-threaded to see if better performance – More complex algorithm to batch relationships together – Analyze if the complexity is worth the performance improvement • Vary multiple factors: – Batch size : 1000 to 4000 – Properties (relationship descriptors) : 2 – 20 • Dispatcher Pattern to facilitate the single point distribution of nodes and relationships to threads/Transactions 21
    22. 22. Conclusion • Deadlocks in general are time consuming and difficult to detect and prevent • Use of graph coloring to partition graph into conflicting and non-conflicting edges • Successful prototype tests shows significant improvement in building relationships varying from small number to a very large number 22
    23. 23. Dr. Mahesh Chaudhari Sr. Software Engineer +1 602 524 0610 mahesh@zephyrhealthinc.com jobs@zephyrhealthinc.com 23
    24. 24. Contact Information Sven Junkergård Brian Roy Director of Technology +1 415 503 7412 sven@zephyrhealthinc.com Director of Platform Engineering & Architect +1 415 663 6919 brian@zephyrhealthinc.com Zephyr Health Inc. 589 Howard St. 3rd Flr. San Francisco, California 94105 +1.415.529.7649 zephyrhealthinc.com 24

    ×