Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud

  • Two users: Alice and Bob. Put command: store “Alice follows Bob”. Use the address table to determine the virtual machine (node) that hosts Alice’s data. Write the data to that node. Use the address table to determine the node that hosts Bob’s data. Write the same data to that node. The on-line partitioning algorithm moves Alice’s data to Bob’s node because they are connected.
  • Nodes in the logical layer have to handle varying demands. If one node becomes overloaded, it can initiate a split. To maintain the grid structure, nodes in the same row and column must also split. The grid is used for replication. It is also used for efficient locking and messaging. Once the split is complete, new physical machines can be turned on. Virtual nodes can be transferred to these new machines. Similarly, as load decreases, virtual nodes can be transferred off of physical machines. Some physical machines can then be shut down to save power. Virtual nodes can be merged back together.
  • Works by improving partitions; it doesn’t create them from scratch. On-line means that it works with a changing graph whose structure changes frequently.
  • The algorithm runs in parallel on each node: when a split or merge occurs, and when load is below a threshold. Each vertex is considered in turn. Find the number of edges to each node (edges can be weighted). Find the node with the greatest number of edges. If that node is different from the current one and the gain is greater than the threshold, move the vertex.
  • Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud

    1. Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
        Alexander G. Connor, Panos K. Chrysanthis, Alexandros Labrinidis
        Advanced Data Management Technologies Laboratory
        Department of Computer Science
        University of Pittsburgh
    2. Data in social networks
        • A social network manages user profiles, updates and connections
        • How to manage this data in a scalable way?
            • Key-value stores offer performance under high load
        • Some observations about social networks
            • A profile view usually includes data from a user’s friends (spatial locality)
            • A friend’s profile is often visited next (temporal locality)
        • Requests might ask for updates from several users
        • Web pages might include pieces of several user profiles
        • A single request requires connecting to many machines
    3. Connections in a Social Network
        [Figure: social network graph; the only surviving label is “Alice”]
    4. Leveraging Locality
        • Can we take advantage of the connections?
        • What if we stored connected users’ profiles and data in the same place?
        • Locality can be leveraged
            • The number of connections is reduced
            • User data can be pre-fetched
        • We can think of this as a graph partitioning problem
            • Partitions = machines
            • Vertices = user profiles, including updates
            • Edges = connections
            • Objective: minimize the number of edges that cross partitions
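The objective on this slide can be made concrete with a small sketch. The function name, the sample users, and the two placements below are illustrative assumptions, not taken from the deck; the sketch only counts how many edges cross partitions under a given assignment of vertices to machines.

```python
def count_cut_edges(edges, partition_of):
    """Count edges whose two endpoints are placed in different partitions."""
    return sum(1 for u, v in edges if partition_of[u] != partition_of[v])


# Hypothetical connections between four users.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]

# Two hypothetical placements of the same users on two machines (0 and 1).
scattered = {"alice": 0, "bob": 1, "carol": 0, "dave": 1}
clustered = {"alice": 0, "bob": 0, "carol": 1, "dave": 1}

print(count_cut_edges(edges, scattered))  # 3
print(count_cut_edges(edges, clustered))  # 1
```

A lower count means that fetching a user's neighbors touches fewer machines, which is the locality the slide is after.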
    5. Example – graph partitioning
        • Many edges cross partitions
        • Accessing a vertex’s neighbors requires accessing many partitions
        • In a social network, requesting updates from followed users requires connecting to many machines
        • Far fewer edges cross partitions
        • Accessing a vertex’s neighbors requires accessing few partitions
        • In a social network, fewer connections are made and related user data can be pre-fetched
    10. Key-Key-Value Stores
        • Our proposed approach: extend the key-value model
        • Data can be stored as key-values
            • User profiles
        • Data can also be stored as key-key-values
            • User connections
            • “Alice follows Bob”
        • Use key-key-values to compute locality
            • On-line graph partitioning algorithm
            • Assign keys to grid locations based on connections
            • Each grid cell represents a data host
            • Keys that are related are kept together
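To illustrate the difference between key-values and key-key-values, here is a minimal in-memory sketch. The class name, method names, and the adjacency index are assumptions made for illustration; the deck does not specify how the store represents either kind of record.

```python
from collections import defaultdict


class KeyKeyValueStore:
    """Toy in-memory model; class and method names are hypothetical."""

    def __init__(self):
        self.values = {}                    # key -> value (e.g. a user profile)
        self.links = {}                     # (key1, key2) -> value (e.g. "follows")
        self.neighbors = defaultdict(set)   # key -> keys it is connected to

    def put_kv(self, key, value):
        self.values[key] = value

    def put_kkv(self, key1, key2, value):
        self.links[(key1, key2)] = value
        # Record the connection in both directions so locality can be computed.
        self.neighbors[key1].add(key2)
        self.neighbors[key2].add(key1)


store = KeyKeyValueStore()
store.put_kv("alice", {"name": "Alice"})
store.put_kv("bob", {"name": "Bob"})
store.put_kkv("alice", "bob", "follows")
print(store.links[("alice", "bob")])   # follows
print(sorted(store.neighbors["bob"]))  # ['alice']
```

The adjacency index is what a partitioning step would consult to decide which keys belong together.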
    11. Outline
        • Introduction
        • Data in Social Networks
        • Leveraging Locality
        • Key-Key-Value Stores
        • System Model
        • Client API
        • Adding a Key-Key-Value
        • Load management
        • On-line partitioning algorithm
        • Simulation Parameters
        • Results
        • Conclusion
    12. Address Table: Mapping Store
        • A transactional, distributed hash table
        • Maps keys to virtual machines
        Physical Layer: Physical machines
        • Can be added or removed dynamically as demands change
        Logical Layer: Virtual machines
        • Organized as a square grid
        • Run the KKV store software
        • Manage replication
        • Can be moved between physical machines as needed
        Application Layer: Client API
        • Maintains client sessions
        • Caches data
        [Figure labels: Application sessions, Address table, Virtual hosts, Physical hosts]
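The layering above can be sketched as two lookups: a key resolves to a virtual node (a grid cell), and a virtual node resolves to whichever physical machine currently hosts it. This is a single-process stand-in, assuming 1-based grid coordinates as in the deck's "1,1" and "8,8" labels; the class, its methods, and the machine names are hypothetical, and nothing here models the transactional or distributed aspects.

```python
class AddressTable:
    """Toy, single-process stand-in for the distributed address table."""

    def __init__(self, grid_size):
        self.grid_size = grid_size
        self.cell_of = {}      # key -> (row, col) of the hosting virtual node
        self.machine_of = {}   # (row, col) -> physical machine name

    def assign(self, key, cell):
        row, col = cell
        if not (1 <= row <= self.grid_size and 1 <= col <= self.grid_size):
            raise ValueError(f"cell {cell} is outside the grid")
        self.cell_of[key] = cell

    def place_node(self, cell, machine):
        # Virtual nodes can be moved by re-pointing them at another machine.
        self.machine_of[cell] = machine

    def locate(self, key):
        cell = self.cell_of[key]
        return cell, self.machine_of.get(cell)


table = AddressTable(grid_size=8)
table.assign("alice", (1, 1))
table.place_node((1, 1), "host-a")
print(table.locate("alice"))  # ((1, 1), 'host-a')

# Moving the virtual node to another machine does not change the key's cell.
table.place_node((1, 1), "host-b")
print(table.locate("alice"))  # ((1, 1), 'host-b')
```

Keeping the two mappings separate is what lets virtual nodes migrate between physical machines without touching per-key entries.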
    18. Client API and Sessions
        • Clients use a simple API that includes the get, put and sync commands
        • Data is pulled from the logical layer in blocks
            • Groups of related keys
        • The client API keeps data in an in-memory cache
        • Data is pushed out asynchronously to virtual nodes in blocks
        • Push/pull can be done synchronously if requested by the client
            • Offers stronger consistency at the cost of performance
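A rough sketch of a client session with get, put and sync follows. Only the three command names come from the slide; the class, the `related` mapping used to form blocks, and the plain dict standing in for the logical layer are assumptions. The asynchronous push is simplified to an explicit `sync()` call.

```python
class ClientSession:
    """Toy client cache; everything except get/put/sync is hypothetical."""

    def __init__(self, backend, related):
        self.backend = backend   # dict standing in for the logical layer
        self.related = related   # key -> keys fetched together with it
        self.cache = {}          # in-memory cache held by the client API
        self.dirty = set()       # keys written locally but not yet pushed

    def get(self, key):
        if key not in self.cache:
            # Pull a block: the requested key plus its related keys.
            for k in [key, *self.related.get(key, ())]:
                if k in self.backend and k not in self.cache:
                    self.cache[k] = self.backend[k]
        return self.cache.get(key)

    def put(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def sync(self):
        # Push all pending writes to the backend in one batch.
        for key in self.dirty:
            self.backend[key] = self.cache[key]
        self.dirty.clear()


backend = {"alice": "profile-A", "bob": "profile-B"}
session = ClientSession(backend, related={"alice": ["bob"]})

print(session.get("alice"))    # profile-A
print("bob" in session.cache)  # True, pre-fetched with alice's block

session.put("alice", "profile-A2")
print(backend["alice"])        # profile-A, not pushed yet
session.sync()
print(backend["alice"])        # profile-A2
```

Calling `sync()` after every `put` corresponds to the synchronous mode on the slide: stronger consistency, more round trips.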
    19. Adding a key-key-value
        • put(alice, bob, follows)
        • Two users: Alice and Bob
        • Use the address table to determine the virtual machine (node) that hosts Alice’s data
        • Write the data to that node
        • Use the address table to determine the node that hosts Bob’s data
        • Write the same data to that node
        • The on-line partitioning algorithm moves Alice’s data to Bob’s node because they are connected
        [Figure: the address table maps alice to 1,1 and bob to 8,8, with a second 8,8 entry that presumably shows alice after the move; virtual hosts 1,1 and 8,8 hold kv(alice, ...), kv(bob, ...) and a copy of kkv(alice, bob, follows) each]
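The write path on this slide can be sketched in a few lines. The function name and the dict-based address table and nodes are assumptions for illustration; the cell coordinates are the ones shown on the slide. The later migration of Alice's data is a separate step and is not modeled here.

```python
def put_kkv(address_table, nodes, key1, key2, value):
    """Write the key-key-value to the node hosting each of the two keys."""
    for key in (key1, key2):
        cell = address_table[key]            # look up the hosting node
        nodes[cell][(key1, key2)] = value    # write the same data there


address_table = {"alice": (1, 1), "bob": (8, 8)}
nodes = {(1, 1): {}, (8, 8): {}}

put_kkv(address_table, nodes, "alice", "bob", "follows")
print(nodes[(1, 1)])  # {('alice', 'bob'): 'follows'}
print(nodes[(8, 8)])  # {('alice', 'bob'): 'follows'}
```

Storing the record on both nodes means either endpoint can answer questions about the connection locally.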
    20. Splitting a Node
        • If one node becomes overloaded, it can initiate a split
        • To maintain the grid structure, nodes in the same row and column must also split
        • Once the split is complete, new physical machines can be turned on
            • Virtual nodes can be transferred to these new machines
        [Figure label: Virtual hosts]
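One way to read the split rule is sketched below. It assumes that splitting the node at (row, col) turns that row into two rows and that column into two columns, so an n×n grid becomes (n+1)×(n+1). The deck does not spell out the mechanics, so this interpretation, the function name, and the 1-based coordinates are all assumptions.

```python
def split_grid(n, row, col):
    """Map each old cell of an n x n grid to the new cells that replace it.

    Assumed rule: row `row` and column `col` are each divided in two,
    and everything after them shifts by one.
    """
    def expand(index, split_at):
        if index < split_at:
            return [index]
        if index == split_at:
            return [index, index + 1]   # the split row/column becomes two
        return [index + 1]              # later rows/columns shift by one

    return {
        (r, c): [(nr, nc) for nr in expand(r, row) for nc in expand(c, col)]
        for r in range(1, n + 1)
        for c in range(1, n + 1)
    }


mapping = split_grid(2, row=1, col=1)
print(mapping[(1, 1)])  # [(1, 1), (1, 2), (2, 1), (2, 2)]
print(mapping[(1, 2)])  # [(1, 3), (2, 3)]
print(mapping[(2, 2)])  # [(3, 3)]
```

Under this reading, the overloaded node becomes four nodes, its row and column neighbors become two each, and all other nodes only change coordinates.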
    21. Outline
        • Introduction
        • Data in Social Networks
        • Leveraging Locality
        • Key-Key-Value Stores
        • System Model
        • Client API
        • Adding a Key-Key-Value
        • Load management
        • On-line Partitioning Algorithm
        • Simulation Parameters
        • Results
        • Conclusion
    22. On-line Partitioning Algorithm
        • Runs periodically in parallel on each virtual node
            • Also after a split or merge
        • For each key stored on a node
            • Determine the number of connections (key-key-values) with keys on other nodes
                • Can also be sum of edge weights
            • Find the node that has the most connections
            • If that node is different than the current node
                • If the number of connections to that node is greater than the number of connections to the current node
                    • If this margin is greater than some threshold
                        • Move the key to the other node
                        • Update the address table
        • Designed to work in a distributed, dynamic setting
        • NOT a replacement for off-line algorithms in static settings
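The per-key decision described on this slide can be sketched as follows. The function names, the dict-based edge and address structures, and the default threshold of 0 are assumptions; the sketch runs one sequential pass for a single node and does not model the parallel, distributed execution.

```python
from collections import Counter


def best_move(key, current_node, edges, node_of, threshold=0):
    """Return the node `key` should move to, or None if it should stay.

    edges:   key -> {neighbor_key: edge_weight}
    node_of: key -> node currently hosting that key (the address table)
    """
    totals = Counter()
    for neighbor, weight in edges.get(key, {}).items():
        totals[node_of[neighbor]] += weight
    if not totals:
        return None
    best_node, best_total = max(totals.items(), key=lambda item: item[1])
    margin = best_total - totals.get(current_node, 0)
    if best_node != current_node and margin > threshold:
        return best_node
    return None


def repartition_pass(node, edges, node_of, threshold=0):
    """Consider every key on `node` in turn and move those that qualify."""
    moved = []
    for key in [k for k, n in node_of.items() if n == node]:
        target = best_move(key, node, edges, node_of, threshold)
        if target is not None:
            node_of[key] = target   # update the address table
            moved.append((key, target))
    return moved


# Hypothetical vertex "v" on node (1, 1) with two edges to keys on node
# (2, 1) and one edge to a key on node (1, 2).
edges = {"v": {"a": 1, "b": 1, "c": 1}}
node_of = {"v": (1, 1), "a": (2, 1), "b": (2, 1), "c": (1, 2)}

print(repartition_pass((1, 1), edges, node_of))  # [('v', (2, 1))]
```

Raising `threshold` makes the algorithm more conservative: with `threshold=2` the same vertex would stay put, because the margin of 2 does not exceed it.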
    23. Partitioning Example
        [Figure: grid nodes 2,1, 1,1 and 1,2]
        Node    Sum(Edges)
        1,1     0
        2,1     2
        1,2     1
    24. Partitioning Example
        [Figure: grid nodes 2,1, 1,1 and 1,2]
    25. Experimental Parameters
    26. Partitioning Quality Results
        [Plot axes: % edges in partition; vertices in graph]
        • On-line partitions as well as Kernighan-Lin
    27. Partitioning Performance Results
        [Plot axes: vertices moved; vertices in graph]
        • On-line partitions 2x faster than Kernighan-Lin!
    28. Conclusions
        • Contributions:
            • A novel model for scalable graph data stores that extends the key-value model
                • Key-key-value store
            • A high-level system design
            • A novel on-line partitioning algorithm
            • Preliminary experimental results
        • Our proposed algorithm shows promise in the distributed, dynamic setting
    29. What’s Ahead?
        • Prototype system implementation
            • Java, PostgreSQL
        • Performance Analysis against MongoDB, Cassandra
        • Sensitivity Analysis
        • Cloud Deployment
    30. Thank You!
        Acknowledgments
        • Daniel Cole, Nick Farnan, Thao Pham, Sean Snyder
        • ADMT Lab, CS Department, Pitt GPSA, Pitt A&S GSO, Pitt A&S PBC
