Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud

Slide notes

  • Two users: Alice and Bob. The put command stores “Alice Follows Bob”. Use the Address Table to determine the virtual machine (node) that hosts Alice’s data and write the data to that node. Then use the Address Table to determine the node that hosts Bob’s data and write the same data to that node. The on-line partitioning algorithm moves Alice’s data to Bob’s node because they are connected.
  • Nodes in the logical layer have to handle varying demands. If one node becomes overloaded, it can initiate a split; to maintain the grid structure, nodes in the same row and column must also split. The grid is used for replication and for efficient locking and messaging. Once the split is complete, new physical machines can be turned on and virtual nodes can be transferred to them. Similarly, as load decreases, virtual nodes can be transferred off of physical machines, some physical machines can then be shut down to save power, and virtual nodes can be merged back together.
  • The partitioning algorithm works by improving partitions rather than creating them from scratch. “On-line” means that it works with a changing graph whose structure frequently changes.
  • The algorithm runs in parallel on each node, when a split or merge occurs or when load is below a threshold. Each vertex is considered in turn: find the number of edges to each node (edges can be weighted) and find the node with the greatest number of edges. If that node is different from the current one, and the gain exceeds a threshold, move the vertex.

Transcript

  • 1. Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
    Alexander G. Connor
    Panos K. Chrysanthis
    Alexandros Labrinidis
    Advanced Data Management Technologies Laboratory
    Department of Computer Science
    University of Pittsburgh
  • 2. Data in social networks
    A social network manages user profiles, updates and connections
    How to manage this data in a scalable way?
    Key-value stores offer performance under high load
    Some observations about social networks
    A profile view usually includes data from a user’s friends
    Spatial locality
    A friend’s profile is often visited next
    Temporal locality
    Requests might ask for updates from several users
    Web pages might include pieces of several user profiles
    A single request requires connecting to many machines
  • 3. Connections in a Social Network
    [Figure: Alice and her connections in the social network graph]
  • 4. Leveraging Locality
    Can we take advantage of the connections?
    What if we stored connected users’ profiles and data in the same place?
    Locality can be leveraged
    The number of connections is reduced
    User data can be pre-fetched
    We can think of this as a graph partitioning problem…
    Partitions = machines
    Vertices = user profiles, including updates
    Edges = connections
    Objective: minimize the number of edges that cross partitions
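To make the objective on the slide above concrete, here is a small illustrative sketch (not from the paper; `partition_of` and `edges` are invented names). It counts the edges whose endpoints land on different machines, which is the quantity a partitioner tries to minimize.

```python
# Hypothetical sketch: measuring the cut of a partition assignment.
# partition_of maps each user (vertex) to a machine (partition);
# edges lists the "follows" connections between users.

def edge_cut(partition_of, edges):
    """Count edges whose endpoints live on different machines."""
    return sum(1 for u, v in edges if partition_of[u] != partition_of[v])

# Example: Alice and Bob on the same machine, Carol elsewhere.
partition_of = {"alice": 0, "bob": 0, "carol": 1}
edges = [("alice", "bob"), ("alice", "carol")]
print(edge_cut(partition_of, edges))  # -> 1 (only alice-carol crosses)
```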
  • 5. Example – graph partitioning
    • Many edges cross partitions
    • 6. Accessing a vertex’s neighbors requires accessing many partitions
    • 7. In a social network, requesting updates from followed users requires connecting to many machines
    • 8. Far fewer edges cross partitions
    • 9. Accessing a vertex’s neighbors requires accessing few partitions
    • 10. In a social network, fewer connections are made and related user data can be pre-fetched
  • Key-Key-Value Stores
    Our proposed approach: extend the key-value model
    Data can be stored as key-values
    User profiles
    Data can also be stored as key-key-values
    User connections
    “Alice follows Bob”
    Use key-key-values to compute locality
    On-line graph partitioning algorithm
    Assign keys to grid locations based on connections
    Each grid cell represents a data host
    Keys that are related are kept together
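As a rough, hypothetical illustration of the data model sketched above (not the authors' implementation), a key-key-value store keeps ordinary key-values for profile data and key-key-values for connections; the neighbor set derived from the key-key-values is what the partitioner uses to compute locality.

```python
# Minimal, single-node sketch of the key-key-value data model.
# kv holds per-user data; kkv holds relationship records such as "follows".

class KKVStore:
    def __init__(self):
        self.kv = {}       # key -> value (e.g. user profile)
        self.kkv = {}      # (key1, key2) -> value (e.g. "follows")

    def put(self, key, value):
        self.kv[key] = value

    def put_kk(self, key1, key2, value):
        self.kkv[(key1, key2)] = value

    def neighbors(self, key):
        """Keys connected to `key` -- the input to locality decisions."""
        return {k2 for (k1, k2) in self.kkv if k1 == key} | \
               {k1 for (k1, k2) in self.kkv if k2 == key}

store = KKVStore()
store.put("alice", {"name": "Alice"})
store.put_kk("alice", "bob", "follows")
print(store.neighbors("alice"))  # -> {'bob'}
```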
  • 11. Outline
    Introduction
    Data in Social Networks
    Leveraging Locality
    Key-Key-Value Stores
    System Model
    Client API
    Adding a Key-Key-Value
    Load management
    On-line partitioning algorithm
    Simulation Parameters
    Results
    Conclusion
  • 12. Address Table: Mapping Store
    • a transactional, distributed hash table
    • 13. maps keys to virtual machines
    Physical Layer: Physical machines
    • can be added or removed dynamically as demands change
    Logical Layer: Virtual machines
    • Organized as a square grid
    • 14. Run the KKV store software
    • 15. Manage replication
    • 16. Can be moved between physical machines as needed
    Application Layer: Client API
    • maintain client sessions
    • 17. cached data
    [Figure: system layers, from application sessions down through the address table and virtual hosts to physical hosts]
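To ground the layering above, here is a minimal sketch, assuming an in-process dict in place of the transactional, distributed hash table that the system actually uses for the Address Table; cells are 1-indexed (row, column) positions in the square grid of virtual nodes, as in the later slides.

```python
# Illustrative address table: keys -> (row, col) cells of a square grid
# of virtual nodes. The slides describe this as a transactional,
# distributed hash table; a plain dict stands in for it here.

class AddressTable:
    def __init__(self, grid_size):
        self.grid_size = grid_size   # the grid is grid_size x grid_size
        self.location = {}           # key -> (row, col) of a virtual node

    def lookup(self, key):
        return self.location.get(key)

    def move(self, key, cell):
        """Record that `key` now lives on the virtual node at `cell`."""
        row, col = cell
        assert 1 <= row <= self.grid_size and 1 <= col <= self.grid_size
        self.location[key] = cell

table = AddressTable(grid_size=8)
table.move("alice", (1, 1))
table.move("bob", (8, 8))
print(table.lookup("alice"))   # -> (1, 1), as in the slide 19 example
```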
  • 18. Client API and Sessions
    Clients use a simple API that includes the get, put and sync commands
    Data is pulled from the logical layer in blocks
    Groups of related keys
    The client API keeps data in an in-memory cache
    Data is pushed out asynchronously to virtual nodes in blocks
    Push/pull can be done synchronously if requested by the client
    Offers stronger consistency at the cost of performance
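A hedged sketch of the client session behaviour described above, with invented names (`pull_block`, `push_block`) standing in for the real logical-layer interface: reads pull blocks of related keys into an in-memory cache, writes are buffered and pushed out later, and `sync` forces the push when stronger consistency is needed.

```python
# Illustrative client session: get/put/sync over an in-memory cache.
# FakeLogicalLayer stands in for the grid of virtual nodes; its block
# interface (pull_block/push_block) is an assumption for this sketch.

class FakeLogicalLayer:
    def __init__(self):
        self.data = {}

    def pull_block(self, key):
        # A real node would return the whole block of keys related to `key`.
        return {key: self.data.get(key)}

    def push_block(self, block):
        self.data.update(block)


class ClientSession:
    def __init__(self, store):
        self.store = store
        self.cache = {}      # in-memory cache of pulled data
        self.dirty = {}      # writes not yet pushed to the logical layer

    def get(self, key):
        if key not in self.cache:
            self.cache.update(self.store.pull_block(key))
        return self.cache.get(key)

    def put(self, key, value):
        self.cache[key] = value
        self.dirty[key] = value          # normally pushed asynchronously

    def sync(self):
        """Push buffered writes now, trading performance for consistency."""
        self.store.push_block(self.dirty)
        self.dirty = {}


session = ClientSession(FakeLogicalLayer())
session.put("alice", {"status": "hello"})
session.sync()
print(session.get("alice"))              # -> {'status': 'hello'}
```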
  • 19. Adding a key-key-value
    put(alice, bob, follows)
    Two users: Alice and Bob
    Use the Address Table to determine the virtual machine (node) that hosts Alice’s data
    Write the data to that node
    Use the Address Table to determine the node that hosts Bob’s data
    Write the same data to that node
    The on-line partitioning algorithm moves Alice’s data to Bob’s node because they are connected
    [Figure: the address table maps alice to cell (1,1) and bob to cell (8,8); each user’s virtual host stores the user’s kv data plus kkv(alice, bob, follows)]
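The put flow on this slide can be sketched roughly as follows; the dict-based `address_table` and `nodes` structures are stand-ins for the real components, and only the two writes are shown.

```python
# Illustrative put(key1, key2, value) flow for a key-key-value, following
# the steps on the slide: the record is written to the node hosting each
# endpoint. `address_table` maps keys to grid cells; `nodes` maps cells to
# per-node storage. All structures here are simplified for demonstration.

def put_kkv(address_table, nodes, key1, key2, value):
    for key in (key1, key2):
        cell = address_table[key]                # e.g. alice -> (1, 1)
        nodes[cell].append((key1, key2, value))  # write the kkv to that node

address_table = {"alice": (1, 1), "bob": (8, 8)}
nodes = {(1, 1): [], (8, 8): []}
put_kkv(address_table, nodes, "alice", "bob", "follows")
# Both alice's node and bob's node now hold kkv(alice, bob, follows);
# the on-line partitioner may later co-locate alice with bob.
print(nodes[(1, 1)], nodes[(8, 8)])
```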
  • 20. Splitting a Node
    If one node becomes overloaded, it can initiate a split
    To maintain the grid structure, nodes in the same row and column must also split
    Once the split is complete, new physical machines can be turned on
    Virtual nodes can be transferred to these new machines
    [Figure: the grid of virtual hosts before and after the split]
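One way to read the split rule above (an interpretation, not necessarily the paper's exact procedure): when the node at one cell overloads, its row and its column each double, so an n x n grid becomes (n+1) x (n+1). The sketch below only does the cell bookkeeping; moving keys and replicas is omitted.

```python
# Hypothetical grid bookkeeping for a split of the node at `overloaded`.

def split(grid_size, overloaded):
    """Map each old cell to the new cell(s) it becomes after the split."""
    srow, scol = overloaded
    def new_rows(r):
        if r < srow:  return [r]
        if r == srow: return [srow, srow + 1]   # the split row doubles
        return [r + 1]                          # later rows are renumbered
    def new_cols(c):
        if c < scol:  return [c]
        if c == scol: return [scol, scol + 1]   # the split column doubles
        return [c + 1]
    return grid_size + 1, {
        (r, c): [(nr, nc) for nr in new_rows(r) for nc in new_cols(c)]
        for r in range(1, grid_size + 1)
        for c in range(1, grid_size + 1)
    }

new_size, mapping = split(grid_size=2, overloaded=(1, 1))
print(new_size)            # -> 3
print(mapping[(1, 1)])     # -> [(1,1), (1,2), (2,1), (2,2)]: splits 4 ways
print(mapping[(2, 2)])     # -> [(3, 3)]: merely renumbered
```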
  • 21. Outline
    Introduction
    Data in Social Networks
    Leveraging Locality
    Key-Key-Value Stores
    System Model
    Client API
    Adding a Key-Key-Value
    Load management
    On-line Partitioning Algorithm
    Simulation Parameters
    Results
    Conclusion
  • 22. On-line Partitioning Algorithm
    Runs periodically in parallel on each virtual node
    Also after a split or merge
    For each key stored on a node
    Determine the number of connections (key-key-values) with keys on other nodes
    Can also be sum of edge weights
    Find the node that has the most connections
    If that node is different from the current node
    If the number of connections to that node is greater than the number of connections to the current node
    If this margin is greater than some threshold
    Move the key to the other node
    Update the address table
    Designed to work in a distributed, dynamic setting
    NOT a replacement for off-line algorithms in static settings
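A compact sketch of the decision rule from this slide, assuming each node can enumerate its local keys and, for every key, the weighted edges to other keys; the `neighbors` callback and the plain-dict `address_table` are illustrative assumptions, not the paper's interfaces.

```python
# One round of the on-line partitioning step, run by a single virtual node.
# For each local key, sum edge weights per hosting node; if another node
# beats the current one by more than `threshold`, move the key there and
# update the address table. Data structures are simplified stand-ins.

from collections import Counter

def partition_step(node, local_keys, neighbors, address_table, threshold=0):
    moves = []
    for key in list(local_keys):
        per_node = Counter()
        for other, weight in neighbors(key):
            per_node[address_table[other]] += weight  # edges can be weighted
        if not per_node:
            continue
        best_node, best = per_node.most_common(1)[0]
        gain = best - per_node[node]
        if best_node != node and gain > threshold:
            address_table[key] = best_node   # update the mapping
            moves.append((key, best_node))   # the key's data is shipped there
    return moves

# Example in the spirit of slides 23-24: a key on node (1,1) with two edges
# to keys on (2,1) and one edge to a key on (1,2) is moved to (2,1).
address_table = {"v": (1, 1), "a": (2, 1), "b": (2, 1), "c": (1, 2)}
edges = {"v": [("a", 1), ("b", 1), ("c", 1)]}
print(partition_step((1, 1), ["v"], lambda k: edges.get(k, []), address_table))
# -> [('v', (2, 1))]
```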
  • 23. Partitioning Example
    [Figure: three virtual nodes at grid cells (1,1), (2,1) and (1,2), with a vertex and its edges]
    Node    Sum(Edges)
    1,1     0
    2,1     2
    1,2     1
  • 24. Partitioning Example
    [Figure: the same three nodes after the partitioning step has moved the vertex]
  • 25. Experimental Parameters
  • 26. Partitioning Quality Results
    [Chart: percentage of edges within partitions vs. number of vertices in the graph]
    The on-line algorithm partitions as well as Kernighan-Lin
  • 27. Partitioning Performance Results
    [Chart: number of vertices moved vs. number of vertices in the graph]
    The on-line algorithm partitions 2x faster than Kernighan-Lin!
  • 28. Conclusions
    Contributions:
    A novel model for scalable graph data stores that extends the key-value model
    Key-key-value store
    A high-level system design
    A novel on-line partitioning algorithm
    Preliminary experimental results
    Our proposed algorithm shows promise in the distributed, dynamic setting
  • 29. What’s Ahead?
    Prototype system implementation
    Java, PostgreSQL
    Performance Analysis against MongoDB, Cassandra
    Sensitivity Analysis
    Cloud Deployment
  • 30. Thank You!
    Acknowledgments
    Daniel Cole, Nick Farnan, Thao Pham, Sean Snyder
    ADMT Lab, CS Department, Pitt GPSA, Pitt A&S GSO, Pitt A&S PBC