Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
Alexander G. Connor, Panos K. Chrysanthis, Alexandros Labrinidis
Advanced Data Management Technologies Laboratory, Department of Computer Science, University of Pittsburgh
Data in Social Networks
A social network manages user profiles, updates and connections. How to manage this data in a scalable way? Key-value stores offer performance under high load.
Some observations about social networks:
- A profile view usually includes data from a user's friends (spatial locality)
- A friend's profile is often visited next (temporal locality)
- Requests might ask for updates from several users
- Web pages might include pieces of several user profiles
- A single request therefore requires connecting to many machines
Connections in a Social Network (diagram: Alice and her connections)
Leveraging Locality
Can we take advantage of the connections? What if we stored connected users' profiles and data in the same place? Locality can be leveraged: the number of connections is reduced, and user data can be pre-fetched.
We can think of this as a graph partitioning problem:
- Partitions = machines
- Vertices = user profiles, including updates
- Edges = connections
- Objective: minimize the number of edges that cross partitions
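The partitioning objective above can be sketched as a small edge-cut computation. The users, edges and machine assignments below are illustrative, not from the talk:

```python
# A minimal sketch of the edge-cut objective: count "follows" edges whose
# endpoints are assigned to different machines (partitions).

def edges_crossing(edges, assignment):
    """Count edges whose endpoints live on different machines."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

# Vertices = user profiles, edges = connections
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol"), ("dave", "erin")]

# A locality-blind assignment scatters friends across machines ...
scattered = {"alice": 0, "bob": 1, "carol": 2, "dave": 1, "erin": 1}
# ... while a locality-aware one co-locates connected users.
clustered = {"alice": 0, "bob": 0, "carol": 0, "dave": 1, "erin": 1}

print(edges_crossing(edges, scattered))  # 3 edges cross partitions
print(edges_crossing(edges, clustered))  # 0 edges cross partitions
```

Minimizing this count is exactly the objective the deck assigns to the on-line partitioner.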
Example – graph partitioning: many edges cross partitions
Accessing a vertex’s neighbors requires accessing many partitions
In a social network, requesting updates from followed users requires connecting to many machines
Far fewer edges cross partitions
Accessing a vertex’s neighbors requires accessing few partitions
In a social network, fewer connections are made and related user data can be pre-fetched.

Key-Key-Value Stores
Our proposed approach: extend the key-value model.
- Data can be stored as key-values (user profiles)
- Data can also be stored as key-key-values (user connections, e.g. "Alice follows Bob")
- Key-key-values are used to compute locality via an on-line graph partitioning algorithm
- Keys are assigned to grid locations based on their connections; each grid cell represents a data host, so related keys are kept together
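A minimal in-memory sketch of the key-key-value model described above; the class and method names are illustrative, since the talk does not define a concrete API:

```python
# Key-value entries hold per-user data; key-key-value entries hold
# relationships between two keys, such as "alice follows bob".

class KKVStore:
    def __init__(self):
        self.kv = {}    # key -> value        (e.g. user profiles)
        self.kkv = {}   # (key, key) -> value (e.g. connections)

    def put(self, key, value):
        """Plain key-value: store a user's profile data."""
        self.kv[key] = value

    def put_kkv(self, key1, key2, value):
        """Key-key-value: store a relationship between two keys."""
        self.kkv[(key1, key2)] = value

    def neighbors(self, key):
        """Keys connected to `key` -- the input the partitioner needs."""
        return {b for (a, b) in self.kkv if a == key} | \
               {a for (a, b) in self.kkv if b == key}

store = KKVStore()
store.put("alice", {"name": "Alice"})
store.put("bob", {"name": "Bob"})
store.put_kkv("alice", "bob", "follows")
print(store.neighbors("alice"))  # {'bob'}
```

The `neighbors` view is what lets the store treat its contents as a graph and compute locality.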
Outline
- Introduction: Data in Social Networks, Leveraging Locality, Key-Key-Value Stores
- System Model: Client API, Adding a Key-Key-Value, Load Management
- On-line Partitioning Algorithm
- Simulation Parameters
- Results
- Conclusion
System Model
- Address Table (Mapping Store): a transactional, distributed hash table that maps keys to virtual machines
- Physical Layer: physical machines; can be added or removed dynamically as demands change
- Logical Layer: virtual machines organized as a square grid; they run the KKV store software, manage replication, and can be moved between physical machines as needed
- Application Layer: the Client API; maintains client sessions and cached data
(Diagram: application sessions -> address table -> virtual hosts -> physical hosts)
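The address table can be sketched as a hash table from keys to grid cells (virtual machines). The grid size, default placement by hashing, and method names are assumptions for illustration:

```python
# The address table maps each key to the (row, col) grid cell of the
# virtual machine hosting it; the partitioner re-points keys as they move.

GRID_SIZE = 8  # virtual machines arranged as an 8x8 square grid (assumed)

class AddressTable:
    def __init__(self):
        self.table = {}  # key -> (row, col) of its virtual machine

    def locate(self, key):
        """Return the grid cell for `key`, hashing it to a cell on first use."""
        if key not in self.table:
            h = hash(key)
            self.table[key] = (h % GRID_SIZE, (h // GRID_SIZE) % GRID_SIZE)
        return self.table[key]

    def move(self, key, cell):
        """Re-point a key after the partitioner migrates its data."""
        self.table[key] = cell

table = AddressTable()
table.move("alice", (1, 1))   # partitioner placed alice at cell (1, 1)
print(table.locate("alice"))  # (1, 1)
```

In the real design this table is transactional and distributed; a plain dict only shows the mapping it maintains.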
Client API and Sessions
- Clients use a simple API that includes the get, put and sync commands
- Data is pulled from the logical layer in blocks (groups of related keys)
- The client API keeps data in an in-memory cache
- Data is pushed out asynchronously to virtual nodes in blocks
- Push/pull can be done synchronously if requested by the client, offering stronger consistency at the cost of performance
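The client-side behavior above can be sketched as follows: reads pull whole blocks of related keys into a local cache, and writes are buffered until pushed. The block-fetch interface and class names are assumptions, not the talk's actual API:

```python
# get() pulls a whole block (the key plus related keys) into the cache,
# pre-fetching data that locality suggests will be needed next; put()
# buffers writes, and sync() pushes them synchronously.

class Client:
    def __init__(self, server):
        self.server = server   # exposes fetch_block(key) and push(dict)
        self.cache = {}        # in-memory cache of pulled blocks
        self.dirty = {}        # writes not yet pushed to virtual nodes

    def get(self, key):
        if key not in self.cache:
            self.cache.update(self.server.fetch_block(key))
        return self.cache[key]

    def put(self, key, value):
        self.cache[key] = value
        self.dirty[key] = value  # pushed out asynchronously or on sync()

    def sync(self):
        """Synchronous push: stronger consistency at a performance cost."""
        self.server.push(self.dirty)
        self.dirty = {}

class FakeServer:  # stand-in for the logical layer
    def __init__(self):
        self.data = {"alice": "profile-a", "bob": "profile-b"}
    def fetch_block(self, key):
        return dict(self.data)  # one block: the key and its related keys
    def push(self, updates):
        self.data.update(updates)

c = Client(FakeServer())
print(c.get("alice"))  # pulls the block; "bob" is now cached too
c.put("alice", "new-profile")
c.sync()
```

After the first `get`, a follow-up `c.get("bob")` is served from the cache without another round trip, which is the point of block-granularity pulls.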
Adding a Key-Key-Value
Two users: Alice and Bob. put(alice, bob, follows)
1. Use the Address Table to determine the virtual machine (node) that hosts Alice's data
2. Write the data to that node
3. Use the address table to determine the node that hosts Bob's data
4. Write the same data to that node
5. The on-line partitioning algorithm moves Alice's data to Bob's node because they are connected
(Diagram: the address table maps alice to cell (1,1) and bob to cell (8,8); kkv(alice, bob, follows) is written alongside kv(alice, ...) and kv(bob, ...) on both nodes)
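The two-node write can be sketched directly from the steps above; the address table contents and node representation are illustrative:

```python
# put(alice, bob, follows): look up each endpoint's host node in the
# address table and write the key-key-value to both nodes.

def put_kkv(address_table, nodes, key1, key2, value):
    """Write the key-key-value to both endpoints' host nodes."""
    triple = (key1, key2, value)
    cell1 = address_table[key1]      # 1. locate Alice's node
    nodes[cell1].append(triple)      # 2. write the data there
    cell2 = address_table[key2]      # 3. locate Bob's node
    if cell2 != cell1:
        nodes[cell2].append(triple)  # 4. write the same data there
    return cell1, cell2

address_table = {"alice": (1, 1), "bob": (8, 8)}
nodes = {(1, 1): [], (8, 8): []}
put_kkv(address_table, nodes, "alice", "bob", "follows")
print(nodes[(1, 1)])  # [('alice', 'bob', 'follows')]
print(nodes[(8, 8)])  # [('alice', 'bob', 'follows')]
```

Storing the triple at both endpoints is what later lets each node's partitioner count its local keys' remote connections without extra lookups.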
Splitting a Node
- If one node becomes overloaded, it can initiate a split
- To maintain the grid structure, nodes in the same row and column must also split
- Once the split is complete, new physical machines can be turned on
- Virtual nodes can be transferred to these new machines
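The grid invariant behind the split can be sketched as follows; the grid representation is illustrative and load-sharing details are omitted:

```python
# When the node at (r, c) splits, every node in row r and column c must
# split too, so the layout remains a full rectangular grid.

def cells_that_split(rows, cols, r, c):
    """All cells forced to split when (r, c) initiates a split."""
    return [(i, j) for i in range(rows) for j in range(cols)
            if i == r or j == c]

def split(rows, cols, r, c):
    """After the split, the grid has gained one row and one column."""
    return rows + 1, cols + 1

rows, cols = 2, 2                          # 4 virtual nodes
print(cells_that_split(rows, cols, 0, 0))  # [(0, 0), (0, 1), (1, 0)]
rows, cols = split(rows, cols, 0, 0)
print(rows * cols)                         # 9 nodes in the new grid
```

The row-and-column rule is the price of keeping the grid rectangular, which the deck's notes say the system relies on for replication, locking and messaging.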
Outline (revisited)
- Introduction: Data in Social Networks, Leveraging Locality, Key-Key-Value Stores
- System Model: Client API, Adding a Key-Key-Value, Load Management
- On-line Partitioning Algorithm
- Simulation Parameters
- Results
- Conclusion

  • 22.
    On-line Partitioning Algorithm
    Runs periodically, in parallel, on each virtual node (also after a split or merge).
    For each key stored on a node:
    - Determine the number of connections (key-key-values) to keys on other nodes; this can also be the sum of edge weights
    - Find the node that has the most connections
    - If that node is different from the current node, the number of connections to it exceeds the number of connections to the current node, and this margin is greater than some threshold: move the key to the other node and update the address table
    Designed to work in a distributed, dynamic setting; NOT a replacement for off-line algorithms in static settings.
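One pass of the per-node heuristic above can be sketched as follows; the graph representation and threshold value are illustrative:

```python
# For each local key, count its connections per hosting node and move it
# to the node holding most of its neighbors, if the gain over staying
# exceeds a threshold. Edge weights could be summed instead of counted.

def partition_step(node, keys_on_node, connections, assignment, threshold=1):
    """Run one on-line partitioning pass for the keys on `node`."""
    moves = {}
    for key in keys_on_node:
        counts = {}  # hosting node -> number of connections
        for neighbor in connections.get(key, []):
            host = assignment[neighbor]
            counts[host] = counts.get(host, 0) + 1
        if not counts:
            continue
        best = max(counts, key=counts.get)
        local = counts.get(node, 0)
        if best != node and counts[best] - local > threshold:
            moves[key] = best
    for key, dest in moves.items():
        assignment[key] = dest  # i.e. update the address table
    return moves

# alice sits on node A, but both her neighbors live on node B
assignment = {"alice": "A", "bob": "B", "carol": "B"}
connections = {"alice": ["bob", "carol"], "bob": ["alice"], "carol": ["alice"]}
moves = partition_step("A", ["alice"], connections, assignment, threshold=1)
print(moves)  # {'alice': 'B'}
```

Because each node only inspects its own keys and their neighbor hosts, the pass can run in parallel on every node of a changing graph, which is what distinguishes it from off-line partitioners like Kernighan-Lin.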
  • 26.
    Partitioning Quality Results (plot: % of edges within a partition vs. number of vertices in the graph). The on-line algorithm partitions as well as Kernighan-Lin.
  • 27.
    Partitioning Performance Results (plot: vertices moved vs. number of vertices in the graph). The on-line algorithm partitions 2x faster than Kernighan-Lin!
  • 28.
    Conclusions
    Contributions:
    - A novel model for scalable graph data stores that extends the key-value model: the key-key-value store
    - A high-level system design
    - A novel on-line partitioning algorithm
    - Preliminary experimental results
    Our proposed algorithm shows promise in the distributed, dynamic setting.
  • 29.
    What's Ahead?
    - Prototype system implementation (Java, PostgreSQL)
    - Performance analysis against MongoDB, Cassandra
    - Sensitivity analysis
    - Cloud deployment
  • 30.
    Thank You!
    Acknowledgments: Daniel Cole, Nick Farnan, Thao Pham, Sean Snyder; ADMT Lab, CS Department, Pitt; GPSA, Pitt A&S GSO, Pitt A&S PBC

Editor's Notes

  • #11 Two users: Alice and Bob. Put command: store "Alice follows Bob". Use the Address Table to determine the virtual machine (node) that hosts Alice's data; write the data to that node. Use the address table to determine the node that hosts Bob's data; write the same data to that node. The on-line partitioning algorithm moves Alice's data to Bob's node because they are connected.
  • #12 Nodes in the logical layer have to handle varying demands. If one node becomes overloaded, it can initiate a split; to maintain the grid structure, nodes in the same row and column must also split. The grid is used for replication, and for efficient locking and messaging. Once the split is complete, new physical machines can be turned on and virtual nodes transferred to them. Similarly, as load decreases, virtual nodes can be transferred off of physical machines; some physical machines can then be shut down to save power, and virtual nodes can be merged back together.
  • #14 Works by improving partitions rather than creating them from scratch. On-line means that it works with a changing graph, whose structure frequently changes.
  • #15 The algorithm runs in parallel on each node: when a split or merge occurs, and when load is below a threshold. Each vertex is considered in turn: find the number of edges to each node (edges can be weighted), then find the node with the greatest number of edges; if it differs from the current node and the gain is greater than a threshold, move the vertex.