Distributed Graph Summarization
Aftab Alam
Department of Computer Engineering, Kyung Hee University
12 Jun 2017

References:
• Navlakha, S., et al. (2008, June). Graph summarization with bounded error. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM.
• Khan, K., et al. (2015). Set-based approximate approach for lossless graph summarization. Computing, 97(12).
• Liu, X., et al. (2014, November). Distributed graph summarization. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM). ACM.
Contents
1. Introduction
2. Graph Summarization with Bounded Error
3. Distributed Graph Summarization
4. Challenges in DGS
5. Solution
6. Experimental Evaluation
7. Conclusion
Graph Summarization with Bounded Error
• Many interactions can be represented as graphs
– Webgraphs:
o search engine, etc.
– Social networks:
o mine user communities, viral marketing
– Email exchanges:
o security: virus spread, spam detection
– Market basket data:
o customer profiles, targeted advertising
– Netflow graphs (which IPs talk to each other):
o traffic patterns, security, worm attacks
• Need to compress, understand
– Webgraph ~ 50 billion edges; social networks ~ a few million, growing quickly
– Compression reduces size to one-tenth (webgraphs)
• Graph summarization is NP-hard
Large Graphs
Our Approach
• Graph Compression (reference encoding)
– Not applicable to all graphs: uses URLs and node labels for compression
– Resulting structure is hard to visualize/interpret
• Graph Clustering
– Nice summary, works for generic graphs
– No compression: needs as much memory as the graph itself
• MDL-based representation R = (S,C)
– S is a high-level summary graph:
o compact, highlights dominant trends, easy to visualize
– C is a set of edge corrections:
o help in reconstructing the graph
– Compression based on the MDL principle:
o minimize the cost of S + C
o information-theoretic approach; parameter-free; applicable to any graph
– Novel Approximate Representation:
o reconstructs graph with bounded error (є);
o results in better compression
How do we compress?
• Compression possible (S)
– Many nodes with similar neighborhoods
o Communities in social networks
o link-copying in webpages
– Collapse
o such nodes into supernodes
o and the edges into superedges
o a bipartite subgraph becomes two supernodes and a superedge
o a clique becomes a supernode with a “self-edge”
• Need to correct mistakes (C)
– Most superedges are not complete
o Nodes don’t have exact same neighbors:
 friends in social networks
– Remember edge-corrections
o Edges not present in superedges
 (-ve corrections)
o Extra edges not counted in superedges
 (+ve corrections)
• Minimize overall storage cost = S+C
How do we compress?
• Summary S(VS, ES)
– Each supernode v represents a set of nodes Av
– Each superedge (u,v) represents all pairs of edges πuv = Au × Av
• Corrections C: {(a,b) : a and b are nodes of G}
• Supernodes are key; superedges/corrections are easy
– Auv = actual edges of G between Au and Av
– Cost with superedge (u,v) = 1 + |πuv – Auv|
– Cost without superedge (u,v) = |Auv|
– Choosing the minimum decides whether superedge (u,v) is in S (see the sketch below)
Representation Structure R=(S,C)
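To make the cost rule concrete, here is a minimal Python sketch (not code from the paper) of the decision for a single supernode pair; members maps each supernode to its node set Av, and graph_edges is assumed to be a set of frozenset node pairs of G.

from itertools import product

def superedge_cost(u, v, members, graph_edges):
    """Return (cost, keep_superedge) for the supernode pair (u, v)."""
    pi_uv = {frozenset(p) for p in product(members[u], members[v]) if p[0] != p[1]}  # πuv
    a_uv = {p for p in pi_uv if p in graph_edges}        # Auv: pairs that are real edges of G
    cost_with = 1 + len(pi_uv) - len(a_uv)               # superedge plus negative corrections
    cost_without = len(a_uv)                             # positive corrections only
    return min(cost_with, cost_without), cost_with <= cost_without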
• Reconstructing the graph from R (sketched below)
– For all superedges (u,v) in S, insert all pairs of edges πuv
– For all +ve corrections +(a,b), insert edge (a,b)
– For all -ve corrections -(a,b), delete edge (a,b)
Representation Structure R=(S,C)
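A minimal reconstruction sketch, under the same assumed data layout (summary_edges = ES, members maps supernodes to Av, corrections is a list of ('+'/'-', a, b) tuples); it is illustrative only, not the authors' implementation.

from itertools import product

def reconstruct(summary_edges, members, corrections):
    edges = set()
    for u, v in summary_edges:                          # expand every superedge into all pairs
        edges |= {frozenset(p) for p in product(members[u], members[v]) if p[0] != p[1]}
    for sign, a, b in corrections:
        if sign == '+':
            edges.add(frozenset((a, b)))                # edge missed by the superedges
        else:
            edges.discard(frozenset((a, b)))            # spurious edge implied by a superedge
    return edges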
• Compressed graph
– MDL representation R=(S,C); є-representation
• Computing R=(S,C)
– GREEDY
– RANDOMIZED
Outline
• Cost of merging supernodes u and v into a single supernode w
– Recall: cost of a superedge (u,x):
o c(u,x) = min{ |πux – Aux| + 1, |Aux| }
– cu = sum of the costs of all its edges = Σx c(u,x)
– s(u,v) = (cu + cv – cw) / (cu + cv) (see the sketch below)
• Main idea:
– recursive bottom-up merging of supernodes
– If s(u,v) > 0, merging u and v reduces the cost of the representation
– Normalizing the cost removes the bias towards high-degree nodes
– Making supernodes is the key:
o superedges and corrections can be computed later
GREEDY
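A hedged sketch of the merge score; pair_cost, merge_score, members, neighbors, and edges_between are assumed names, and the handling of w's neighborhood is simplified (self-edges of the merged supernode are ignored).

def pair_cost(nodes_a, nodes_b, edges_between):
    """c(A,B) = min(|pi_AB| - |A_AB| + 1, |A_AB|) for two sets of original nodes."""
    pi = len(nodes_a) * len(nodes_b)                 # all possible pairs
    a = edges_between(nodes_a, nodes_b)              # actual edges of G between the two sets
    return min(pi - a + 1, a)

def merge_score(u, v, members, neighbors, edges_between):
    """s(u,v) = (cu + cv - cw) / (cu + cv); neighbors[x] are the supernodes adjacent to x."""
    cu = sum(pair_cost(members[u], members[x], edges_between) for x in neighbors[u])
    cv = sum(pair_cost(members[v], members[x], edges_between) for x in neighbors[v])
    w_members = members[u] | members[v]
    w_neighbors = (neighbors[u] | neighbors[v]) - {u, v}    # simplification: ignore w's self-edge
    cw = sum(pair_cost(w_members, members[x], edges_between) for x in w_neighbors)
    return (cu + cv - cw) / (cu + cv)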
• Recall: s(u,v) = (cu + cv – cw)/(cu + cv)
• GREEDY algorithm (sketched below)
– Start with S = G
– At every step, pick the pair with the max s(.) value and merge it
– If no pair has a positive s(.) value, stop
GREEDY
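A minimal sketch of the GREEDY loop under the same assumptions; candidate_pairs(), score(), and merge() are assumed helpers (e.g., backed by the cost functions above), not the paper's API.

def greedy_summarize(candidate_pairs, score, merge):
    """candidate_pairs() yields the current 2-hop supernode pairs; score((u, v)) = s(u, v)."""
    while True:
        pairs = list(candidate_pairs())
        if not pairs:
            break
        best = max(pairs, key=score)
        if score(best) <= 0:              # no merge reduces the representation cost any further
            break
        merge(*best)                      # collapse u and v into a new supernode w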
• GREEDY is slow
– Need to find the pair with (globally) max s(.) value
– Need to process all pair of nodes at a distance of 2-hops
– Every merge changes the costs of all pairs involving the neighbors of the new supernode w (Nw)
• Main idea: light weight randomized procedure
– Instead of choosing the globally best pair,
– Choose (randomly) a node u
– Merge the best pair containing u
RANDOMIZED
• Unfinished set U = VG
• At every step,
– randomly pick a node u from U
– find the node v (within 2 hops of u) with the max s(u,v) value
– If s(u,v) > 0, merge u and v into w and put w in U
– Else remove u from U
• Repeat until U is empty (see the sketch below)
RANDOMIZED
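A hedged sketch of the RANDOMIZED variant; two_hop_neighbors(), score(), and merge() are assumed helpers.

import random

def randomized_summarize(nodes, two_hop_neighbors, score, merge):
    unfinished = set(nodes)                           # U = VG
    while unfinished:
        u = random.choice(tuple(unfinished))
        candidates = two_hop_neighbors(u)
        v = max(candidates, key=lambda x: score(u, x), default=None)
        if v is not None and score(u, v) > 0:
            w = merge(u, v)                           # merge u and v into w
            unfinished.discard(u)
            unfinished.discard(v)
            unfinished.add(w)                         # w stays eligible for further merges
        else:
            unfinished.discard(u)                     # u is finished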
• CNR:
– web-graph dataset
• Routeview:
– autonomous systems topology of the internet
• Wordnet:
– English words, edges between related words (synonym, similar, etc.)
• Facebook:
– social networking
Experimental set-up
Cost Reduction (CNR dataset)
Comparison with other schemes & Cost Breakup
– About 80% of the representation cost is due to corrections
– The proposed techniques give much better compression than the other schemes
Distributed Graph Summarization
• All existing works on graph summarization are single-process solutions,
– and as a result cannot scale to large graphs.
• Introduces three distributed graph summarization algorithms:
– DistGreedy
– DistRandom
– DistLSH
• Nodes and edges are distributed across different machines
– requires message passing and
– careful coordination across multiple nodes
• Fully distributed graph summarization
– computation should be fully distributed across machines to achieve efficient parallelization
• Minimizing computation and communication costs
– smart techniques are needed to avoid unnecessary communication & computation
Challenges in Distributed Summarization
• Proposes three distributed algorithms for large-scale graph summarization
• Implemented on top of Apache Giraph
– an open-source distributed graph processing platform
• Dist-Greedy
– examines all pairs of nodes within 2-hop distance
– and thus incurs a large amount of computation and communication cost.
• Dist-Random
– reduces the number of examined node pairs using random selection,
– but the randomness hurts the effectiveness of the algorithm.
• Dist-LSH
– selects merge candidates via locality-sensitive hashing, grouping super-nodes with similar neighborhoods.
Solution
• Input graph G = (V, E)
• A summary graph for G is S(G) = (VS, ES).
• The summary S(G) is an aggregated graph, in which
– VS = {V1, V2, ..., Vk} is a partition of the nodes in V.
• Each Vi is a super-node,
– representing an aggregation of a subset of the original nodes.
– V(v) denotes the super-node that an original node v belongs to.
• Superedge:
– Each (Vi, Vj) ∈ ES is called a superedge,
– representing all-to-all connections between nodes in Vi and nodes in Vj
• Errors in the summary graph
– The connection error for a pair of super-nodes Vi and Vj compares the all-to-all connections implied by the summary with the actual edges of G between Vi and Vj; extra or missing node pairs count as errors (see the sketch below).
Preliminaries
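The slide's error formula did not survive extraction, so the sketch below encodes one natural reading of the per-pair connection error (the cheaper of "draw the superedge" vs. "omit it"); treat it as an assumption, not a verbatim formula from the paper.

def connection_error(size_i, size_j, conn_ij, self_pair=False):
    """size_*: number of original nodes in each super-node; conn_ij: original edges between them."""
    total_pairs = size_i * (size_i - 1) // 2 if self_pair else size_i * size_j
    with_superedge = total_pairs - conn_ij        # pairs implied by the superedge but absent in G
    without_superedge = conn_ij                   # edges of G that the summary would drop
    return min(with_superedge, without_superedge)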
• Given a graph G
– and a desired number of super-nodes k,
– compute a summary graph S(G) with k super-nodes,
– such that the summary error is minimized.
• Graph summarization is NP-hard
– The difficult part is determining the super-nodes VS
– Once the super-nodes are decided,
o constructing the super-edges with minimum summary error can be done in polynomial time.
Preliminaries > Graph Summarization Problem
• Giraph is an open source implementation of Pregel
• Supports
– Iterative algorithms and
– vertex-to-vertex communication in a distributed graph
• A Giraph program consists of
– an input step (graph initialization),
– followed by a sequence of iterations (called supersteps),
– and an output step.
• Vertex-centric model
– Each vertex
o is considered an independent computing unit
o has a unique id and a set of outgoing edges
o carries application-dependent attributes of the vertex and its edges
– (A simplified model of this vertex-centric execution loop is sketched below.)
GIRAPH OVERVIEW
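For intuition only, here is a toy Python simulation of the vertex-centric superstep loop; this is pseudocode in the Pregel spirit, not the real Giraph Java API.

def run_supersteps(vertices, compute):
    """vertices: objects with .id and .active; compute(vertex, msgs, superstep, send) may halt a vertex."""
    inbox = {v.id: [] for v in vertices}
    superstep = 0
    while any(v.active or inbox[v.id] for v in vertices):
        outbox = {v.id: [] for v in vertices}
        def send(target_id, message):
            outbox[target_id].append(message)        # delivered in the next superstep
        for v in vertices:
            msgs, inbox[v.id] = inbox[v.id], []
            if msgs:
                v.active = True                      # incoming messages reactivate a halted vertex
            if v.active:
                compute(v, msgs, superstep, send)    # compute() sets v.active = False to vote to halt
        inbox = outbox
        superstep += 1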
• Distributed graph summarization
– uses the same iterative merging mechanism as the centralized algorithms
– starting from the original graph as the summary
o each node is a super-node, and
o super-nodes are iteratively merged until k super-nodes are left.
– In the centralized algorithms this is easy:
o a single process with shared memory
o decides which pairs of super-nodes are good candidates to merge &
o performs these merge operations
– In the Giraph distributed environment,
o all decisions and operations have to be done in a distributed way
o through message passing and synchronization
o To fully utilize the parallelization,
 we need to find multiple pairs of nodes to merge, and
 simultaneously merge them in each iteration.
Main idea
• Two challenges define two crucial tasks:
– Candidates-Find task
o The Candidates-Find task decides on the pairs of super-nodes to be merged.
– Merge task
o Whereas the Merge task executes these merges
• Propose three distributed graph summarization algorithms:
– Dist-Greedy,
– Dist-Random and
– Dist-LSH
• The three algorithms share the same operations in the Merge task,
• but differ in how merge candidates are selected.
Challenges
• Each Giraph vertex
– has three attributes associated with the vertex:
o owner-id: points to the super-node this super-node has been merged into.
o size: records the number of original-graph nodes contained in this super-node.
o selfconn: the number of edges connecting the nodes inside this super-node.
– and two attributes associated with each edge:
o size: caches the number of nodes in the other adjacent super-node of the edge, to avoid an additional round of queries for this value.
o conn: the number of original-graph edges between this super-node and the neighbor.
– (A minimal data-structure sketch follows.)
Giraph vertex’s Data structure
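A minimal sketch of this per-vertex state; the field names follow the slide, while the container classes themselves are assumptions, not Giraph classes.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SuperEdge:
    size: int   # cached size of the neighboring super-node
    conn: int   # number of original edges to that neighbor

@dataclass
class SuperNode:
    owner_id: int              # super-node this one has been merged into (itself if unmerged)
    size: int                  # number of original nodes it contains
    selfconn: int              # original edges among its own nodes
    edges: Dict[int, SuperEdge] = field(default_factory=dict)  # neighbor id -> superedge data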
• Supersteps are organized into two tasks:
– the Candidates-Find task &
– the Merge task
• ExecutionPhase (aggregator)
– indicates to the COMPUTE() function the current phase.
– Based on the previous value of ExecutionPhase, we can set the right value for this aggregator in the PRESUPERSTEP function before each superstep starts.
• ActiveNodes (aggregator)
– keeps track of the number of super-nodes in the current summary.
– When the summary size is less than or equal to the required size k, the value of ExecutionPhase is set to DONE.
– In this case, every vertex votes to halt in its COMPUTE() function, and the whole program then finishes.
Overview
• How to find pairs of super-nodes as candidates to merge in
– DistGreedy
– DistRandom
– DistLSH.
• Each algorithm implements its own FindCandidates(msgs) function.
FINDING MERGE CANDIDATES
• DistGreedy
– is based on the centralized Greedy algorithm.
– It looks at super-nodes that are two hops away from each other and
– strives to find the pairs with the minimum error increase.
– To control the number of super-node pairs merged in each iteration,
o a threshold called ErrorThreshold is used
o as the cutoff for which pairs qualify as merge candidates:
o every pair with error increase < ErrorThreshold becomes a merge candidate.
– At the start, ErrorThreshold = 0 (no error allowed).
– When the number of merge candidates falls below 5% of the current summary size,
o the algorithm increases ErrorThreshold by a controllable parameter,
o called ThresholdIncrease, for the subsequent iterations (see the sketch below).
FINDING MERGE CANDIDATES > DistGreedy
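A small sketch of this ErrorThreshold schedule; the function name and surrounding loop are assumptions, while the 5% rule and ThresholdIncrease come from the slide.

def select_candidates(pair_error_increases, error_threshold, summary_size, threshold_increase):
    """pair_error_increases: iterable of ((Vi, Vj), error_increase) for 2-hop-away pairs."""
    candidates = [pair for pair, delta in pair_error_increases if delta < error_threshold]
    if len(candidates) < 0.05 * summary_size:       # too few merges possible at this cutoff
        error_threshold += threshold_increase       # relax the cutoff for subsequent iterations
    return candidates, error_threshold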
• The major task is
– to compute the actual error increase for each pair of 2-hop-away super-nodes.
• This is simple in the centralized Greedy algorithm,
• but more complex in the distributed environment,
– as the information needed to compute the error increase is distributed across different places.
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• The error increase for merging a pair of super-nodes Vi and Vj can be decomposed into 3 parts:
– Common Neighbor Error Increase: from the connections to the common neighbors of the two super-nodes
– Unique Neighbor Error Increase: from the connections to the unique neighbors of the two super-nodes
– Self Error Increase: from the self connections of the two super-nodes, plus the connection between the two super-nodes if there is any
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• Common Neighbor Error Increase
– requires the error increase associated with the connections of Vi and Vj to all their common neighbors.
– For a common neighbor, say Vp:
o the error before the merge is the sum of the connection errors of (Vi, Vp) and (Vj, Vp);
o after the merge it is the connection error of the merged super-node (size |Vi| + |Vj|, connection conn(Vi,Vp) + conn(Vj,Vp)) with respect to Vp;
o the error increase of merging Vi and Vj w.r.t. the common neighbor Vp is the difference of the two;
o the contributions of all common neighbors are computed collectively and summed.
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• Unique Neighbor Error Increase
– The computation requires only the unique neighbors of each super-node, so
– Vi and Vj can independently compute this part of the error increase.
– For a unique neighbor Vq of Vi, the connection error of (Vi, Vq) is recomputed with the enlarged super-node size |Vi| + |Vj|; the increase is the difference from the pre-merge error.
– The error increase for Vj's unique neighbors is computed in the same way.
– The total is a simple sum of the two parts.
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• Self Error Increase
– requires collaboration between Vi and Vj.
• Between the two super-nodes,
– the one with the larger id, say Vj,
– sends its selfconn to Vi.
– Then, at Vi, the self-loop error of the merged super-node is computed from selfconn(Vi) + selfconn(Vj) plus the connection between Vi and Vj, if any.
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• Finally:
– the three parts of the error increase are aggregated at the super-node with the smaller id, Vi in our example.
– This requires messages carrying
o the common-neighbor contributions,
o the unique-neighbor contributions, and
o the self-connection contribution.
• Then Vi simply tests whether
– the total error increase is below ErrorThreshold
– to decide whether the two super-nodes should be merged (see the sketch below).
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
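Putting the three parts together, here is a hedged sketch of the total error increase as I read the slides (the exact formulas were lost in extraction); connection_error mirrors the earlier sketch, and the dictionary inputs (size, selfconn, conn, neighbor sets) are assumptions.

def connection_error(size_a, size_b, conn, self_pair=False):
    pairs = size_a * (size_a - 1) // 2 if self_pair else size_a * size_b
    return min(pairs - conn, conn)

def total_error_increase(i, j, size, selfconn, conn, common_nbrs, unique_i, unique_j):
    merged = size[i] + size[j]
    # 1) common neighbors: two connections are replaced by one to the merged super-node
    common = sum(connection_error(merged, size[p], conn[i, p] + conn[j, p])
                 - connection_error(size[i], size[p], conn[i, p])
                 - connection_error(size[j], size[p], conn[j, p])
                 for p in common_nbrs)
    # 2) unique neighbors: same connection, but the super-node on one side got bigger
    unique = sum(connection_error(merged, size[q], conn[i, q])
                 - connection_error(size[i], size[q], conn[i, q]) for q in unique_i)
    unique += sum(connection_error(merged, size[q], conn[j, q])
                  - connection_error(size[j], size[q], conn[j, q]) for q in unique_j)
    # 3) self part: internal edges of the merged super-node vs. the three pre-merge terms
    inner = selfconn[i] + selfconn[j] + conn.get((i, j), 0)
    self_part = (connection_error(merged, merged, inner, self_pair=True)
                 - connection_error(size[i], size[i], selfconn[i], self_pair=True)
                 - connection_error(size[j], size[j], selfconn[j], self_pair=True)
                 - connection_error(size[i], size[j], conn.get((i, j), 0)))
    return common + unique + self_part      # merge candidate if this is below ErrorThreshold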
• DistGreedy's FindCandidates function (Algorithm 2)
– There are three phases for this function.
– Each Giraph vertex plays different roles in the computation,
– and the aggregator ExecutionPhase indicates the current superstep phase.
– First phase:
o each Giraph vertex acts as a common neighbor, Vp, for a potential merge candidate pair (Vi, Vj);
o the neighbors of Vp are all two hops away from each other,
o so Vp computes the common-neighbor error increase for every pair of its neighbors (Vi, Vj)
o and sends it to the super-node in the pair with the smaller id, Vi.
FINDING MERGE CANDIDATES
• DistGreedy - time complexity
– Let d be the average number of neighbors of a vertex;
– then the average number of 2-hop-away neighbors of a vertex is d².
– Enumerating all the distinct 2-hop-away neighbors therefore has complexity O(d²·N),
o where N is the total number of vertices.
– The computation phase
o iterates through each 1-hop neighbor Vq to compute the error increase for every 2-hop neighbor Vj,
 and thus has a time complexity of O(d³·N).
– Overall, DistGreedy's time complexity is O(d³·N).
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• DistRandom
– DistGreedy blindly examines all super-node pairs within 2 hops, which causes
o a large amount of computation and
o many network messages.
– DistRandom instead randomly selects some super-node pairs to examine.
– DistRandom also uses three supersteps:
o each super-node randomly selects one neighbor and
 sends it a message including its
» size, selfconn, and all of its neighbors' size and conn values.
o the neighbor receives the message and forwards it to a randomly chosen neighbor with an id smaller than the sender's.
o the 2-hop-away neighbor receives this message and uses it to compute the error increase; if the error increase is below ErrorThreshold, a merge decision is made.
– Time complexity is O(d·N).
FINDING MERGE CANDIDATES > DistRandom
• After the Candidates-Find task,
– the super-nodes to be merged are known.
• How do we merge super-nodes distributedly?
• For every merge,
– instead of creating a new merged super-node,
– we always reuse the super-node
o with the smaller id as the merged super-node.
• The super-node with the larger id sets its owner-id to the merged super-node,
– calls VOTETOHALT(),
– and thereby turns itself inactive.
MERGING SUPER-NODES
• Issue?
– Suppose we merge super-nodes Vi and Vj into Vi;
– the issue is that there could be another merge decision that requires Vi to be merged into Vg.
– To efficiently merge multiple super-node pairs distributedly,
– a repeatable merge-decision propagation phase is introduced to ensure all the super-nodes know whom they should eventually be merged into.
– This design decision is essential to save overall supersteps and messages,
– since a vertex id is much cheaper to propagate than real vertex data.
• The Merge task consists of four phases:
– Decision Propagation Phase
– Connection Switch Phase
– Connection Merge Phase
– State Update Phase
MERGING SUPER-NODES
• Decision Propagation Phase (sketched below)
– e.g., Vi notifies Vj and Vg notifies Vi, so each super-node learns its final merge target.
• Connection Switch Phase
– each super-node to be merged
– notifies its neighbors to update the cached neighbor information:
– self.size, self.conn,
– and all the neighbors' nbr.size and nbr.conn values.
• Connection Merge Phase
– the receivers of the connection-switch messages update their neighbor lists with the new neighbor ids.
• State Update Phase
– performs the actual merge by updating all the attributes.
MERGING SUPER-NODES
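An illustrative sketch (my interpretation, not the paper's code) of the repeatable decision propagation: owner ids are chased until they stop changing, so a chain like Vj → Vi → Vg resolves every super-node to its final target. In Giraph this happens via messages over several supersteps; the sketch below does the same pointer chasing in plain Python.

def propagate_decisions(owner):
    """owner[v] is the super-node v was told to merge into (owner[v] == v if unmerged)."""
    changed = True
    while changed:                         # one pass corresponds to one propagation round
        changed = False
        for v, o in list(owner.items()):
            if owner[o] != o:              # my owner was itself merged into someone else
                owner[v] = owner[o]        # follow the chain one more hop
                changed = True
    return owner

# Example: Vj (=2) merges into Vi (=1), and Vi merges into Vg (=0).
# propagate_decisions({0: 0, 1: 0, 2: 1}) returns {0: 0, 1: 0, 2: 0}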
• Environment
– 16-node cluster (IBM System x iDataPlex dx340)
– 32 GB RAM,
– Ubuntu Linux, Java 1.6, Giraph trunk version
• Dataset:
EXPERIMENTAL EVALUATION
• Log-scaled graph summary error histograms
– across different graph summary sizes for three real datasets.
EXPERIMENTAL EVALUATION
• Log-scaled running time histograms
– across different graph summary sizes for three real datasets.
EXPERIMENTAL EVALUATION
EXPERIMENTAL EVALUATION
Conclusion and Future work
• Presented a highly compact two-part representation
– R = (S, C) of the input graph G,
– based on the MDL principle,
– with Greedy, Randomized, and LSH-based summarization algorithms.
• The same approach has been implemented in a distributed environment (on Apache Giraph).
THANK YOU!
Questions?
