Distributed Graph Summarization
Aftab Alam
Department of Computer Engineering, Kyung Hee University
12 Jun 2017

References:
• Navlakha, S., et al. (2008, June). Graph summarization with bounded error. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM.
• Khan, K., et al. (2015). Set-based approximate approach for lossless graph summarization. Computing, 97(12).
• Liu, X., et al. (2014, November). Distributed graph summarization. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM). ACM.
Contents
1. Introduction
2. Graph Summarization with Bounded Error
3. Distributed Graph Summarization
4. Challenges in DGS
5. Solution
6. Experimental Evaluation
7. Conclusion
Graph Summarization with Bounded Error
• Many interactions can be represented as graphs
– Webgraphs:
o search engine, etc.
– Social networks:
o mine user communities, viral marketing
– Email exchanges:
o security: virus spread, spam detection
– Market basket data:
o customer profiles, targeted advertising
– Netflow graphs (which IPs talk to each other):
o traffic patterns, security, worm attacks
• Need to compress, understand
– Webgraph ~ 50 billion edges; social networks ~ a few million, growing quickly
– Compression reduces size to one-tenth (webgraphs)
• Graph summarization is NP-hard
Large Graphs
Our Approach
• Graph Compression (reference encoding)
– Not applicable to all graphs: uses URLs and node labels for compression
– Resulting structure is hard to visualize/interpret
• Graph Clustering
– Nice summary, works for generic graphs
– No compression: needs as much memory as the graph itself
• MDL-based representation R = (S,C)
– S is a high-level summary graph:
o compact, highlights dominant trends, easy to visualize
– C is a set of edge corrections:
o help in reconstructing the graph
– Compression based on the MDL principle:
o minimize the cost of S + C
o information-theoretic approach; parameter-free; applicable to any graph
– Novel Approximate Representation:
o reconstructs graph with bounded error (є);
o results in better compression
How do we compress?
• Compression possible (S)
– Many nodes with similar neighborhoods
o Communities in social networks
o link-copying in webpages
– Collapse
o such nodes into supernodes
o and the edges into superedges
o a bipartite subgraph becomes two supernodes and a superedge
o a clique becomes a supernode with a “self-edge”
• Need to correct mistakes (C)
– Most superedges are not complete
o Nodes don’t have exact same neighbors:
 friends in social networks
– Remember edge-corrections
o Edges not present in superedges
 (-ve corrections)
o Extra edges not counted in superedges
 (+ve corrections)
• Minimize overall storage cost = S+C
How do we compress?
• Summary S(VS, ES)
– Each supernode v represents a set of nodes Av
– Each superedge (u,v) represents all pairs of edges πuv = Au × Av
• Corrections C: {(a,b) : a and b are nodes of G}
• Supernodes are key; superedges/corrections are easy
– Auv = actual edges of G between Au and Av
– Cost with superedge (u,v) = 1 + |πuv – Auv|
– Cost without superedge (u,v) = |Auv|
– Choosing the minimum decides whether superedge (u,v) is in S (see the sketch below)
Representation Structure R=(S,C)
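To make the cost rule concrete, here is a minimal Python sketch (not code from the paper) of the decision for a single supernode pair; members maps each supernode to its node set Av, and graph_edges is assumed to be a set of frozenset node pairs of G.

from itertools import product

def superedge_cost(u, v, members, graph_edges):
    """Return (cost, keep_superedge) for the supernode pair (u, v)."""
    pi_uv = {frozenset(p) for p in product(members[u], members[v]) if p[0] != p[1]}  # πuv
    a_uv = {p for p in pi_uv if p in graph_edges}        # Auv: pairs that are real edges of G
    cost_with = 1 + len(pi_uv) - len(a_uv)               # superedge plus negative corrections
    cost_without = len(a_uv)                             # positive corrections only
    return min(cost_with, cost_without), cost_with <= cost_without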
• Reconstructing the graph from R (sketched below)
– For all superedges (u,v) in S, insert all pairs of edges πuv
– For all +ve corrections +(a,b), insert edge (a,b)
– For all -ve corrections -(a,b), delete edge (a,b)
Representation Structure R=(S,C)
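A minimal reconstruction sketch, under the same assumed data layout (summary_edges = ES, members maps supernodes to Av, corrections is a list of ('+'/'-', a, b) tuples); it is illustrative only, not the authors' implementation.

from itertools import product

def reconstruct(summary_edges, members, corrections):
    edges = set()
    for u, v in summary_edges:                          # expand every superedge into all pairs
        edges |= {frozenset(p) for p in product(members[u], members[v]) if p[0] != p[1]}
    for sign, a, b in corrections:
        if sign == '+':
            edges.add(frozenset((a, b)))                # edge missed by the superedges
        else:
            edges.discard(frozenset((a, b)))            # spurious edge implied by a superedge
    return edges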
• Compressed graph
– MDL representation R=(S,C); є-representation
• Computing R=(S,C)
– GREEDY
– RANDOMIZED
Outline
• Cost of merging supernodes u and v into a single supernode w
– Recall: cost of a superedge (u,x):
o c(u,x) = min{ |πux – Aux| + 1, |Aux| }
– cu = sum of the costs of all its edges = Σx c(u,x)
– s(u,v) = (cu + cv – cw) / (cu + cv) (see the sketch below)
• Main idea:
– recursive bottom-up merging of supernodes
– If s(u,v) > 0, merging u and v reduces the cost of the representation
– Normalizing the cost removes the bias towards high-degree nodes
– Making supernodes is the key:
o superedges and corrections can be computed later
GREEDY
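A hedged sketch of the merge score; pair_cost, merge_score, members, neighbors, and edges_between are assumed names, and the handling of w's neighborhood is simplified (self-edges of the merged supernode are ignored).

def pair_cost(nodes_a, nodes_b, edges_between):
    """c(A,B) = min(|pi_AB| - |A_AB| + 1, |A_AB|) for two sets of original nodes."""
    pi = len(nodes_a) * len(nodes_b)                 # all possible pairs
    a = edges_between(nodes_a, nodes_b)              # actual edges of G between the two sets
    return min(pi - a + 1, a)

def merge_score(u, v, members, neighbors, edges_between):
    """s(u,v) = (cu + cv - cw) / (cu + cv); neighbors[x] are the supernodes adjacent to x."""
    cu = sum(pair_cost(members[u], members[x], edges_between) for x in neighbors[u])
    cv = sum(pair_cost(members[v], members[x], edges_between) for x in neighbors[v])
    w_members = members[u] | members[v]
    w_neighbors = (neighbors[u] | neighbors[v]) - {u, v}    # simplification: ignore w's self-edge
    cw = sum(pair_cost(w_members, members[x], edges_between) for x in w_neighbors)
    return (cu + cv - cw) / (cu + cv)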
• Recall: s(u,v) = (cu + cv – cw)/(cu + cv)
• GREEDY algorithm (sketched below)
– Start with S = G
– At every step, pick the pair with the max s(.) value and merge it
– If no pair has a positive s(.) value, stop
GREEDY
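A minimal sketch of the GREEDY loop under the same assumptions; candidate_pairs(), score(), and merge() are assumed helpers (e.g., backed by the cost functions above), not the paper's API.

def greedy_summarize(candidate_pairs, score, merge):
    """candidate_pairs() yields the current 2-hop supernode pairs; score((u, v)) = s(u, v)."""
    while True:
        pairs = list(candidate_pairs())
        if not pairs:
            break
        best = max(pairs, key=score)
        if score(best) <= 0:              # no merge reduces the representation cost any further
            break
        merge(*best)                      # collapse u and v into a new supernode w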
• GREEDY is slow
– Need to find the pair with (globally) max s(.) value
– Need to process all pair of nodes at a distance of 2-hops
– Every merge changes the costs of all pairs involving the neighbors of the new supernode w (Nw)
• Main idea: light weight randomized procedure
– Instead of choosing the globally best pair,
– Choose (randomly) a node u
– Merge the best pair containing u
RANDOMIZED
• Unfinished set U = VG
• At every step,
– randomly pick a node u from U
– find the node v (within 2 hops of u) with the max s(u,v) value
– If s(u,v) > 0, merge u and v into w and put w in U
– Else remove u from U
• Repeat until U is empty (see the sketch below)
RANDOMIZED
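A hedged sketch of the RANDOMIZED variant; two_hop_neighbors(), score(), and merge() are assumed helpers.

import random

def randomized_summarize(nodes, two_hop_neighbors, score, merge):
    unfinished = set(nodes)                           # U = VG
    while unfinished:
        u = random.choice(tuple(unfinished))
        candidates = two_hop_neighbors(u)
        v = max(candidates, key=lambda x: score(u, x), default=None)
        if v is not None and score(u, v) > 0:
            w = merge(u, v)                           # merge u and v into w
            unfinished.discard(u)
            unfinished.discard(v)
            unfinished.add(w)                         # w stays eligible for further merges
        else:
            unfinished.discard(u)                     # u is finished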
• CNR:
– web-graph dataset
• Routeview:
– autonomous systems topology of the internet
• Wordnet:
– English words, edges between related words (synonym, similar, etc.)
• Facebook:
– social networking
Experimental set-up
Cost Reduction (CNR dataset)
Comparison with other schemes & Cost Breakup
– About 80% of the representation cost is due to corrections
– The proposed techniques give much better compression than the other schemes
Distributed Graph Summarization
• All existing works on graph summarization are single-process solutions,
– and as a result cannot scale to large graphs.
• Introduces three distributed graph summarization algorithms:
– DistGreedy
– DistRandom
– DistLSH
• Nodes and edges are distributed across different machines
– requires message passing and
– careful coordination across multiple nodes
• Fully distributed graph summarization
– computation should be fully distributed across machines to achieve efficient parallelization
• Minimizing computation and communication costs
– smart techniques are needed to avoid unnecessary communication & computation
Challenges in Distributed Summarization
• Proposes three distributed algorithms for large-scale graph summarization
• Implemented on top of Apache Giraph
– an open-source distributed graph processing platform
• Dist-Greedy
– examines all pairs of nodes within 2-hop distance
– and thus incurs a large amount of computation and communication cost.
• Dist-Random
– reduces the number of examined node pairs using random selection,
– but the randomness hurts the effectiveness of the algorithm.
• Dist-LSH
– selects merge candidates via locality-sensitive hashing, grouping super-nodes with similar neighborhoods.
Solution
• Input graph G = (V, E)
• A summary graph for G is S(G) = (VS, ES).
• The summary S(G) is an aggregated graph, in which
– VS = {V1, V2, ..., Vk} is a partition of the nodes in V.
• Each Vi is a super-node,
– representing an aggregation of a subset of the original nodes.
– V(v) denotes the super-node that an original node v belongs to.
• Superedge:
– Each (Vi, Vj) ∈ ES is called a superedge,
– representing all-to-all connections between nodes in Vi and nodes in Vj
• Errors in the summary graph
– The connection error for a pair of super-nodes Vi and Vj compares the all-to-all connections implied by the summary with the actual edges of G between Vi and Vj; extra or missing node pairs count as errors (see the sketch below).
Preliminaries
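The slide's error formula did not survive extraction, so the sketch below encodes one natural reading of the per-pair connection error (the cheaper of "draw the superedge" vs. "omit it"); treat it as an assumption, not a verbatim formula from the paper.

def connection_error(size_i, size_j, conn_ij, self_pair=False):
    """size_*: number of original nodes in each super-node; conn_ij: original edges between them."""
    total_pairs = size_i * (size_i - 1) // 2 if self_pair else size_i * size_j
    with_superedge = total_pairs - conn_ij        # pairs implied by the superedge but absent in G
    without_superedge = conn_ij                   # edges of G that the summary would drop
    return min(with_superedge, without_superedge)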
• Given a graph G
– and a desired number of super-nodes k,
– compute a summary graph S(G) with k super-nodes,
– such that the summary error is minimized.
• Graph summarization is NP-hard
– The difficult part is determining the super-nodes VS
– Once the super-nodes are decided,
o constructing the super-edges with minimum summary error can be done in polynomial time.
Preliminaries > Graph Summarization Problem
• Giraph is an open source implementation of Pregel
• Supports
– Iterative algorithms and
– vertex-to-vertex communication in a distributed graph
• A Giraph program consists of
– an input step (graph initialization),
– followed by a sequence of iterations (called supersteps),
– and an output step.
• Vertex-centric model
– Each vertex
o is considered an independent computing unit
o has a unique id and a set of outgoing edges
o carries application-dependent attributes of the vertex and its edges
– (A simplified model of this vertex-centric execution loop is sketched below.)
GIRAPH OVERVIEW
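For intuition only, here is a toy Python simulation of the vertex-centric superstep loop; this is pseudocode in the Pregel spirit, not the real Giraph Java API.

def run_supersteps(vertices, compute):
    """vertices: objects with .id and .active; compute(vertex, msgs, superstep, send) may halt a vertex."""
    inbox = {v.id: [] for v in vertices}
    superstep = 0
    while any(v.active or inbox[v.id] for v in vertices):
        outbox = {v.id: [] for v in vertices}
        def send(target_id, message):
            outbox[target_id].append(message)        # delivered in the next superstep
        for v in vertices:
            msgs, inbox[v.id] = inbox[v.id], []
            if msgs:
                v.active = True                      # incoming messages reactivate a halted vertex
            if v.active:
                compute(v, msgs, superstep, send)    # compute() sets v.active = False to vote to halt
        inbox = outbox
        superstep += 1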
• Distributed graph summarization
– uses the same iterative merging mechanism as the centralized algorithms
– starting from the original graph as the summary
o each node is a super-node, and
o super-nodes are iteratively merged until k super-nodes are left.
– In the centralized algorithms this is easy:
o a single process with shared memory
o decides which pairs of super-nodes are good candidates to merge &
o performs these merge operations
– In the Giraph distributed environment,
o all decisions and operations have to be done in a distributed way
o through message passing and synchronization
o To fully utilize the parallelization,
 we need to find multiple pairs of nodes to merge, and
 simultaneously merge them in each iteration.
Main idea
• Two challenges define two crucial tasks:
– Candidates-Find task
o The Candidates-Find task decides on the pairs of super-nodes to be merged.
– Merge task
o Whereas the Merge task executes these merges
• Propose three distributed graph summarization algorithms:
– Dist-Greedy,
– Dist-Random and
– Dist-LSH
• The three algorithms share the same operations in the Merge task,
• but differ in how merge candidates are selected.
Challenges
• Each Giraph vertex
– has three attributes associated with the vertex:
o owner-id: points to the super-node this super-node has been merged into.
o size: records the number of original-graph nodes contained in this super-node.
o selfconn: the number of edges connecting the nodes inside this super-node.
– and two attributes associated with each edge:
o size: caches the number of nodes in the other adjacent super-node of the edge, to avoid an additional round of queries for this value.
o conn: the number of original-graph edges between this super-node and the neighbor.
– (A minimal data-structure sketch follows.)
Giraph vertex’s Data structure
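A minimal sketch of this per-vertex state; the field names follow the slide, while the container classes themselves are assumptions, not Giraph classes.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SuperEdge:
    size: int   # cached size of the neighboring super-node
    conn: int   # number of original edges to that neighbor

@dataclass
class SuperNode:
    owner_id: int              # super-node this one has been merged into (itself if unmerged)
    size: int                  # number of original nodes it contains
    selfconn: int              # original edges among its own nodes
    edges: Dict[int, SuperEdge] = field(default_factory=dict)  # neighbor id -> superedge data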
• Supersteps are organized into two tasks:
– the Candidates-Find task &
– the Merge task
• ExecutionPhase (aggregator)
– indicates to the COMPUTE() function the current phase.
– Based on the previous value of ExecutionPhase, we can set the right value for this aggregator in the PRESUPERSTEP function before each superstep starts.
• ActiveNodes (aggregator)
– keeps track of the number of super-nodes in the current summary.
– When the summary size is less than or equal to the required size k, the value of ExecutionPhase is set to DONE.
– In this case, every vertex votes to halt in its COMPUTE() function, and the whole program then finishes.
Overview
• How to find pairs of super-nodes as candidates to merge in
– DistGreedy
– DistRandom
– DistLSH.
• Each algorithm implements its own FindCandidates(msgs) function.
FINDING MERGE CANDIDATES
• DistGreedy
– is based on the centralized Greedy algorithm.
– It looks at super-nodes that are two hops away from each other and
– strives to find the pairs with the minimum error increase.
– To control the number of super-node pairs merged in each iteration,
o a threshold called ErrorThreshold is used
o as the cutoff for which pairs qualify as merge candidates:
o every pair with error increase < ErrorThreshold becomes a merge candidate.
– At the start, ErrorThreshold = 0 (no error allowed).
– When the number of merge candidates falls below 5% of the current summary size,
o the algorithm increases ErrorThreshold by a controllable parameter,
o called ThresholdIncrease, for the subsequent iterations (see the sketch below).
FINDING MERGE CANDIDATES > DistGreedy
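A small sketch of this ErrorThreshold schedule; the function name and surrounding loop are assumptions, while the 5% rule and ThresholdIncrease come from the slide.

def select_candidates(pair_error_increases, error_threshold, summary_size, threshold_increase):
    """pair_error_increases: iterable of ((Vi, Vj), error_increase) for 2-hop-away pairs."""
    candidates = [pair for pair, delta in pair_error_increases if delta < error_threshold]
    if len(candidates) < 0.05 * summary_size:       # too few merges possible at this cutoff
        error_threshold += threshold_increase       # relax the cutoff for subsequent iterations
    return candidates, error_threshold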
• The major task is
– to compute the actual error increase for each pair of 2-hop-away super-nodes.
• This is simple in the centralized Greedy algorithm,
• but more complex in the distributed environment,
– as the information needed to compute the error increase is distributed across different places.
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• The error increase for merging a pair of super-nodes Vi and Vj can be decomposed into 3 parts:
– Common Neighbor Error Increase: from the connections to the common neighbors of the two super-nodes
– Unique Neighbor Error Increase: from the connections to the unique neighbors of the two super-nodes
– Self Error Increase: from the self connections of the two super-nodes, plus the connection between the two super-nodes if there is any
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• Common Neighbor Error Increase
– requires the error increase associated with the connections of Vi and Vj to all their common neighbors.
– For a common neighbor, say Vp:
o the error before the merge is the sum of the connection errors of (Vi, Vp) and (Vj, Vp);
o after the merge it is the connection error of the merged super-node (size |Vi| + |Vj|, connection conn(Vi,Vp) + conn(Vj,Vp)) with respect to Vp;
o the error increase of merging Vi and Vj w.r.t. the common neighbor Vp is the difference of the two;
o the contributions of all common neighbors are computed collectively and summed.
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• Unique Neighbor Error Increase
– The computation requires only the unique neighbors of each super-node, so
– Vi and Vj can independently compute this part of the error increase.
– For a unique neighbor Vq of Vi, the connection error of (Vi, Vq) is recomputed with the enlarged super-node size |Vi| + |Vj|; the increase is the difference from the pre-merge error.
– The error increase for Vj's unique neighbors is computed in the same way.
– The total is a simple sum of the two parts.
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• Self Error Increase
– requires collaboration between Vi and Vj.
• Between the two super-nodes,
– the one with the larger id, say Vj,
– sends its selfconn to Vi.
– Then, at Vi, the self-loop error of the merged super-node is computed from selfconn(Vi) + selfconn(Vj) plus the connection between Vi and Vj, if any.
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• Finally:
– the three parts of the error increase are aggregated at the super-node with the smaller id, Vi in our example.
– This requires messages carrying
o the common-neighbor contributions,
o the unique-neighbor contributions, and
o the self-connection contribution.
• Then Vi simply tests whether
– the total error increase is below ErrorThreshold
– to decide whether the two super-nodes should be merged (see the sketch below).
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
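Putting the three parts together, here is a hedged sketch of the total error increase as I read the slides (the exact formulas were lost in extraction); connection_error mirrors the earlier sketch, and the dictionary inputs (size, selfconn, conn, neighbor sets) are assumptions.

def connection_error(size_a, size_b, conn, self_pair=False):
    pairs = size_a * (size_a - 1) // 2 if self_pair else size_a * size_b
    return min(pairs - conn, conn)

def total_error_increase(i, j, size, selfconn, conn, common_nbrs, unique_i, unique_j):
    merged = size[i] + size[j]
    # 1) common neighbors: two connections are replaced by one to the merged super-node
    common = sum(connection_error(merged, size[p], conn[i, p] + conn[j, p])
                 - connection_error(size[i], size[p], conn[i, p])
                 - connection_error(size[j], size[p], conn[j, p])
                 for p in common_nbrs)
    # 2) unique neighbors: same connection, but the super-node on one side got bigger
    unique = sum(connection_error(merged, size[q], conn[i, q])
                 - connection_error(size[i], size[q], conn[i, q]) for q in unique_i)
    unique += sum(connection_error(merged, size[q], conn[j, q])
                  - connection_error(size[j], size[q], conn[j, q]) for q in unique_j)
    # 3) self part: internal edges of the merged super-node vs. the three pre-merge terms
    inner = selfconn[i] + selfconn[j] + conn.get((i, j), 0)
    self_part = (connection_error(merged, merged, inner, self_pair=True)
                 - connection_error(size[i], size[i], selfconn[i], self_pair=True)
                 - connection_error(size[j], size[j], selfconn[j], self_pair=True)
                 - connection_error(size[i], size[j], conn.get((i, j), 0)))
    return common + unique + self_part      # merge candidate if this is below ErrorThreshold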
• DistGreedy's FindCandidates function (Algorithm 2)
– There are three phases for this function.
– Each Giraph vertex plays different roles in the computation,
– and the aggregator ExecutionPhase indicates the current superstep phase.
– First phase:
o each Giraph vertex acts as a common neighbor, Vp, for a potential merge candidate pair (Vi, Vj);
o the neighbors of Vp are all two hops away from each other,
o so Vp computes the common-neighbor error increase for every pair of its neighbors (Vi, Vj)
o and sends it to the super-node in the pair with the smaller id, Vi.
FINDING MERGE CANDIDATES
• DistGreedy - time complexity
– Let d be the average number of neighbors of a vertex;
– then the average number of 2-hop-away neighbors of a vertex is d².
– Enumerating all the distinct 2-hop-away neighbors therefore has complexity O(d²·N),
o where N is the total number of vertices.
– The computation phase
o iterates through each 1-hop neighbor Vq to compute the error increase for every 2-hop neighbor Vj,
 and thus has a time complexity of O(d³·N).
– Overall, DistGreedy's time complexity is O(d³·N).
FINDING MERGE CANDIDATES > DistGreedy (Cont’d)
• DistRandom
– DistGreedy blindly examines all super-node pairs within 2 hops, which causes
o a large amount of computation and
o many network messages.
– DistRandom instead randomly selects some super-node pairs to examine.
– DistRandom also uses three supersteps:
o each super-node randomly selects one neighbor and
 sends it a message including its
» size, selfconn, and all of its neighbors' size and conn values.
o the neighbor receives the message and forwards it to a randomly chosen neighbor with an id smaller than the sender's.
o the 2-hop-away neighbor receives this message and uses it to compute the error increase; if the error increase is below ErrorThreshold, a merge decision is made.
– Time complexity is O(d·N).
FINDING MERGE CANDIDATES > DistRandom
• After the Candidates-Find task,
– the super-nodes to be merged are known.
• How do we merge super-nodes distributedly?
• For every merge,
– instead of creating a new merged super-node,
– we always reuse the super-node
o with the smaller id as the merged super-node.
• The super-node with the larger id sets its owner-id to the merged super-node,
– calls VOTETOHALT(),
– and thereby turns itself inactive.
MERGING SUPER-NODES
• Issue?
– Suppose we merge super-nodes Vi and Vj into Vi;
– the issue is that there could be another merge decision that requires Vi to be merged into Vg.
– To efficiently merge multiple super-node pairs distributedly,
– a repeatable merge-decision propagation phase is introduced to ensure all the super-nodes know whom they should eventually be merged into.
– This design decision is essential to save overall supersteps and messages,
– since a vertex id is much cheaper to propagate than real vertex data.
• The Merge task consists of four phases:
– Decision Propagation Phase
– Connection Switch Phase
– Connection Merge Phase
– State Update Phase
MERGING SUPER-NODES
• Decision Propagation Phase (sketched below)
– e.g., Vi notifies Vj and Vg notifies Vi, so each super-node learns its final merge target.
• Connection Switch Phase
– each super-node to be merged
– notifies its neighbors to update the cached neighbor information:
– self.size, self.conn,
– and all the neighbors' nbr.size and nbr.conn values.
• Connection Merge Phase
– the receivers of the connection-switch messages update their neighbor lists with the new neighbor ids.
• State Update Phase
– performs the actual merge by updating all the attributes.
MERGING SUPER-NODES
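An illustrative sketch (my interpretation, not the paper's code) of the repeatable decision propagation: owner ids are chased until they stop changing, so a chain like Vj → Vi → Vg resolves every super-node to its final target. In Giraph this happens via messages over several supersteps; the sketch below does the same pointer chasing in plain Python.

def propagate_decisions(owner):
    """owner[v] is the super-node v was told to merge into (owner[v] == v if unmerged)."""
    changed = True
    while changed:                         # one pass corresponds to one propagation round
        changed = False
        for v, o in list(owner.items()):
            if owner[o] != o:              # my owner was itself merged into someone else
                owner[v] = owner[o]        # follow the chain one more hop
                changed = True
    return owner

# Example: Vj (=2) merges into Vi (=1), and Vi merges into Vg (=0).
# propagate_decisions({0: 0, 1: 0, 2: 1}) returns {0: 0, 1: 0, 2: 0}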
• Environment
– 16-node cluster (IBM System x iDataPlex dx340)
– 32 GB RAM,
– Ubuntu Linux, Java 1.6, Giraph trunk version
• Dataset:
EXPERIMENTAL EVALUATION
• Log-scaled graph summary error histograms
– across different graph summary sizes for three real datasets.
EXPERIMENTAL EVALUATION
• Log-scaled running time histograms
– across different graph summary sizes for three real datasets.
EXPERIMENTAL EVALUATION
EXPERIMENTAL EVALUATION
Conclusion and Future work
• Presented a highly compact two-part representation
– R = (S, C) of the input graph G,
– based on the MDL principle,
– with Greedy, Randomized, and LSH-based summarization algorithms.
• The same approach has been implemented in a distributed environment (on Apache Giraph).
THANK YOU!
Questions?
