# Graph Summarization with Quality Guarantees

Given a large graph, the authors aim at producing a concise lossy representation (a summary) that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation.


1. Graph Summarization with Quality Guarantees. Matteo Riondato – Labs, Two Sigma Investments LP. DEI – UNIPD – September 26, 2016
2. Who am I? Laurea in Ing. Informazione and Laurea Spec. in Ing. Informatica @ DEI-UNIPD; Ph.D. in Computer Science at Brown U. with Eli Upfal; working at Labs, Two Sigma Investments LP (Research Scientist) and at the CS Dept., Brown U. (Visiting Asst. Prof.); research in algorithmic data science: algorithms plus data, algorithms for data (it used to be data mining, but then data miners forgot about algorithms...); tweeting @teorionda; “living” at http://matteo.rionda.to.
3. What is my approach to research? Not this: Theory → Practice
4. What is my approach to research? Not this: Theory ← Practice
5. What is my approach to research? Not this: Theory ↔ Practice
6. What is my approach to research? Not this: Theory + Practice
7. What is my approach to research? This: Theory × Practice. Marry theory and practice: make them one, and leverage their individual and combined strengths! We can’t look at them separately if we want to solve larger-than-anything data problems.
8. What is summarization? Motivation: speeding up the answering of queries on a large graph G = ([n], E). Examples of queries: “What’s the degree of v ∈ [n]?”, “How many triangles are there in G?”. Summarization: the creation and use of a summary data structure to answer the queries. Summary: a lossy representation of G that fits in main memory (i.e., a sketch). We also need algorithms that use the summary to approximately answer the queries, and we want a general-purpose summary, not one specialized for a single query. Summarization is not: • graph compression: storing a lossless compressed version of the graph on disk; • graph partitioning: partitioning the nodes into k classes while minimizing the number of edges from one class to another.
9. What am I going to talk about? We present the first constant-factor approximation algorithm for building graph summaries according to natural error measures. Among the contributions: a practical use of extremal graph theory, with the cut norm and the algorithmic version of Szemerédi’s Regularity Lemma. Joint work with Francesco Bonchi (ISI / Eurecat) and David García-Soriano (Eurecat). Appeared at IEEE ICDM’14; extended version in Data Mining and Knowledge Discovery, in press. [Slide shows the first pages of the ICDM’14 paper and of the DMKD article, “Graph Summarization with Quality Guarantees” by Riondato, García-Soriano, and Bonchi.]
10. What does a summary look like? G = ([n], E): an undirected weighted graph with adjacency matrix AG. Summary S of G: a complete weighted graph with self-loops, S = (V, V × V). The supernodes Vi, i = 1, …, k, in V form a partitioning of [n]. The superedge weights are the edge densities between the corresponding sets of nodes: dG(i, j) = dG(Vi, Vj) = eG(Vi, Vj) / (|Vi||Vj|), where eG(S, T) = Σ_{i∈S, j∈T} AG(i, j) for all S, T ⊆ V. Technical point: if S and T overlap, each non-loop edge with both endpoints in S ∩ T is counted twice in eG(S, T).
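The density computation on this slide can be sketched directly in code. A minimal NumPy sketch; the 3-node path graph and its partition below are made-up inputs for illustration:

```python
import numpy as np

def density_matrix(A, parts):
    """Density matrix A_S of the summary induced by the partition `parts`:
    d_G(i, j) = e_G(V_i, V_j) / (|V_i| * |V_j|), with
    e_G(S, T) = sum over u in S, w in T of A[u, w]."""
    k = len(parts)
    D = np.empty((k, k))
    for i, Vi in enumerate(parts):
        for j, Vj in enumerate(parts):
            D[i, j] = A[np.ix_(Vi, Vj)].sum() / (len(Vi) * len(Vj))
    return D

# hypothetical example: the path 0-1-2, summarized as {0, 1} and {2}
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D = density_matrix(A, [[0, 1], [2]])
# d({0,1},{0,1}) = 2/4 (the edge 0-1 is counted in both directions)
# d({0,1},{2})   = 1/2
```

Note how the “counted twice” technical point shows up already for the internal density of {0, 1}.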
11. Shall we look at an example? A graph with 7 nodes, and a summary of it with 3 supernodes. [Figure: the graph on nodes 1–7 next to its summary with supernodes {1,2}, {3,4,5}, {6,7}; internal densities 0, 7/9, 3/4, and cross densities 1/3 ({1,2}–{3,4,5}), 1/3 ({3,4,5}–{6,7}), 1/4 ({1,2}–{6,7}).]
12. How can we answer queries using a summary? Preliminary question: what is the desired semantics of an answer computed on the summary? A summary S is a lossy representation of G, compatible with many other graphs. We adopt an expected-value semantics for query answers: let π be a distribution over the set of graphs compatible with S; given a query q and S, we want answer(q, S) = E_π[answer(q, G′)], where G′ is a graph compatible with S. We most often use the uniform distribution as π.
13. What is the lifted adjacency matrix? It is the key object used to compute query answers and to evaluate the quality of the summary. Preliminaries: given a summary S, let AS be the k × k matrix such that AS(i, j) = dG(i, j); given a node v ∈ [n], let s(v) be the unique supernode of S containing v. The lifted adjacency matrix A↑S ∈ R^{n×n} of S is defined by A↑S(u, w) = AS(s(u), s(w)). A↑S has a nice regular structure (example coming up). We call matrices like A↑S “P-constant”, where P is the partitioning of [n] used in S.
14. What does a lifted adjacency matrix look like? For the running example (supernodes {1,2}, {3,4,5}, {6,7}), the lifted adjacency matrix is:

|   | 1   | 2   | 3   | 4   | 5   | 6   | 7   |
|---|-----|-----|-----|-----|-----|-----|-----|
| 1 | 0   | 0   | 1/3 | 1/3 | 1/3 | 1/4 | 1/4 |
| 2 | 0   | 0   | 1/3 | 1/3 | 1/3 | 1/4 | 1/4 |
| 3 | 1/3 | 1/3 | 7/9 | 7/9 | 7/9 | 1/3 | 1/3 |
| 4 | 1/3 | 1/3 | 7/9 | 7/9 | 7/9 | 1/3 | 1/3 |
| 5 | 1/3 | 1/3 | 7/9 | 7/9 | 7/9 | 1/3 | 1/3 |
| 6 | 1/4 | 1/4 | 1/3 | 1/3 | 1/3 | 3/4 | 3/4 |
| 7 | 1/4 | 1/4 | 1/3 | 1/3 | 1/3 | 3/4 | 3/4 |
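The lifting is just index replication: every entry (u, w) reads the density of the pair of supernodes containing u and w. A short NumPy sketch, using the density matrix of the running example, reproduces the table above:

```python
import numpy as np

def lift(A_S, s):
    """Lifted adjacency matrix: A_lift[u, w] = A_S[s(u), s(w)],
    where s[v] is the index of the supernode containing v."""
    s = np.asarray(s)
    return A_S[np.ix_(s, s)]

# density matrix of the example: supernodes {1,2}, {3,4,5}, {6,7}
A_S = np.array([[0.,  1/3, 1/4],
                [1/3, 7/9, 1/3],
                [1/4, 1/3, 3/4]])
s = [0, 0, 1, 1, 1, 2, 2]   # supernode of each vertex 1..7 (0-indexed)
A_up = lift(A_S, s)
# A_up is the 7x7 matrix of the table: e.g. A_up[2, 3] = 7/9, A_up[0, 5] = 1/4
```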
15. How do we use the lifted adjacency matrix to answer queries?

| Query | Answer |
|---|---|
| Adjacency of u and w | A↑S(u, w) |
| Degree of u | Σ_{i=1}^{n} A↑S(u, i) |
| Eigenvector centrality of u | Σ_{i=1}^{n} A↑S(u, i) / (2|E|) |
| No. of triangles in G | Σ_{i=1}^{k} [ C(|Vi|, 3)·A↑S(i, i)³ + Σ_{j=i+1}^{k} ( A↑S(i, j)²·( C(|Vi|, 2)·|Vj|·A↑S(i, i) + C(|Vj|, 2)·|Vi|·A↑S(j, j) ) + Σ_{w=j+1}^{k} |Vi||Vj||Vw|·A↑S(i, j)·A↑S(j, w)·A↑S(w, i) ) ] |

(C(a, b) denotes the binomial coefficient.) This is a slight simplification of how to answer queries: there is some additional weighting of A↑S involved. Takeaway: answering queries takes time O(poly(k)).
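The degree row of the table never needs the full n × n lifted matrix: grouping the sum by supernode gives an O(k)-time query. A sketch on the running example (the `degree_estimate` helper is illustrative, not from the paper's code):

```python
import numpy as np

def degree_estimate(A_S, sizes, s_u):
    """deg(u) ~ sum_w A_lift[u, w] = sum_j |V_j| * A_S[s(u), j].
    O(k) time: one pass over the supernodes."""
    return float(np.dot(sizes, A_S[s_u]))

# running example: vertex 3 lives in supernode {3,4,5} (index 1)
A_S = np.array([[0.,  1/3, 1/4],
                [1/3, 7/9, 1/3],
                [1/4, 1/3, 3/4]])
sizes = np.array([2, 3, 2])   # |V_1|, |V_2|, |V_3|
deg3 = degree_estimate(A_S, sizes, 1)
# 2*(1/3) + 3*(7/9) + 2*(1/3) = 11/3
```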
16. What is a good summary? Given G = ([n], E), each partitioning P of [n] corresponds to a summary SP. Which partition of [n] gives the best summary? We need an error metric to evaluate the partitions, i.e., the summaries. Intuition: compare the lifted adjacency matrix of SP with the adjacency matrix AG. Error metrics: • ℓp-reconstruction error, based on the ℓp norm; • cut-norm error, based on the cut norm.
17. What is the reconstruction error? It is the entry-wise ℓp norm of the difference between AG and A↑S: errp(AG, A↑S) = ‖AG − A↑S‖p = ( Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} |AG(i, j) − A↑S(i, j)|^p )^{1/p}. The reconstruction error is such that errp(AG, A↑S) ∈ [0, n^{2/p}]. For the summary in the example we had: err1(AG, A↑S) = 329/18 ≈ 18.28 and err2(AG, A↑S) = √(11844/1296) ≈ 3.02.
18. How do we compute the ℓp-reconstruction error? In time O(k²) from the summary: errp(A↑S, AG)^p = Σ_{i,j∈[k]} |Vi||Vj| dG(i, j)(1 − dG(i, j)) [ (1 − dG(i, j))^{p−1} + dG(i, j)^{p−1} ]. Interesting consequence: the choice of p does not matter much. E.g., err1(A↑S, AG) = 2 · err2(A↑S, AG)². I.e., a partition that minimizes the ℓ2-reconstruction error also minimizes the ℓ1-reconstruction error (and similarly for any errp and errq with p, q ∈ O(1), up to constant factors).
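For an unweighted graph the closed form above ties the two errors together exactly: each entry of AG − A↑S is either d or 1 − d in absolute value, so err1 = 2·err2². This is easy to check numerically; the random graph and random partition below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 3
A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
A = A + A.T                                              # symmetric 0/1, no loops
s = np.r_[np.arange(k), rng.integers(0, k, size=n - k)]  # k nonempty parts
parts = [np.flatnonzero(s == j) for j in range(k)]

D = np.array([[A[np.ix_(P, Q)].mean() for Q in parts] for P in parts])
A_up = D[np.ix_(s, s)]                                   # lifted adjacency matrix

err1 = np.abs(A - A_up).sum()
err2 = np.sqrt(((A - A_up) ** 2).sum())
# for 0/1 entries: err1 == 2 * err2**2, as the closed form predicts
```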
19. What’s wrong with the ℓp-reconstruction error? Achieving low (e.g., o(n²)) ℓ1-reconstruction error may require k very close to n. Example: consider an Erdős–Rényi graph G ∈ G_{n,1/2}. W.h.p., for any two vertices u, v, we have |N(u) △ N(v)| = n/2 ± √(n log n). Assigning two vertices to the same supernode has cost n/2 − o(n). Hence, even for k = n/2, the reconstruction error of any k-summary is ≥ n²/9.
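The neighborhood-concentration fact driving this lower bound is easy to see empirically; a quick simulation (graph size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
A = np.triu(rng.random((n, n)) < 0.5, 1)
A = A | A.T                           # a sample from G(n, 1/2), boolean adjacency

# size of the symmetric difference of the neighborhoods of two vertices
diff = int((A[0] ^ A[1]).sum())
# concentrated around n/2 = 1000, with fluctuations of order sqrt(n log n)
```

Any two rows disagree on about half their entries, so no representative row exists for a supernode: merging any two vertices already costs about n/2.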
20. How can we solve this drawback? When G is not random, we hope to find structure to exploit for building the summary. Look at the adjacency matrix AG as AG = S + E, where • S is a “structured” matrix captured by the summary; • E is a “residual” matrix representing the error. The “smaller” E is, the better the summary. The cut-norm error tries to quantify E.
21. What is the cut norm? The cut norm of B ∈ R^{n×n} is the maximum absolute sum of the entries of any of its submatrices: ‖B‖_□ = max_{S,T⊆[n]} |B(S, T)| = max_{S,T⊆[n]} | Σ_{i∈S, j∈T} B_{i,j} |. The cut-norm error of a summary S w.r.t. G is: err_□(AG, A↑S) = ‖AG − A↑S‖_□ = max_{S,T⊆V} |eG(S, T) − e_{S↑}(S, T)|. Interpretation: the maximum absolute difference between (twice) the sum of the edge weights between any two sets of vertices in G and the same sum in A↑S. Computing the cut norm is NP-hard, but a constant-factor approximation exists via semidefinite programming.
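On tiny matrices the cut norm can be computed by brute force over all pairs of subsets, which makes the definition concrete (exponential time, for illustration only; the function name is made up):

```python
import numpy as np
from itertools import product

def cut_norm_bruteforce(B):
    """max over S, T of |sum_{i in S, j in T} B[i, j]|.
    With 0/1 indicator vectors s, t the inner sum is exactly s @ B @ t."""
    n = B.shape[0]
    best = 0.0
    for s in product([0.0, 1.0], repeat=n):
        for t in product([0.0, 1.0], repeat=n):
            best = max(best, abs(np.array(s) @ B @ np.array(t)))
    return best

B = np.array([[1., -1.],
              [-1., 1.]])
# the best choice picks a single entry, so the cut norm of B is 1
```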
22. What is GraSS? GraSS: Graph Structure Summarization, by LeFevre and Terzi, SIAM SDM’10. It introduces the ℓ1-reconstruction error, defines a variant of the lifted adjacency matrix, and shows how to answer some queries using the summary. It gives an algorithm to build a summary using agglomerative hierarchical clustering: 1) put each vertex in a separate supernode; 2) until the number of supernodes is k, merge the two supernodes whose merging minimizes the ℓ1-reconstruction error; 3) output the resulting k supernodes. GraSS does not offer any quality guarantees and it is extremely slow.
23. Why use the lifted adjacency matrix? Given p ≥ 1, a graph G = ([n], E) with adjacency matrix AG, and a partitioning P = {S1, …, Sk} of [n], which P-constant matrix B*,p_P minimizes ‖AG − B*,p_P‖p? Lemma: A↑S is a (small) constant-factor approximation of B*,p_P.
24. Why use the lifted adjacency matrix? Given p ≥ 1, a graph G = ([n], E) with adjacency matrix AG, and a partitioning P = {S1, …, Sk} of [n], which P-constant matrix B*,p_P minimizes ‖AG − B*,p_P‖p? Lemma: A↑S is a (small) constant-factor approximation of B*,p_P. Intuition (this is not a proof): sample u.a.r. a pair of vertices (u, w) from Vi × Vj, and let X be the r.v. with value AG(u, w). Which real number a minimizes E[|X − a|^p]? For p = 1 it is median(X); for p = 2 it is E[X]. We have: A↑S(i, j) = ( Σ_{(u,w)∈Vi×Vj} AG(u, w) ) / (|Vi||Vj|) = E[X]. Hence, B*,2_P = A↑S. Also, we show that ‖AG − B*,1_P‖1 ≤ ‖AG − A↑S‖1 ≤ 2 ‖AG − B*,1_P‖1.
25. What is our algorithm for graph summarization? As simple as: • for the ℓp-reconstruction error, perform ℓp-clustering of the rows of AG (p = 1: k-median; p = 2: k-means). If row i is in cluster j, then vertex i is in supernode Vj. Lemma: the summary obtained from the optimal ℓ1 (resp. ℓ2) clustering is an 8-approximation (resp. 4-approximation) of the optimal summary for the ℓ1 (resp. ℓ2) reconstruction error. …not really so simple to prove! • For the cut-norm error, perform ℓ1-clustering of the rows of A²G, with a specific distance function. Lemma: (a tad more complex, let’s skip it.)
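The ℓ2 route can be sketched end to end, with ordinary Lloyd iterations and farthest-point initialization standing in for the constant-factor k-means algorithm the paper actually uses; all names below are illustrative:

```python
import numpy as np

def summarize_l2(A, k, iters=20):
    """Sketch: cluster the rows of A_G with k-means; cluster j -> supernode V_j.
    Lloyd's heuristic + farthest-point init are a stand-in for the
    constant-factor k-means approximation used in the paper."""
    n = A.shape[0]
    centers = [0]
    for _ in range(k - 1):            # farthest-point initialization
        d = np.min([((A - A[c]) ** 2).sum(1) for c in centers], axis=0)
        centers.append(int(d.argmax()))
    C = A[centers].astype(float)
    for _ in range(iters):            # Lloyd iterations
        s = ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(s == j):
                C[j] = A[s == j].mean(0)
    parts = [np.flatnonzero(s == j) for j in range(k)]
    D = np.array([[A[np.ix_(P, Q)].mean() for Q in parts] for P in parts])
    return s, D

# toy input: two disjoint 4-cliques; k = 2 recovers them exactly
B = np.ones((4, 4)) - np.eye(4)
A = np.block([[B, np.zeros((4, 4))], [np.zeros((4, 4)), B]])
s, D = summarize_l2(A, 2)
# each clique becomes one supernode: internal density 3/4, cross density 0
```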
26. Is there some intuition behind this connection with clustering? Given a partition P = {V1, …, Vk}, consider the orthonormal vectors vi ∈ R^n such that vi(j) = 1/√|Vi| if j ∈ Vi, and 0 otherwise, and the n × n projection matrix P = Σ_{i=1}^{k} vi viᵀ. We have A↑S = P AG P. The ℓp cost of the clustering associated to P is ‖AG − AG P‖p. For the ℓp norms, ‖AG − AG P‖p / 2 ≤ ‖AG − P AG P‖p ≤ 2 ‖AG − AG P‖p. Let S̄ be the summary arising from the optimal ℓ2-clustering, and S* the optimal summary with minimal ℓ2-reconstruction error. Then: err2(AG, A↑_S̄) = ‖AG − P_S̄ AG P_S̄‖2 ≤ 2 ‖AG − AG P_S̄‖2 ≤ 2 ‖AG − AG P_S*‖2 ≤ 4 ‖AG − P_S* AG P_S*‖2 = 4 · err2(AG, A↑_S*).
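The identity A↑S = P AG P at the heart of this argument is easy to verify numerically; the random weighted graph and random partition below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 9, 3
s = np.r_[np.arange(k), rng.integers(0, k, size=n - k)]  # k nonempty parts
A = rng.random((n, n))
A = (A + A.T) / 2                                        # symmetric weights

# orthonormal indicator vectors: v_i(j) = 1/sqrt(|V_i|) if j in V_i, else 0
V = np.zeros((n, k))
for j in range(k):
    idx = np.flatnonzero(s == j)
    V[idx, j] = 1.0 / np.sqrt(len(idx))
P = V @ V.T                                              # projection matrix

parts = [np.flatnonzero(s == j) for j in range(k)]
D = np.array([[A[np.ix_(Pi, Qj)].mean() for Qj in parts] for Pi in parts])
A_up = D[np.ix_(s, s)]                                   # lifted matrix
# P @ A @ P averages each block of A, which is exactly the lifting: P A P == A_up
```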
27. How can we build a summary efficiently? Both k-means and k-median are NP-hard, but there are constant-factor approximation algorithms. Bottleneck: computing all pairwise distances for the n rows of AG is expensive (like matrix multiplication). Solution: use a sketch of the adjacency matrix with n rows and O(log n) columns; this incurs an additional constant-factor error. Even with the sketch, the approximation algorithms take time Õ(n²). Idea: select O(k) rows of the sketch adaptively, and compute a clustering using them. In the end, the algorithm runs in time Õ(m + nk) and obtains a constant-factor approximation. Proving it is not fun… or is it?
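For p = 2, the sketching step can be illustrated with a plain Gaussian random projection (Johnson–Lindenstrauss); the paper uses Indyk's sketches, which also cover p = 1, so this is only a stand-in, with dimensions picked by hand:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 40                    # d should be O(log n / eps^2); hand-picked here
A = rng.integers(0, 2, size=(n, n)).astype(float)

# n x d sketch: pairwise l2 distances between rows are roughly preserved,
# so clustering can run on d columns instead of n
S = A @ rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))

exact = np.linalg.norm(A[0] - A[1])
approx = np.linalg.norm(S[0] - S[1])
ratio = approx / exact            # close to 1 with high probability
```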
28. So, what’s the algorithm?

Algorithm 1: Graph summarization with ℓp-reconstruction error
Input: G = (V, E) with |V| = n, k ∈ N, p ∈ {1, 2}
Output: an O(1)-approximation to the best k-summary for G under the ℓp-reconstruction error
// Create the n × O(log n) sketch matrix (Indyk, 2006)
S ← createSketch(AG, O(log n), p)
// Select O(k) rows from the sketch (Aggarwal et al., 2009)
R ← reduceClustInstance(AG, S, k)
// Run the approximation algorithm by Mettu and Plaxton (2003) to obtain a partition
P ← getApproxClustPartition(p, k, R, S)
// Compute the densities for the summary
D ← computeDensities(P, AG)
return (P, D)
29. What is the story for the cut norm? Szemerédi’s regularity lemma: in every large enough graph, there is a partitioning of the nodes into subsets of about the same size such that the edges between different subsets behave almost randomly. It is a key result in extremal graph theory: a fact that holds on random graphs then also holds on large enough generic graphs. The partitioning can be found in polynomial time, but it may have a galactic size (a tower of exponentials in 1/ε). Frieze and Kannan defined weak regularity, which requires fewer classes.
30. What is weak regularity? Given G = ([n], E) and a partitioning P of [n], P is ε-weakly regular if the cut-norm error of the summary corresponding to P is at most εn², i.e.: err_□(AG, A↑_SP) = ‖AG − A↑_SP‖_□ = max_{S,T⊆V} |eG(S, T) − e_{S↑_P}(S, T)| ≤ εn². Any unweighted graph with edge density α = 2|E|/|V|² ∈ [0, 1] admits an ε-weakly regular partition with 2^{O(α log(1/α)/ε²)} parts (a tight bound); a specific graph may still have much smaller partitions. The optimal deterministic algorithm to build an ε-weakly regular partition with at most 2^{1/poly(ε)} parts takes time c(ε)n².
31. What’s wrong with this? Existing algorithms take an error bound ε and always produce a partition of size 2^{Θ(1/ε²)}; we instead solve the problem of constructing a partition of a given size k that minimizes the cut-norm error. Algorithm: 1) square AG; 2) do k-median on the rows of A²G, using the distance function δ(u, v) = Σ_{w∈V} |A²G(u, w) − A²G(v, w)| / n². Lemma: let OPT be the error of the optimal weakly-regular k-partition of [n]. The partitioning induced by the optimal k-median solution is a weakly-regular partition with cost at most 16 √OPT.
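A toy version of this cut-norm route, with a simple k-medoids loop standing in for the constant-factor k-median algorithm (the function name and initialization are illustrative, not from the paper's implementation):

```python
import numpy as np

def cutnorm_partition(A, k, iters=10):
    """Cluster the rows of A_G^2 under
    delta(u, v) = sum_w |A2[u, w] - A2[v, w]| / n^2,
    using a naive k-medoids loop as a stand-in for constant-factor k-median."""
    n = A.shape[0]
    A2 = A @ A
    Dist = np.abs(A2[:, None, :] - A2[None, :, :]).sum(-1) / n**2
    meds = list(range(k))                   # naive initialization
    for _ in range(iters):
        s = Dist[:, meds].argmin(1)         # assign each row to its nearest medoid
        for j in range(k):
            members = np.flatnonzero(s == j)
            if len(members):                # medoid = member minimizing in-cluster cost
                costs = Dist[np.ix_(members, members)].sum(0)
                meds[j] = int(members[costs.argmin()])
    return Dist[:, meds].argmin(1)

# two disjoint 4-cliques again: the induced 2-partition separates them
B = np.ones((4, 4)) - np.eye(4)
A = np.block([[B, np.zeros((4, 4))], [np.zeros((4, 4)), B]])
s = cutnorm_partition(A, 2)
```

Squaring AG first makes rows comparable through their two-hop structure, which is what ties the ℓ1 clustering cost to the cut norm.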
32. How does our algorithm perform in practice? Implementation: C++, available at http://github.com/rionda/graphsumm. Datasets: from SNAP (http://snap.stanford.edu/datasets): Facebook, Enron, Epinions1, SignEpinions, Stanford, Amazon0601 (from 4k to 400k vertices, 88k to 2.5M edges). Goals: • study the structure of the summaries generated by our algorithm; • compare its performance with GraSS; • evaluate the quality of approximate query answers.
33. What do the summaries look like? Experiment: as k varies, build a summary with k supernodes, and measure: • the distribution of supernode sizes; • the distributions of internal and cross densities; • the reconstruction or cut-norm error; • the runtime.
34. What do the summaries look like? Experiment: as k varies, build a summary with k supernodes, and measure: • the distribution of supernode sizes; • the distributions of internal and cross densities; • the reconstruction or cut-norm error; • the runtime. Results: • there is always at least one supernode containing a single node, but also supernodes with 10² or 10³ nodes: a high standard deviation, decreasing superlinearly in k; • some supernodes contain large quasi-cliques, but the stddev of the internal density is high; • all cross densities are very low, and some are zero: there are independent supernodes; • the ℓ2-reconstruction and cut-norm errors shrink linearly in k; • the runtime increases linearly in k, and its stddev grows with k.
35. How does our algorithm compare to the state of the art? Compared with GraSS (implemented by us, since the authors lost their implementation); our algorithm is “better, faster, stronger”:

| k | Alg. | c (for GraSS) | 2-reconstr. err. avg | stdev | min | max | Runtime (s) |
|---|---|---|---|---|---|---|---|
| 10 | GraSS | 0.1 | 0.168 | 14 × 10⁻³ | 0.167 | 0.170 | 495.122 |
| 10 | GraSS | 0.5 | 0.155 | 0.8 × 10⁻³ | 0.154 | 0.156 | 2669.961 |
| 10 | GraSS | 0.75 | 0.153 | 0.7 × 10⁻³ | 0.153 | 0.154 | 4401.213 |
| 10 | GraSS | 1.0 | 0.153 | 0.6 × 10⁻³ | 0.152 | 0.154 | 5516.915 |
| 10 | Ours | — | 0.152 | 1.6 × 10⁻³ | 0.150 | 0.154 | 0.440 |
| 250 | GraSS | 0.1 | 0.081 | 0.8 × 10⁻³ | 0.080 | 0.082 | 462.085 |
| 250 | GraSS | 0.5 | 0.072 | 0.4 × 10⁻³ | 0.072 | 0.073 | 2517.515 |
| 250 | GraSS | 0.75 | 0.070 | 0.5 × 10⁻³ | 0.070 | 0.071 | 4192.601 |
| 250 | GraSS | 1.0 | 0.069 | 0.5 × 10⁻³ | 0.069 | 0.070 | 5263.651 |
| 250 | Ours | — | 0.065 | 0.8 × 10⁻³ | 0.064 | 0.067 | 0.552 |

Table: comparison between our algorithm and GraSS on a random sample (n = 500) of ego-gplus.
36. How good is the query answering approximation? Experiment: run queries on summaries of different sizes k, and measure the error. Queries: adjacency, degree, clustering coefficient (triangles). Results: • the error decreases very rapidly as k increases; • adjacency is well estimated on average, but with large stddev and max when two supernodes are large and have very few edges connecting their nodes; • degree is extremely well approximated, with small stddev; • triangle counts are well approximated when k is not too small w.r.t. |V|, but the quality decreases very fast otherwise.
37. What did we learn from the experiments? A whole supernode is often allocated to a single node: this seems suboptimal. The sweet spot for the number of supernodes is not easy to find. Query performance is just OK, not great. Question: was this just a theoretical exercise? Answer: maybe not, but we aren’t “there” yet in graph summarization.
38. What did I tell you about? The first constant-factor approximation algorithm for graph summarization. We considered different error measures, including the cut norm, related to Szemerédi’s Regularity Lemma, showing a practical application of highly theoretical concepts. The algorithm crushes the competition under every metric we measured. Experimental inspection of the built summaries opens more questions about summaries and error metrics. Thank you! EML: matteo@twosigma.com TWTR: @teorionda WWW: http://matteo.rionda.to