Big data matrix factorizations and Overlapping community detection in graphs
In a talk at the Chinese Academy of Sciences Institute for Automation, I discuss some of the MapReduce and community detection methods I've worked on.

Presentation Transcript

  • Big data matrix factorizations and Overlapping community detection in graphs. David F. Gleich, Purdue University. Joint work with Paul Constantine, Austin Benson, Jason Lee, Jeremy Templeton, Yangyang Hou, C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon. Supported by NSF CAREER 1149756-CCF and a DOE ASCR award. Code: bit.ly/dgleich-codes
  • Tall-and-skinny matrices (m ≫ n): many rows (like a billion), a few columns (under 10,000). Example: a matrix A from the tinyimages collection. Used in: regression and general linear models with many samples; block iterative methods; panel factorizations; approximate kernel k-means; big-data SVD/PCA. David Gleich · Purdue
  • A graphical view of the MapReduce programming model. [Diagram: data → Map → (key, value) pairs → Shuffle → (key, values) groups → Reduce → data.] Map tasks read batches of data in parallel and do some initial filtering. Shuffle is a global communication step, like a group-by or MPI_Alltoall. Reduce is often where the computation happens.
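The three stages on this slide can be imitated in a few lines of ordinary Python. This is only an in-memory sketch, not Hadoop: `map_reduce`, `column_sum_mapper`, and `column_sum_reducer` are hypothetical names chosen for illustration, and the example job is a column sum like the colsum(A) timing that appears later in the talk.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce round over an iterable of records."""
    groups = defaultdict(list)
    for record in records:
        # Map: each record independently emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect every emitted value under its key.
            groups[key].append(value)
    # Reduce: each key is processed together with all of its values.
    return {key: reducer(key, values) for key, values in groups.items()}

def column_sum_mapper(row):
    # Emit one (column index, entry) pair per matrix entry.
    for j, entry in enumerate(row):
        yield j, entry

def column_sum_reducer(key, values):
    # Add up all entries that share a column index.
    return sum(values)

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
colsums = map_reduce(rows, column_sum_mapper, column_sum_reducer)
```

In a real cluster the shuffle is the expensive global step; here it is just a dictionary of lists.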
  • PCA of 80,000,000 images. A is 80,000,000 images by 1000 pixels. [Figure: fraction of variance versus number of principal components (first 100); the first 16 columns of V shown as images.] Constantine & Gleich, MapReduce 2011.
  • Regression with 80,000,000 images. A is 80,000,000 images by 1000 pixels. The goal was to approximate how much red there was in a picture from the value of the grayscale pixels only. We get a measure of how much “redness” each pixel contributes to the whole. From the paper: “… the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if $r_i$ is the sum of the red components in all pixels of image $i$, and $G_{i,j}$ is the gray value of the $j$th pixel in image $i$, then we wanted to find $\min \sum_i \bigl(r_i - \sum_j G_{i,j} s_j\bigr)^2$. There is no particular importance to this regression problem, we use it merely as a demonstration. The coefficients $s_j$ are displayed as an image at the right. They reveal regions of the image that are not as important in determining the overall red component of an image. The color scale varies from light-blue (strongly negative) to blue (0) and red (strongly positive). The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers. We also solved a principal component problem to find a principal component basis for each image. Let $G$ be matrix of $G_{i,j}$’s from the regression and let $u_i$ be the mean of the $i$th …”
  • Models and algorithms for high performance matrix and network computations. Tensor eigenvalues and a power method: maximize $\sum_{ijk} T_{ijk} x_i x_j x_k$ subject to $\|x\|_2 = 1$, with the iteration $[x^{(\text{next})}]_i = \rho \cdot \bigl(\sum_{jk} T_{ijk} x_j x_k + x_i\bigr)$, where $\rho$ ensures the 2-norm constraint (SSHOPM method due to Kolda and Mayo). Network alignment: previous work tackled network alignment with matrix methods for edge overlap; the proposal is for matching triangles using tensor methods. Publications by area: big data methods (SIMAX ’09, SISC ’11, MapReduce ’11, ICASSP ’12); network alignment (ICDM ’09, SC ’11, TKDE ’13); fast and scalable network centrality (SC ’05, WAW ’07, SISC ’10, WWW ’10, …); data clustering (WSDM ’12, KDD ’12, CIKM ’13, …). Massive matrix computations ($Ax = b$, $\min \|Ax - b\|$, $Ax = \lambda x$) on multi-threaded and distributed architectures. [Figure: prediction error and standard deviation panels for the heat transfer model at s = 0.39 cm and s = 1.95 cm.]
  • PCA of 80,000,000 images. A is 80,000,000 images by 1000 pixels. MapReduce stage: zero-mean rows, then TSQR, which produces R. Post-processing stage: SVD of R, which produces V. Outputs: the first 16 columns of V as images and the top 100 singular values (principal components). Constantine & Gleich, MapReduce 2011.
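The pipeline on this slide works because the small factor R from a QR factorization has the same singular values and right singular vectors as the tall matrix itself. A minimal NumPy sketch of that idea follows; the random matrix is a small stand-in for the image data, the variable names are illustrative, and the QR is computed in one call rather than with TSQR on Hadoop.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 8))                    # stand-in for the tall image matrix
A_centered = A - A.mean(axis=1, keepdims=True)        # "zero-mean rows" step from the slide

R = np.linalg.qr(A_centered, mode="r")                # only the small n-by-n factor
_, sigma, Vt = np.linalg.svd(R)                       # SVD of R gives the sigma and V of A_centered

variance_fraction = sigma**2 / np.sum(sigma**2)       # fraction of variance per component
```

Only R (n-by-n) ever needs to fit on one machine, which is why the post-processing step is cheap.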
  • Input: 500,000,000-by-100 matrix. Each record: 1-by-100 row. HDFS size: 423.3 GB. Time to compute colsum(A): 161 sec. Time to compute R in qr(A): 387 sec.
  • How to store tall-and-skinny matrices in Hadoop. A is m x n with m ≫ n, split by rows into submatrices A1, A2, A3, A4. Key is an arbitrary row-id. Value is the 1 x n array for a row (or b x n block). Each submatrix Ai is the input to a map task.
  • Numerical stability was a problem for prior approaches. Previous methods couldn’t ensure that the matrix Q was orthogonal. [Plot: norm(Q^T Q − I) versus condition number (10^0 to 10^20) for the methods AR^{-1}, AR^{-1} + iterative refinement, and Direct TSQR.] References as numbered on the slide: 1. Constantine & Gleich, MapReduce 2011 (prior work); 2.–4. Benson, Gleich, Demmel, BigData’13 (4 is Direct TSQR).
  • Communication avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2011). Algorithm: Data: rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows. [Diagram: Mapper 1 runs serial TSQR over A1–A4 (qr of A1 and A2 gives Q2, R2; qr with A3 gives Q3, R3; qr with A4 gives Q4, R4) and emits R4. Mapper 2 does the same over A5–A8 and emits R8. Reducer 1 runs serial TSQR on R4 and R8 and emits R.]
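The map and reduce steps above both do the same thing: stack the current R on top of the next block of rows and factor again. A minimal single-machine sketch, assuming NumPy and using the hypothetical helper name `serial_tsqr_r`, shows that two "mappers" plus one "reducer" recover an R factor of the whole matrix.

```python
import numpy as np

def serial_tsqr_r(blocks):
    """Return an R factor of the vertically stacked blocks, one block at a time."""
    R = None
    for block in blocks:
        # Stack the running R on top of the next block and refactor.
        stacked = block if R is None else np.vstack([R, block])
        R = np.linalg.qr(stacked, mode="r")
    return R

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 5))

# Two "mappers" each compress half of the rows into a 5-by-5 R factor.
R_map1 = serial_tsqr_r(np.array_split(A[:500], 4))
R_map2 = serial_tsqr_r(np.array_split(A[500:], 4))
# One "reducer" combines the mapper outputs.
R = serial_tsqr_r([R_map1, R_map2])
```

The result satisfies R^T R = A^T A, so it matches a direct QR of A up to the signs of its rows.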
  • More about how to compute a regression: $\min \|Ax - b\|^2 = \min \sum_i \bigl(\sum_j A_{ij} x_j - b_i\bigr)^2$. [Diagram: Mapper 1 runs serial TSQR over blocks A1–A4 of A and carries the right-hand side b along, for example $b_2 = Q_2^T b_1$.]
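One way to read the diagram: every time a QR step produces a Q, the matching piece of b is multiplied by Q^T, so the final small system is R x = Q^T b. The sketch below is an assumption about how this could look on one machine, with the hypothetical name `tsqr_least_squares`; it is checked against `numpy.linalg.lstsq` rather than taken from the talk's code.

```python
import numpy as np

def tsqr_least_squares(blocks):
    """Solve min ||Ax - b||_2 given an iterable of (A_i, b_i) row blocks."""
    R, z = None, None
    for A_i, b_i in blocks:
        stacked_A = A_i if R is None else np.vstack([R, A_i])
        stacked_b = b_i if z is None else np.concatenate([z, b_i])
        Q, R = np.linalg.qr(stacked_A)   # reduced QR of the stacked block
        z = Q.T @ stacked_b              # carry the right-hand side along
    return np.linalg.solve(R, z)         # small n-by-n triangular system

rng = np.random.default_rng(3)
A = rng.standard_normal((1000, 4))
b = rng.standard_normal(1000)
blocks = zip(np.array_split(A, 4), np.array_split(b, 4))
x = tsqr_least_squares(blocks)
```

The components of b discarded at each step only contribute to the residual, not to the minimizer, which is why dropping them is safe.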
  • Too many maps cause too much data to one reducer! Each image is 5k. Each HDFS block has 12,800 images. 6,250 total blocks. Each map outputs a 1000-by-1000 matrix. One reducer gets a 6.25M-by-1000 matrix (50GB).
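The numbers on this slide can be checked with a few lines of arithmetic. The only assumption added here is that matrix entries are stored as 8-byte doubles, which is what makes the reducer input come out to 50 GB.

```python
images_per_block = 12_800
total_blocks = 6_250
pixels = 1_000
bytes_per_entry = 8            # assumption: double-precision entries

total_images = images_per_block * total_blocks           # images in the data set
reducer_rows = total_blocks * pixels                      # one 1000-by-1000 R per map
reducer_bytes = reducer_rows * pixels * bytes_per_entry   # size of the reducer input
reducer_gb = reducer_bytes / 1e9
```

With one map task per HDFS block, the single reducer has to factor a matrix that is itself tall, which motivates adding an intermediate iteration.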
  • Too many maps cause too much data to one reducer! [Diagram: a two-iteration job. Iteration 1: mappers 1-1 through 1-4 run serial TSQR on A1–A4 and emit R1–R4; after a shuffle, reducers 1-1 through 1-3 run serial TSQR on their inputs S1–S3 and emit R2,1, R2,2, R2,3. Iteration 2: an identity map and a shuffle pass these to reducer 2-1, which runs serial TSQR and emits R.]
  • Full TSQR code in hadoopy:

        import random, numpy, hadoopy

        class SerialTSQR:
            def __init__(self, blocksize, isreducer):
                self.bsize = blocksize
                self.data = []
                # the instance is passed to hadoopy.run as a callable; pick its role here
                # (a per-instance __call__ relies on Python 2 old-style classes)
                if isreducer:
                    self.__call__ = self.reducer
                else:
                    self.__call__ = self.mapper

            def compress(self):
                # replace the buffered rows by the R factor of their QR factorization
                R = numpy.linalg.qr(numpy.array(self.data), 'r')
                # reset data and re-initialize to R
                self.data = []
                for row in R:
                    self.data.append([float(v) for v in row])

            def collect(self, key, value):
                self.data.append(value)
                # compress once the buffer holds more than bsize*n rows
                if len(self.data) > self.bsize * len(self.data[0]):
                    self.compress()

            def close(self):
                # final compression, then emit each row of R under a random key
                self.compress()
                for row in self.data:
                    key = random.randint(0, 2000000000)
                    yield key, row

            def mapper(self, key, value):
                self.collect(key, value)

            def reducer(self, key, values):
                for value in values:
                    self.mapper(key, value)

        if __name__ == '__main__':
            mapper = SerialTSQR(blocksize=3, isreducer=False)
            reducer = SerialTSQR(blocksize=3, isreducer=True)
            hadoopy.run(mapper, reducer)
  • Non-negative matrix factorization. NMF: find $W, H \ge 0$ where $A \approx WH$. Separable NMF: find $H \ge 0$ and $A(:, K)$ where $A \approx A(:, K)H$. [Figure: (b) NMF, projections on the 1st and 2nd NMF factors; (c) manifold learning, first and second manifold parameters.]
  • There are good algorithms for separable NMF that avoid alternating between W and H. NMF: find $W, H \ge 0$ where $A \approx WH$. Separable NMF: find $H \ge 0$ and $A(:, K)$ where $A \approx A(:, K)H$.
  • Separable NMF algorithms: 1. Find the columns of A. 2. Find the values of H. Separable NMF: find $H \ge 0$ and $A(:, K)$ where $A \approx A(:, K)H$.
  • Separable NMF algorithms are really geometry. 1. Find the columns of A. Equivalent to “find the extreme points of a convex set.” 2. These are preserved under linear transformations. Separable NMF: find $H \ge 0$ and $A(:, K)$ where $A \approx A(:, K)H$.
  • We use our tall-and-skinny QR to get an orthogonal transformation that makes the problem easily solvable.
  • $A = U S V^T$ (SVD); $A \approx A(:, K) H$ (NMF). 1. Compute QR using the TSQR method. 2. Run a separable NMF method on $S V^T$. 3. Find H by solving a small non-negative least-squares problem in each column. These are tiny.
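The three steps can be sketched on a small synthetic matrix. This is an illustration under several assumptions: the successive projection algorithm (SPA, one of the methods compared later in the talk) stands in for "a separable NMF method", the QR is a single NumPy call instead of TSQR, `scipy.optimize.nnls` solves the small non-negative least-squares problems, and the helper name `spa` and the test matrix are made up for the example.

```python
import numpy as np
from scipy.optimize import nnls

def spa(X, r):
    """Successive projection: greedily pick r extreme columns of X."""
    residual = X.astype(float).copy()
    selected = []
    for _ in range(r):
        # The column of largest norm is an extreme point.
        j = int(np.argmax(np.sum(residual**2, axis=0)))
        selected.append(j)
        u = residual[:, j] / np.linalg.norm(residual[:, j])
        residual -= np.outer(u, u @ residual)   # project out the chosen direction
    return selected

rng = np.random.default_rng(2)
W = rng.random((500, 3))                        # true extreme columns
mix = rng.dirichlet(np.ones(3), size=7).T       # 3-by-7 convex weights
A = W @ np.hstack([np.eye(3), mix])             # separable 500-by-10 matrix

R = np.linalg.qr(A, mode="r")                   # step 1: small factor of A
_, s, Vt = np.linalg.svd(R)
X = s[:, None] * Vt                             # S V^T, an orthogonal transform of A
K = spa(X, 3)                                   # step 2: extreme columns
H = np.column_stack([nnls(X[:, K], X[:, j])[0]  # step 3: one small NNLS per column
                     for j in range(X.shape[1])])
rel_error = np.linalg.norm(A - A[:, K] @ H) / np.linalg.norm(A)
```

Everything after the QR touches only a 10-by-10 matrix, which is the point of the slide: the tall dimension is handled once.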
  • All of the hard analysis is on the small dimension of the matrix, which makes this very useful in practice.
  • Our methods vs. the competition. 200 million rows, 200 columns, separation rank 20. Figure 1: Relative error in the separable factorization as a function of nonnegative rank (r) for the three algorithms. The matrix was synthetically generated to be separable. SPA and GP capture all of the true extreme columns when r = 20 (where the residual is zero). Since we are using the greedy variant of XRAY, it takes r = 21 to capture all of the … Figure 2: First 20 extreme columns selected by SPA, XRAY, and GP along with the true columns used in the synthetic matrix generation. A marker is present for a given column index if and only if that column is a selected extreme column. SPA and GP capture all of the true extreme columns. Since we are using the greedy variant of XRAY, it does …
  • Nonlinear heat transfer model in random media. Each run takes 5 hours on 8 processors and outputs a 4M (node) by 9 (time-step) simulation. We did 8192 runs (128 samples of bubble locations, 64 bubble radii). 4.5 TB of data in Exodus II (NetCDF). [Figure: apply heat, look at temperature.] https://www.opensciencedatacloud.org/publicdata/heat-transfer/
  • [Plot: proportion of temperature > 475 K versus bubble radius (0 to 60), with the insulator and non-insulator regimes marked; an inset compares True, ROM, and RS.]
  • Each simulation is a column: A is a 5B-by-64 matrix, 2.2TB. $A = U S V^T$ (SVD); $A \approx A(:, K) H$ (NMF). Run a “standard” NMF algorithm on $S V^T$.
  • Figure 8: First 10 extreme columns selected by SPA, XRAY, and GP for the heat transfer simulation … Figure 9: Coefficient matrix H for SPA, XRAY, and GP for the heat transfer simulation data when r = 10. In all cases, the non-extreme columns are conic combinations of two of the selected columns, i.e., each column in H has at most two non-zero values. Specifically, the non-extreme columns are conic combinations of the two extreme columns that “sandwich” them in the matrix. See Figure 10 for a closer look at the coefficients. Figure 10: Value of H matrix for columns 1 through 34 for the SPA algorithm on the heat transfer …
  • A bunch of papers: Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, BigData 2013; Benson, Gleich, Rajwa & Lee, arXiv 2014; Constantine, Gleich, Hou & Templeton, SISC, in press. Code online: github.com/arbenson
  • Next talk: 1. Personalized PageRank based community detection. 2. The best community detection algorithm?
  • A community is a set of vertices that is denser inside than out.
  • 250 node GEOP network in 2 dimensions
  • We can find communities using Personalized PageRank (PPR) [Andersen et al. 2006]. PPR is a Markov chain on nodes: 1. with probability α, follow a random edge; 2. with probability 1−α, restart at a seed. Also known as the random surfer or random walk with restart. It has a unique stationary distribution.
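The stationary distribution of this chain can be written as a linear system. The notation below is an assumption added for illustration and is not on the slide: $P = A D^{-1}$ is the column-stochastic random walk matrix of the graph and $s$ is the seed distribution (for example, uniform over the seed set). The push code later in the talk uses a lazy variant of the walk, so its constants differ slightly.

```latex
x = \alpha P x + (1 - \alpha) s
\quad\Longleftrightarrow\quad
(I - \alpha P)\, x = (1 - \alpha)\, s ,
\qquad 0 < \alpha < 1 .
```

Because $\alpha < 1$ and $P$ is column-stochastic, $I - \alpha P$ is nonsingular, which is one way to see that the solution is unique.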
  • Personalized PageRank community detection: 1. Given a seed, approximate the stationary distribution. 2. Extract the community. Both are local operations.
  • Conductance communities. Conductance is one of the most important community scores [Schaeffer07]. The conductance of a set of vertices is the ratio of edges leaving to total edges: $\phi(S) = \dfrac{\mathrm{cut}(S)}{\min\bigl(\mathrm{vol}(S), \mathrm{vol}(\bar{S})\bigr)}$, that is, (edges leaving the set) / (total edges in the set). Equivalently, it’s the probability that a random edge leaves the set. Small conductance ⇔ good community. Example from the slide: $\mathrm{cut}(S) = 7$, $\mathrm{vol}(S) = 33$, $\mathrm{vol}(\bar{S}) = 11$, so $\phi(S) = 7/11$.
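A direct translation of the definition into Python may help. The sketch assumes an undirected graph stored as a dictionary of neighbor sets (the same representation the talk's PageRank code uses); the function name `conductance` and the small example graph are illustrative.

```python
def conductance(G, S):
    """Conductance of vertex set S in an undirected graph G (dict of neighbor sets)."""
    S = set(S)
    # cut(S): edges with exactly one endpoint in S.
    cut = sum(1 for v in S for u in G[v] if u not in S)
    # vol(S): sum of degrees of the vertices in S.
    vol_S = sum(len(G[v]) for v in S)
    vol_rest = sum(len(G[v]) for v in G) - vol_S
    return cut / min(vol_S, vol_rest)

# Two triangles joined by the single edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
     3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi = conductance(G, {0, 1, 2})   # cut = 1, vol(S) = vol(rest) = 7
```

The function divides by zero when S is empty or contains every vertex, where conductance is not defined.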
  • Andersen-Chung-Lang personalized PageRank community theorem [Andersen et al. 2006]. Informally: suppose the seeds are in a set of good conductance; then the personalized PageRank method will find a set with conductance that’s nearly as good. … also, it’s really fast.
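  • Personalized PageRank via the push method:

        import collections

        # G is the graph as a dictionary of sets: G[v] is the set of neighbors of v
        # seed is a list of seed vertices
        alpha = 0.99
        tol = 1e-4

        x = {}  # approximate PageRank vector, stored as a dictionary
        r = {}  # residual vector, stored as a dictionary
        Q = collections.deque()  # queue of vertices with large residual
        for s in seed:
            r[s] = 1. / len(seed)
            Q.append(s)
        while len(Q) > 0:
            v = Q.popleft()  # v has r[v] >= tol*deg(v)
            if v not in x: x[v] = 0.
            x[v] += (1 - alpha) * r[v]
            mass = alpha * r[v] / (2 * len(G[v]))
            for u in G[v]:  # for neighbors of v
                if u not in r: r[u] = 0.
                if r[u] < len(G[u]) * tol and r[u] + mass >= len(G[u]) * tol:
                    Q.append(u)  # add u to the queue once its residual is large
                r[u] = r[u] + mass
            r[v] = mass * len(G[v])  # v keeps half of the pushed mass (lazy walk)
            if r[v] >= len(G[v]) * tol:
                Q.append(v)  # v still has a large residual, so revisit it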
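The "extract the community" step in the Andersen-Chung-Lang method is a sweep cut: order vertices by PageRank value divided by degree and keep the prefix with the smallest conductance. A minimal sketch follows, assuming the same dictionary-of-sets graph and a sparse PageRank dictionary `x`; the function name `sweep_cut` and the toy values are illustrative.

```python
def sweep_cut(G, x):
    """Return (best_set, best_conductance) over prefixes of the degree-normalized order."""
    total_vol = sum(len(G[v]) for v in G)
    order = sorted(x, key=lambda v: x[v] / len(G[v]), reverse=True)
    S, vol, cut = set(), 0, 0
    best_set, best_phi = None, float("inf")
    for v in order:
        inside = sum(1 for u in G[v] if u in S)   # edges from v into the current set
        S.add(v)
        vol += len(G[v])
        cut += len(G[v]) - 2 * inside             # new boundary edges minus absorbed ones
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break                                  # the set covers the whole graph
        phi = cut / denom
        if phi < best_phi:
            best_set, best_phi = set(S), phi
    return best_set, best_phi

# Two triangles joined by the edge 2-3, with PageRank mass concentrated on one triangle.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
     3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = {0: 0.3, 1: 0.3, 2: 0.25, 3: 0.05}
community, phi = sweep_cut(G, x)
```

Only vertices with a nonzero PageRank value are ever examined, which keeps the extraction local.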
  • Problem 1, which seeds?
  • Whang-Gleich-Dhillon, CIKM2013 [upcoming…]: 1. Extract part of the graph that might have overlapping communities. 2. Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus. 3. Find the center of these partitions. 4. Use PPR to grow egonets of these centers.
  • A good partitioning helps. [Plot: maximum conductance versus coverage (percentage) for the seeding strategies egonet, graclus centers, spread hubs, random, and bigclam; panels (a) AstroPh and (d) Flickr.] Flickr social network sample: 2M vertices, 22M edges. We can cover 95% of the network with communities of conductance ~0.15.
  • And helps to find real-world overlapping communities too. [Figure 3: F1 and F2 measures on DBLP for demon, bigclam, graclus centers, spread hubs, random, and egonet.] Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure. Our method outperforms current state-of-the-art overlapping community detection methods. Even randomly seeded!
  • Seed Set Expansion: carefully select seeds, then greedily expand communities around the seed sets. The algorithm: Filtering Phase, Seeding Phase, Seed Set Expansion Phase, Propagation Phase. Joyce Jiyoung Whang, The University of Texas at Austin, Conference on Information and Knowledge Management (8/44)
  • Filtering Phase
  • Seed Set Expansion Phase: run clustering and choose centers, or pick an independent set of high degree nodes. Then run personalized PageRank.
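The second seeding option, an independent set of high degree nodes, can be sketched as a greedy pass over vertices in order of decreasing degree. This is an illustrative guess at the idea, not the procedure from the CIKM paper: the function name `spread_hubs` is borrowed from the plot legend, and the tie-breaking and stopping rules are assumptions.

```python
def spread_hubs(G, k):
    """Greedily pick up to k high-degree vertices, no two of them adjacent."""
    seeds, blocked = [], set()
    for v in sorted(G, key=lambda v: len(G[v]), reverse=True):
        if len(seeds) == k:
            break
        if v not in blocked:
            seeds.append(v)
            blocked.add(v)
            blocked.update(G[v])   # neighbors of a seed cannot become seeds
    return seeds

# Two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
     3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
seeds = spread_hubs(G, 2)
```

Each chosen seed would then be handed to personalized PageRank to grow a community around it.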
  • Propagation Phase. We can prove that this only improves the objective.
  • Conclusion & Discussion. PPR community detection is fast [Andersen et al. FOCS06]. PPR communities look real [Abrahao et al. KDD2012; Zhu et al. ICML2013]. Partitioning for seeding yields high coverage & real communities. “Caveman” communities? Best conductance cut at intersection of communities? References: Gleich & Seshadhri KDD2012; Whang, Gleich & Dhillon CIKM2013. Code: PPR sample bit.ly/18khzO5; egonet seeding bit.ly/dgleich-code