1.
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
David F. Gleich, Computer Science, Purdue University
David Gleich · Purdue
2.
It's a pleasure to be here!
Intel intern, 2005, in the Application Research Lab in Santa Clara, resulting in one of my favorite papers:
Gleich and Polito, "Approximating Personalized PageRank with Minimal Use of Web Graph Data," Internet Mathematics Vol. 3, No. 3: 257-294.
Could you run your own search engine and crawl the web to compute your own PageRank vector if you are highly concerned with privacy? Yes! Theory, experiments, implementation!
Abstract (excerpt): we consider the problem of calculating fast and accurate approximations to the personalized PageRank score of a webpage. We focus on improving speed by limiting the amount of web graph data we need to access. Our algorithms provide both the approximation to the personalized PageRank score as well as guidance in using only the necessary information, and therefore reduce not only the computational cost of the algorithm but also the memory and memory bandwidth requirements. We report experiments with these algorithms on web graphs of up to 118 million pages, and we propose a local, personalized web-search system built on them.
3.
Massive MapReduce Matrix Computations
Collaborators: Yangyang Hou (Purdue, CS); Paul G. Constantine, Austin Benson, Joe Nichols (Stanford University); James Demmel (UC Berkeley); Joe Ruthruff, Jeremy Templeton (Sandia CA).
Funded by the Sandia National Labs CSAR project.
4.
By 2013(?) all Fortune 500 companies will have a data computer.
5.
Data computers I've worked with …
Magellan Cluster @ NERSC: 128 GB/core storage, 80 nodes, 640 cores, InfiniBand.
Student Cluster @ Stanford: 3 TB/core storage, 11 nodes, 44 cores, GB Ethernet. Cost $30k.
Nebula Cluster @ Sandia CA: 2 TB/core storage, 64 nodes, 256 cores, GB Ethernet. Cost $150k.
These systems are good for working with enormous matrix data!
6.
How do you program them?
7.
MapReduce and Hadoop overview
8.
MapReduce in a picture
Like an MPI all-to-all: map tasks run in parallel, a shuffle moves data between stages, then reduce tasks run in parallel.
9.
Computing a histogram: a simple MapReduce example
Input: key = ImageId, value = Pixels. Output: key = Color, value = # of pixels.

Map(ImageId, Pixels):
  for each pixel, emit key = (r, g, b), value = 1

Reduce(Color, Values):
  emit key = Color, value = sum(Values)

The shuffle between the two stages groups the emitted values by color.
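The map, shuffle, and reduce steps above can be sketched as a tiny in-memory Python toy (not Hadoop; the image data here is made up):

```python
from collections import defaultdict

def mapper(image_id, pixels):
    # emit key = (r, g, b), value = 1 for each pixel
    for color in pixels:
        yield color, 1

def reducer(color, values):
    # emit key = color, value = total pixel count
    yield color, sum(values)

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)   # the "shuffle": group values by key
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = {}
    for k, vs in groups.items():
        for rk, rv in reducer(k, vs):
            out[rk] = rv
    return out

images = [("img1", [(255, 0, 0), (255, 0, 0), (0, 0, 255)]),
          ("img2", [(255, 0, 0)])]
hist = map_reduce(images, mapper, reducer)
```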
10.
Why a limited computational model? Data scalability, fault tolerance.
The idea: bring the computations to the data. MapReduce can schedule map functions without moving data.
(The alternative: after waiting in the queue for a month and after 24 hours of finding eigenvalues, one node randomly hiccups, and all you get is the last page of a 136-page error dump.)
11.
Tall-and-Skinny matrices (m ≫ n)
Many rows (like a billion), a few columns (under 10,000).
Used in: regression and general linear models with many samples (e.g. from the tinyimages collection); block iterative methods; panel factorizations; simulation data analysis; big-data SVD/PCA!
12.
Scientific simulations as Tall-and-Skinny matrices
Input: parameters s. Output: the time history of the simulation, f(s), ~100 GB. A database of simulations: s1 -> f1, s2 -> f2, …, sk -> fk.
The simulation as a vector: stack the state q(x, t, s) over space and time,
f(s) = [ q(x1, t1, s); …; q(xn, t1, s); q(x1, t2, s); …; q(xn, t2, s); …; q(xn, tk, s) ].
The simulation as a matrix: space-by-time. The database is a very tall-and-skinny matrix A with one column per parameter.
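The stacking can be sketched in NumPy (toy sizes; the ordering, with space varying fastest and one column per parameter, follows the slide's description):

```python
import numpy as np

# q[x, t, s]: a simulation state over n_space points, n_time steps,
# and n_param parameter values (random stand-in data)
n_space, n_time, n_param = 4, 3, 2
rng = np.random.default_rng(0)
q = rng.standard_normal((n_space, n_time, n_param))

# Fortran order makes the first axis (space) vary fastest, then time,
# giving the (n_space * n_time) x n_param tall-and-skinny database A
A = q.reshape(n_space * n_time, n_param, order='F')
```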
13.
Model reduction: a large-scale example (Constantine & Gleich, ICASSP 2012)
Nonlinear heat transfer model; 80k nodes, 300 time-steps; 104 basis runs; SVD of a 24M x 104 data matrix.
500x reduction in wall clock time (100x including the SVD).
14.
PCA of 80,000,000 images (Constantine & Gleich, MapReduce 2010)
A: 80,000,000 images (zero-mean rows) by 1000 pixels. MapReduce TSQR computes R; post-processing gives the SVD: the top 100 singular values and V (the principal components). The first 16 columns of V are shown as images.
15.
All these applications need Tall-and-Skinny QR.
16.
Quick review of QR
QR factorization: let A be real, m x n, m ≥ n. Then A = QR, where Q is orthogonal (Q^T Q = I) and R is upper triangular. QR is block normalization: it "normalizes" a vector and usually generalizes to computing Q in the block case.
Using QR for regression: the solution of min ||Ax - b|| is given by the solution of Rx = Q^T b.
Current MapReduce algorithms use the normal equations: A^T A = R^T R via Cholesky, then Q = A R^-1, which can limit numerical accuracy.
[Slide from David Gleich (Sandia), MapReduce 2011.]
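The accuracy loss from the normal equations can be demonstrated with a short NumPy sketch (a synthetic, moderately ill-conditioned matrix, not data from the talk):

```python
import numpy as np

# Deterministic tall matrix with condition number ~1e5
rng = np.random.default_rng(1)
m, n = 1000, 4
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -5, n)) @ V.T

# Householder QR: Q is numerically orthogonal
Q, R = np.linalg.qr(A)
err_qr = np.linalg.norm(Q.T @ Q - np.eye(n))

# Normal equations: A^T A = R^T R via Cholesky, then Q = A R^{-1};
# the squared condition number (~1e10) limits the orthogonality of Q
Rchol = np.linalg.cholesky(A.T @ A).T
Qchol = A @ np.linalg.inv(Rchol)
err_chol = np.linalg.norm(Qchol.T @ Qchol - np.eye(n))
```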
17.
There are good MPI implementations. Why MapReduce?
18.
Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
19.
Tall-and-skinny matrix storage in MapReduce
A: m x n, m ≫ n. The key is an arbitrary row-id; the value is the 1 x n array for a row. Each submatrix Ai is the input to a map task.
20.
Numerical stability was a problem for prior approaches
Previous methods couldn't ensure that the matrix Q was orthogonal (Constantine & Gleich, MapReduce 2010). AR^-1 with iterative refinement and Direct TSQR do (Benson, Gleich, Demmel, submitted).
Figure: norm(Q^T Q - I) vs. condition number (10^5 to 10^20) for prior work (AR^-1), AR^-1 + iterative refinement, and Direct TSQR.
21.
Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2010)
Data: rows of a matrix, split into submatrices A1, …, A8.
Mapper 1 (serial TSQR): A1, A2, A3, A4 -> successive qr steps -> Q4 R4; emit R4.
Mapper 2 (serial TSQR): A5, A6, A7, A8 -> successive qr steps -> Q8 R8; emit R8.
Reducer 1 (serial TSQR): qr of [R4; R8] -> Q R; emit R.
Notes: a "manual reduce" can make it faster by adding a second iteration. This computes only R and not Q; Q can be recovered via Q = A R^-1 with another MapReduce iteration. (Or use the standard Householder method?)
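The map/reduce TSQR tree can be sketched serially in NumPy (the 4-way split and sizes are arbitrary choices for illustration):

```python
import numpy as np

# "Map": QR-factor each row block and keep only the small R factors.
# "Reduce": stack the R factors and QR again; this R matches the R
# of the full matrix (up to row signs). Q = A R^{-1} recovers Q.
rng = np.random.default_rng(2)
A = rng.standard_normal((4000, 10))
blocks = np.split(A, 4)                            # map inputs A1..A4

Rs = [np.linalg.qr(Ai, mode='r') for Ai in blocks] # map: local QR
R = np.linalg.qr(np.vstack(Rs), mode='r')          # reduce: QR of stacked R's

Q = A @ np.linalg.inv(R)                           # the extra iteration for Q
```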
22.
Taking care of business by keeping track of Q
1. Each mapper computes Ai = Qi Ri and outputs the local Q and R in separate files.
2. Collect the Ri on one node (task 2) and compute the final R along with a small Qi1 for each piece: [R1; R2; R3; R4] = [Q11; Q21; Q31; Q41] R.
3. Distribute the pieces Qi1 and form the true Q: each mapper computes Qi Qi1.
23.
The price is right! Based on a performance model and tests
Experiment on the NERSC Magellan computer: 80 nodes, 640 processors, 80 TB disk.
Direct TSQR is faster than refinement for few columns … and not any slower for many columns.
Test matrices: 800M-by-10, 7.5B-by-4, 150M-by-100, 500M-by-50 (run times on the order of 500-2500 seconds).
24.
Ongoing work
Make AR^-1 stable with targeted quad-precision arithmetic to get a numerically orthogonal Q. A performance model says it's feasible!
How to handle more than ~10,000 columns? Some randomized methods?
Do we need quad-precision for big-data? Standard error analysis gives roughly n·ε error to compute a sum; I've seen this with PageRank computations!
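For context on the n·ε summation error: compensated (Kahan) summation is one standard software alternative to higher precision. This is a generic sketch, not the targeted quad-precision scheme proposed on the slide:

```python
import math

def kahan_sum(xs):
    s, c = 0.0, 0.0        # running sum and compensation term
    for x in xs:
        y = x - c          # correct the next addend
        t = s + y
        c = (t - s) - y    # (t - s) is what actually got added; c is the loss
        s = t
    return s

xs = [0.1] * 10**6         # 0.1 is inexact in binary; naive error grows with n
naive = sum(xs)
compensated = kahan_sum(xs)
exact = math.fsum(xs)      # correctly rounded sum, for reference
```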
25.
Multicore Graph Algorithms
Collaborators: Assefaw Gebremedhin, Arif Khan, Alex Pothen, Ryan Rossi (Purdue, CS); Mahantesh Halappanavar (PNNL); Chen Greif, David Kurokawa (Univ. British Columbia); Mohsen Bayati, Amin Saberi (Stanford); Ying Wang (Stanford, now Google).
Funded by DOE CSCAPES Institute grant (DE-FC02-08ER25864), NSF CAREER grant 1149756-CCF, and the Center for Adaptive Supercomputing Software Multithreaded Architectures (CASS-MT) at PNNL.
26.
Network alignment
What is the best way of matching graph A to B? (Figure: example graphs A, with vertices r, s, t, u, and B, with vertices v, w.)
27.
Network alignment
Figure: the NetworkBLAST local network alignment algorithm. Given two input networks, a network alignment graph is constructed. Nodes in this graph correspond to pairs of sequence-similar proteins, one from each species, and edges correspond to conserved interactions. A search algorithm identifies highly similar subnetworks that follow a prespecified interaction pattern.
From Sharan and Ideker, Modeling cellular machinery through biological network comparison. Nat. Biotechnol. 24, 4 (Apr. 2006), 427-433.
28.
Network alignment
What is the best way of matching graph A to B using only the edges in L? (Figure: graphs A and B with the bipartite link graph L between them.)
29.
Network alignment
Matching? A 1-1 relationship. Best? Highest weight and overlap. (Figure: graphs A, L, and B with an overlap highlighted.)
30.
Our contributions
A new belief propagation method (Bayati et al. 2009, 2013) that outperformed state-of-the-art PageRank- and optimization-based heuristic methods.
High-performance C++ implementations (Khan et al. 2012): 40 times faster (C++ ~3x, complexity ~2x, threading ~8x); 5 million edge alignments in ~10 sec.
www.cs.purdue.edu/~dgleich/codes/netalignmc
32.
Each iteration involves
Let x[i] be the score for each pair-wise match in L.
Matrix-vector-ish computations with a sparse matrix, e.g. sparse matrix-vector products in a semiring, dot-products, axpy, etc.
Bipartite max-weight matching using a different weight vector at each iteration.

for i = 1 to ...
  update x[i] to y[i]
  compute a max-weight match with y
  update y[i] to x[i] (using the match in MR)

No "convergence": 100-1000 iterations.
33.
The methods
Belief propagation. Listing 2: a belief-propagation message passing procedure for network alignment. See the text for a description of the othermax and round heuristics.

1   y(0) = 0, z(0) = 0, d(0) = 0, S(0) = 0
2   for k = 1 to niter
3     F = bound_[0,β] [ S + S(k)^T ]          Step 1: compute F
4     d = αw + Fe                             Step 2: compute d
5     y(k) = d - othermaxcol(z(k-1))          Step 3: othermax
6     z(k) = d - othermaxrow(y(k-1))
7     S(k) = diag(y(k) + z(k) - d) S - F      Step 4: update S
8     (y(k), z(k), S(k)) <- γk (y(k), z(k), S(k)) + (1 - γk)(y(k-1), z(k-1), S(k-1))   Step 5: damping
9     round heuristic(y(k))                   Step 6: matching
10    round heuristic(z(k))                   Step 6: matching
11  end
12  return y(k) or z(k) with the largest objective value

Each iteration involves matrix-vector-ish computations with a sparse matrix (sparse matrix-vector products in a semiring, dot-products, axpy, etc.) and bipartite max-weight matching using a different weight vector at each iteration.
In the message-passing interpretation, the weight vectors are usually called messages as they communicate the "beliefs" of each "agent." In this particular problem, the neighborhood of an agent represents all of the other edges in graph L incident on the same vertex in graph A (1st vector), all edges in L incident on the same vertex in graph B (2nd vector), or the edges in L that are …
34.
The NEW methods (parallel)
The same belief-propagation iteration as Listing 2, with one change in Step 6: approximate bipartite max-weight matching is used instead of exact matching.
35.
Approximation doesn't hurt
Problems: aligning the Library of Congress subject headings (LCSH) with Wikipedia categories (lcsh-wiki) and with the French National Library subject headings, Rameau (lcsh-rameau). While both are hierarchical trees, they also have other types of relationships, so these are real graphs. Weights in L are computed via similarity of the heading strings (and via translation for Rameau). These problems are larger than …
The question: how does the behavior of the BP method and Klau's method (MR) change when we use the approximate matching procedure from Section V in each algorithm? Note that matching is much more integral to Klau's procedure.
Synthetic test: randomly perturb a power-law graph to get A; generate L by the true match + random edges; vary the expected degree of noise in L (p·n).
Figure 2 (fraction of correct matches vs. expected degree of noise in L, 0 to 20): alignment with a power-law graph shows the large effect that approximate rounding can have on solutions from Klau's method (MR vs. ApproxMR). With that method, exact rounding yields the identity matching for all problems, whereas the approximation does not. BP and ApproxBP are nearly indistinguishable.
36.
A local dominating edge method for bipartite matching
A locally dominating edge is an edge heavier than all neighboring edges.
The method guarantees a ½-approximation and a maximal matching; it is based on work by Preis (1999), Manne and Bisseling (2008), and Halappanavar et al. (2012).
For bipartite problems: work on the smaller side only.
37.
A local dominating edge method for bipartite matching
Queue all vertices.
Until the queue is empty:
  In parallel over vertices: match to the heaviest edge; if there's a conflict, check the winner and find an alternative for the loser.
  Add the endpoints of non-dominating edges to the queue.
(A locally dominating edge is an edge heavier than all neighboring edges. For bipartite problems: work on the smaller side only.)
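A serial sketch of the locally dominating edge idea (the parallel queue, locks, and conflict resolution from the slides are omitted; the toy weights are made up):

```python
def dominating_edge_matching(edges):
    """edges: dict mapping (u, v) pairs to weights; returns a matching."""
    remaining = dict(edges)
    matching = set()
    while remaining:
        # heaviest incident edge per vertex, ties broken by the edge key
        best = {}
        for e, w in remaining.items():
            for x in e:
                if x not in best or (w, e) > (remaining[best[x]], best[x]):
                    best[x] = e
        # locally dominating edges are heaviest at both endpoints
        dominant = [e for e in remaining if best[e[0]] == e and best[e[1]] == e]
        matching.update(dominant)
        matched = {x for e in dominant for x in e}
        remaining = {e: w for e, w in remaining.items()
                     if e[0] not in matched and e[1] not in matched}
    return matching

# Toy instance: greedy picks weight 5 + 1 = 6, the optimum is 3 + 4 = 7,
# illustrating the 1/2-approximation guarantee.
edges = {('a1', 'b1'): 5.0, ('a1', 'b2'): 3.0,
         ('a2', 'b1'): 4.0, ('a2', 'b2'): 1.0}
matching = dominating_edge_matching(edges)
```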
38.
A local dominating edge method for bipartite matching: implementation
Customized first iteration (with all vertices).
Use OpenMP locks to update choices.
Use __sync_fetch_and_add for queue updates.
(A locally dominating edge is an edge heavier than all neighboring edges. For bipartite problems: work on the smaller side only.)
39.
Remaining multi-threading procedures are straightforward
Standard OpenMP for matrix computations; use schedule(dynamic) to handle skew.
We can batch the matching procedures in the BP method for additional parallelism:

for i = 1 to ...
  update x[i] to y[i]
  save y[i] in a buffer
  when the buffer is full, compute the max-weight match for all in the buffer and save the best
40.
Performance evaluation
(2x4) sockets of 10-core Intel E7-8870, 2.4 GHz (80 cores); 16 GB memory per processor (128 GB total).
Scaling study:
1. Thread binding: scattered vs. compact.
2. Memory binding: interleaved vs. bind.
41.
Scaling BP with no batching
lcsh-rameau, 400 iterations; scatter and interleave.
1450 seconds for 1 thread; 115 seconds for 40 threads. (Figure: speedup vs. threads, up to 80 threads.)
42.
Ongoing work
Better memory handling: numactl and affinity are insufficient for full scaling.
Better models: these get to be much bigger computations.
Distributed memory: trying to get an MPI version; looking into GraphLab.
43.
PageRank was created by Google to rank web-pages
The model:
1. follow edges uniformly with probability α, and
2. randomly jump with probability 1 - α; we'll assume everywhere is equally likely, so v = [1/n, …, 1/n]^T with e^T v = 1.
P is the column-stochastic transition matrix of the web graph (e^T P = e^T); the slide shows a 6-node example matrix. Dangling nodes are patched back to v (ignored in the algorithms later).
Markov chain: (αP + (1 - α)ve^T) x = x has a unique solution with x ≥ 0, e^T x = 1. Equivalently, the linear system (I - αP) x = (1 - α)v. The places we find the surfer most often are important pages.
[Slide from Gleich, Purdue UTRC Seminar.]
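The model can be sketched as a power iteration on αP + (1 - α)ve^T (a made-up 3-page graph, not an example from the talk):

```python
import numpy as np

# Column-stochastic P for a toy graph: page 0 links to 1 and 2,
# page 1 links to 2, page 2 links to 0 (each column sums to 1).
P = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
alpha = 0.85
n = 3
v = np.full(n, 1.0 / n)    # uniform jump vector, e^T v = 1

x = v.copy()
for _ in range(200):       # power iteration: x <- alpha P x + (1 - alpha) v
    x = alpha * (P @ x) + (1.0 - alpha) * v
```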
45.
Multicore PageRank … similar story …
Serialized preprocessing; parallelize the linear algebra via an asynchronous Gauss-Seidel iterative method.
~10x scaling on the same (80-core) machine (1M nodes, 15M edges, synthetic).
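A serial Gauss-Seidel sweep for the PageRank linear system (I - αP)x = (1 - α)v can be sketched as follows; the slide's version runs these updates asynchronously across threads (the toy 3-page graph here is made up):

```python
import numpy as np

# Gauss-Seidel for (I - alpha P) x = (1 - alpha) v: sweep the equations
# in order, updating x[i] in place with the newest neighbor values.
P = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])   # column-stochastic toy graph
alpha, n = 0.85, 3
v = np.full(n, 1.0 / n)

x = v.copy()
for sweep in range(100):
    for i in range(n):
        # row i of the system solved for x[i]; P[i, i] = 0 here,
        # so the diagonal coefficient is 1
        x[i] = (1.0 - alpha) * v[i] + alpha * (P[i] @ x)
```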
46.
Questions?
Papers on my webpage: www.cs.purdue.edu/homes/dgleich
Codes: github.com/arbenson/mrtsqr, www.cs.purdue.edu/homes/dgleich/codes/netalignmc, github.com/dgleich/prpack