Tall and Skinny QRs in MapReduce

This talk is a new update based on some of our recent results on doing Tall and Skinny QRs in MapReduce. In particular, the "fast" iterative refinement approximation based on a sample is new.



1. Tall-and-skinny QRs and SVDs in MapReduce. David F. Gleich, Computer Science, Purdue University. With Yangyang Hou (Purdue, CS); Paul G. Constantine, Austin Benson, Joe Nichols (Stanford University); James Demmel (UC Berkeley); Joe Ruthruff, Jeremy Templeton (Sandia CA). Mainly funded by the Sandia National Labs CSAR project, recently by NSF CAREER and the Purdue Research Foundation. ICML 2013.
2. Tall-and-skinny matrices (m ≫ n): many rows (like a billion) and a few columns (under 10,000). Used in regression and general linear models with many samples, block iterative methods, panel factorizations, approximate kernel k-means, and big-data SVD/PCA. (Example matrix from the tinyimages collection.)
3. A matrix A : m × n, m ≥ n, is tall and skinny when O(n²) work and storage is "cheap" compared to m. -- Austin Benson
4. Quick review of QR. Let A : m × n, m ≥ n, real. Then A = QR, where Q is m × n orthogonal (QᵀQ = I) and R is n × n upper triangular. QR is block normalization: "normalizing" a vector usually generalizes to computing R in the QR factorization.
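The properties on this slide can be checked in a few lines of numpy. This is a minimal illustration (not from the deck), using numpy's thin QR on a random tall-and-skinny matrix:

```python
import numpy

# A random tall-and-skinny matrix: many rows, few columns.
A = numpy.random.rand(1000, 5)
Q, R = numpy.linalg.qr(A)       # thin QR: Q is 1000 x 5, R is 5 x 5

assert numpy.allclose(Q.T @ Q, numpy.eye(5))   # Q^T Q = I
assert numpy.allclose(R, numpy.triu(R))        # R is upper triangular
assert numpy.allclose(Q @ R, A)                # A = Q R
```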
5. Tall-and-skinny SVD and RSVD. Let A : m × n, m ≥ n, real. Then A = UΣVᵀ, where U is m × n orthogonal, Σ is n × n nonnegative diagonal, and V is n × n orthogonal. Pipeline: A → TSQR → R → SVD of R → Σ and V (and U from Q).
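A small single-machine sketch of this pipeline (ours, not from the slides): one QR of the tall matrix stands in for TSQR, and the SVD of the tiny n × n factor R recovers the SVD of A:

```python
import numpy

A = numpy.random.rand(2000, 8)
Q, R = numpy.linalg.qr(A)            # stand-in for the TSQR step
Ur, s, Vt = numpy.linalg.svd(R)      # only O(n^3) work on an n x n matrix
U = Q @ Ur                           # left singular vectors of A

# Same singular values as a direct SVD of A, and U s Vt reproduces A.
assert numpy.allclose(s, numpy.linalg.svd(A, compute_uv=False))
assert numpy.allclose(U @ numpy.diag(s) @ Vt, A)
```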
6. There are good MPI implementations. What's left to do?
7. Moving data to an MPI cluster may be infeasible or costly.
8. Data computers I've used. Nebula Cluster @ Sandia CA: 2TB/core storage, 64 nodes, 256 cores, GB ethernet; cost $150k. Student Cluster @ Stanford: 3TB/core storage, 11 nodes, 44 cores, GB ethernet; cost $30k. Magellan Cluster @ NERSC: 128GB/core storage, 80 nodes, 640 cores, InfiniBand. These systems are good for working with enormous matrix data.
9. By 2013(?) all Fortune 500 companies will have a data computer.
10. How do you program them?
11. A graphical view of the MapReduce programming model. Map tasks read batches of data in parallel and do some initial filtering; the shuffle is a global communication, like a group-by or MPI_Alltoall; reduce is often where the computation happens.
12. MapReduce looks like the first step of any MPI program that loads data. Read data -> assign to process; receive data -> do compute.
13. Still, isn't this easy to do? Current MapReduce algorithms use the normal equations: form AᵀA, take its Cholesky factorization AᵀA = RᵀR, and set Q = AR⁻¹. Map: Aᵢ to AᵢᵀAᵢ. Reduce: RᵀR = Sum(AᵢᵀAᵢ). Map 2: Aᵢ to AᵢR⁻¹. Two problems: R is inaccurate if A is ill-conditioned, and Q is not numerically orthogonal (Householder QR assures this).
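The second problem is easy to reproduce. A sketch (ours, not the Hadoop code): build an A with condition number around 1e7, form R from the Cholesky factor of AᵀA, and compare the orthogonality of Q = AR⁻¹ against Householder QR:

```python
import numpy

numpy.random.seed(0)
m, n = 1000, 10
U, _ = numpy.linalg.qr(numpy.random.rand(m, n))
V, _ = numpy.linalg.qr(numpy.random.rand(n, n))
# Singular values from 1 down to 1e-7, so cond(A) ~ 1e7.
A = U @ numpy.diag(numpy.logspace(0, -7, n)) @ V.T

R = numpy.linalg.cholesky(A.T @ A).T          # A^T A = R^T R
Q = A @ numpy.linalg.inv(R)                   # Q = A R^{-1}
err_chol = numpy.linalg.norm(Q.T @ Q - numpy.eye(n))

Qh, _ = numpy.linalg.qr(A)                    # Householder-based QR
err_house = numpy.linalg.norm(Qh.T @ Qh - numpy.eye(n))

# The normal-equations Q loses orthogonality by many orders of magnitude.
assert err_chol > 1e3 * err_house
```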
14. Numerical stability was a problem for prior approaches. [Figure: norm(QᵀQ − I) versus condition number; the prior AR⁻¹ approach loses orthogonality as the condition number grows.] Previous methods couldn't ensure that the matrix Q was orthogonal.
15. Four things that are better: (1) a simple algorithm to compute R accurately (but it doesn't help get Q orthogonal); (2) a "fast" algorithm to get Q numerically orthogonal in most cases; (3) a multi-pass algorithm to get Q numerically orthogonal in virtually all cases; (4) a direct algorithm for a numerically orthogonal Q in all cases. (Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, arXiv 2013.)
16. Numerical stability was a problem for prior approaches. [Figure: norm(QᵀQ − I) versus condition number for AR⁻¹ (prior work: Constantine & Gleich, MapReduce 2011), AR⁻¹ with iterative refinement, and Direct TSQR (both Benson, Gleich & Demmel, submitted).]
17. How to store tall-and-skinny matrices in Hadoop. A : m × n, m ≫ n. The key is an arbitrary row-id; the value is the 1 × n array for a row (or a b × n block). Each submatrix Aᵢ is the input to a map task.
18. MapReduce is great for TSQR! You don't need AᵀA. Data: a tall-and-skinny matrix stored by rows. Input: a 500,000,000-by-50 matrix; each record is a 1-by-50 row; HDFS size 183.6 GB. Time to read A: 253 sec; to write A: 848 sec. Time to compute R in qr(A): 526 sec; with Q = AR⁻¹: 1618 sec. Time to compute Q in qr(A) (numerically stable): 3090 sec.
19. Communication-avoiding TSQR (Demmel et al. 2008). First, do QR factorizations of each local matrix; second, compute a QR factorization of the stacked "R" factors. (Demmel et al. 2008, Communication-avoiding parallel and sequential QR.)
20. Serial QR factorizations (Demmel et al. 2008). Fully serial TSQR: compute the QR of the first block, read the next block, update the QR, and so on.
21. Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2011). Data: rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows. [Diagram: each mapper runs serial TSQR on its blocks and emits an R factor; one reducer runs serial TSQR on the collected R factors and emits the final R.]
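The map/reduce structure on this slide can be sketched on a single machine (function and variable names are ours, not the paper's): "map" takes the QR of each row block and keeps only R; "reduce" stacks those R factors and takes one more QR:

```python
import numpy

def tsqr_r(A, nblocks=4):
    blocks = numpy.array_split(A, nblocks)                  # map inputs
    Rs = [numpy.linalg.qr(b, mode='r') for b in blocks]     # map: emit R_i
    return numpy.linalg.qr(numpy.vstack(Rs), mode='r')      # reduce: final R

A = numpy.random.rand(10000, 6)
# R is unique up to the signs of its rows, so compare absolute values.
assert numpy.allclose(numpy.abs(tsqr_r(A)),
                      numpy.abs(numpy.linalg.qr(A, mode='r')))
```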
22. Too many maps cause too much data to go to one reducer! From the tinyimages collection: 80,000,000 images of 1000 pixels. Each image is 5k, each HDFS block holds 12,800 images, and there are 6,250 total blocks. Each map outputs a 1000-by-1000 matrix, so one reducer gets a 6.25M-by-1000 matrix (50GB).
23. [Diagram: two-iteration TSQR. Iteration 1: mappers run serial TSQR on their blocks and emit R factors; after a shuffle, a first layer of reducers runs serial TSQR on groups of those factors. Iteration 2: an identity map and a single reducer produce the final R.]
24. Our robust algorithm to compute R.
25. Full TSQR code in hadoopy:

    import random, numpy, hadoopy

    class SerialTSQR:
        def __init__(self, blocksize, isreducer):
            self.bsize = blocksize
            self.data = []
            if isreducer:
                self.__call__ = self.reducer
            else:
                self.__call__ = self.mapper

        def compress(self):
            R = numpy.linalg.qr(numpy.array(self.data), 'r')
            # reset data and re-initialize to R
            self.data = []
            for row in R:
                self.data.append([float(v) for v in row])

        def collect(self, key, value):
            self.data.append(value)
            if len(self.data) > self.bsize * len(self.data[0]):
                self.compress()

        def close(self):
            self.compress()
            for row in self.data:
                key = random.randint(0, 2000000000)
                yield key, row

        def mapper(self, key, value):
            self.collect(key, value)

        def reducer(self, key, values):
            for value in values:
                self.mapper(key, value)

    if __name__ == '__main__':
        mapper = SerialTSQR(blocksize=3, isreducer=False)
        reducer = SerialTSQR(blocksize=3, isreducer=True)
        hadoopy.run(mapper, reducer)
26. Numerical stability was a problem for prior approaches. [The same figure as slide 16: AR⁻¹, AR⁻¹ with iterative refinement, and Direct TSQR.]
27. Iterative refinement helps. [Diagram: TSQR computes R, and R⁻¹ is distributed to mappers that compute Qᵢ = AᵢR⁻¹; a second TSQR of Q yields T, and a second map computes QᵢT⁻¹.] Iterative refinement is like using Newton's method to solve Ax = b. There's a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
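One refinement step can be sketched in numpy (ours, not the Hadoop code): start from the non-orthogonal Q = AR⁻¹ of the normal-equations approach, take a second QR of Q to get T, and use QT⁻¹ as the refined factor:

```python
import numpy

numpy.random.seed(1)
m, n = 1000, 10
U, _ = numpy.linalg.qr(numpy.random.rand(m, n))
V, _ = numpy.linalg.qr(numpy.random.rand(n, n))
A = U @ numpy.diag(numpy.logspace(0, -7, n)) @ V.T   # cond(A) ~ 1e7

R = numpy.linalg.cholesky(A.T @ A).T
Q1 = A @ numpy.linalg.inv(R)                  # first pass: Q = A R^{-1}
T = numpy.linalg.qr(Q1, mode='r')             # second pass: "TSQR" of Q
Q2 = Q1 @ numpy.linalg.inv(T)                 # refined Q = Q T^{-1}

I = numpy.eye(n)
err1 = numpy.linalg.norm(Q1.T @ Q1 - I)
err2 = numpy.linalg.norm(Q2.T @ Q2 - I)
assert err2 < 1e-8 < err1      # one step restores numerical orthogonality
```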
28. Numerical stability was a problem for prior approaches. [The same figure as slide 16, repeated.]
29. What if iterative refinement is too slow? [Diagram: sample rows S from A, compute the QR of the sample to estimate R, and distribute R⁻¹; mappers compute AᵢR⁻¹; a TSQR of the result gives T, and a second map computes AᵢR⁻¹T⁻¹.] Based on recent work by Ipsen et al. on approximating QR with a random subset of rows. This also assumes you can get a subset of rows "cheaply" -- possible, but nontrivial in Hadoop.
30. Orthogonality of the sampled solution. [Figure: norm(QᵀQ − I) versus condition number, comparing the sampled solution against AR⁻¹ (prior work: Constantine & Gleich, MapReduce 2011) and Direct TSQR (Benson, Gleich & Demmel, submitted).]
31. Four things that are better: (1) a simple algorithm to compute R accurately (but it doesn't help get Q orthogonal); (2) a fast algorithm to get Q numerically orthogonal in most cases; (3) a two-pass algorithm to get Q numerically orthogonal in virtually all cases; (4) a direct algorithm for a numerically orthogonal Q in all cases. (Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, arXiv 2013.)
32. Recreate Q by storing the history of the factorization. (1) Output the local Q and R factors in separate files. (2) Collect the R factors on one node and compute the small Q pieces for each one. (3) Distribute the pieces Qᵢ₁ and form the true Q. [Diagram: mappers emit Qᵢ and Rᵢ; task 2 factors the stacked Rᵢ into the Qᵢ₁ pieces and R; mapper 3 forms QᵢQᵢ₁.]
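The three steps above can be sketched on a single machine (names are ours): keep each block's local Qᵢ and Rᵢ, factor the stacked Rᵢ to get the final R and the small pieces Qᵢ₁, then rebuild the true Q block by block:

```python
import numpy

def direct_tsqr(A, nblocks=4):
    blocks = numpy.array_split(A, nblocks)
    locals_ = [numpy.linalg.qr(b) for b in blocks]          # step 1: Q_i, R_i
    Qsmall, R = numpy.linalg.qr(
        numpy.vstack([Ri for _, Ri in locals_]))            # step 2
    Qpieces = numpy.split(Qsmall, nblocks)                  # the n x n Q_i1 blocks
    Q = numpy.vstack([Qi @ Qi1
                      for (Qi, _), Qi1 in zip(locals_, Qpieces)])  # step 3
    return Q, R

A = numpy.random.rand(8000, 5)
Q, R = direct_tsqr(A)
assert numpy.allclose(Q @ R, A)                  # a true factorization
assert numpy.allclose(Q.T @ Q, numpy.eye(5))     # Q numerically orthogonal
```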
33. Theoretical lower bound on runtime for a few cases on our small cluster. All values in seconds. Only two parameters are needed (the cluster's read and write bandwidth) to derive a performance model of the algorithm. This simple model is almost within a factor of two of the true runtime (10-node cluster, 60 disks).

Model:

    Rows  Cols  Old   R-only+no IR  R-only+PIR  R-only+IR  Direct TSQR
    4.0B   4    1803  1821          1821        2343       2525
    2.5B  10    1645  1655          1655        2062       2464
    0.6B  25     804   812           812        1000       1237
    0.5B  50    1240  1250          1250        1517       2103

Actual:

    Rows  Cols  Old   R-only+no IR  R-only+PIR  R-only+IR  Direct TSQR
    4.0B   4    2931  3460          3620        4741       6128
    2.5B  10    2508  2509          3354        4034       4035
    0.6B  25    1098  1104          1476        2006       1910
    0.5B  50     921  1618          1960        2655       3090
34. My application is big simulation data.
35. Nonlinear heat transfer model in random media. Each run takes 5 hours on 8 processors; we did 8192 runs. 4.2M nodes, 9 time-steps, 128 realizations, 64 size parameters: the SVD of a 5B × 64 data matrix. 50x reduction in wall-clock time including all pre- and post-processing; "∞x" speedup for a particular study without pre-processing (under 60 sec on a laptop). Interpolating adaptive low-rank approximations. (Constantine, Gleich, Hou & Templeton, arXiv 2013.) [Figure 4.5: error in the reduced-order model compared to the prediction standard deviation, for one realization of the bubble locations at the final time, at bubble radii s = 0.39 cm and s = 1.95 cm.] Working with the simulation data involved interpreting 4TB of Exodus II files from Aria, with jobs ranging between 100GB and 2TB.
36. Dynamic Mode Decomposition. One simulation, ~10TB of data; compute the SVD of a space-by-time matrix. [DMD video]
37. Is double precision sufficient? Double-precision floating point was designed for the era where "big" was 1000-10000.

    s = 0; for i = 1 to n: s = s + x[i]

This simple summation formula has error that is not always small if n is a billion: fl(x + y) = (x + y)(1 + ε), so |fl(Σᵢ xᵢ) − Σᵢ xᵢ| ≤ n µ Σᵢ |xᵢ| with µ ≈ 10⁻¹⁶. Watch out for this issue in important computations.
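A concrete instance of this accumulation error (our example, not from the deck): summing n copies of 0.1 left to right drifts away from the correctly rounded sum that `math.fsum` computes:

```python
import math

n = 1_000_000
x = [0.1] * n

s = 0.0
for v in x:        # s = s + x[i], as on the slide
    s = s + v

exact = math.fsum(x)   # correctly rounded sum of the same values
assert s != exact      # the naive loop has already lost digits
```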
38. Papers: Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, arXiv 2013; Constantine & Gleich, ICASSP 2012; Constantine, Gleich, Hou & Templeton, arXiv 2013. Code: https://github.com/arbenson/mrtsqr and https://github.com/dgleich/simform. Questions?
