# Tall-and-skinny QR factorizations in MapReduce architectures

3,300 views · Published in: Education, Technology

An update on how to compute tall-and-skinny QR factorizations in MapReduce

### Tall-and-skinny QR factorizations in MapReduce architectures

1. Tall-and-Skinny QR Factorizations in MapReduce. David F. Gleich (Computer Science Department, Purdue University), Paul G. Constantine (Stanford University), Austin Benson and James Demmel (UC Berkeley). David Gleich · Purdue ICME Seminar
2. Tall-and-skinny matrices (m ≫ n) arise in: regression with many samples, block iterative methods, panel factorizations, model reduction problems, general linear models with many samples, and tall-and-skinny SVD/PCA (e.g., from the tinyimages collection). All of these applications need a QR factorization of a tall-and-skinny matrix; some only need R.
3. MapReduce is great for TSQR! You don't need AᵀA. Data: a tall-and-skinny (TS) matrix by rows. Map: QR factorization of local rows. Reduce: QR factorization of local rows. Input: a 500,000,000-by-100 matrix; each record a 1-by-100 row; HDFS size 423.3 GB. Time to compute Ae (column sums): 161 sec. Time to compute R in qr(A): 387 sec. Time to compute Q in qr(A) stably: stay tuned.
4. Quick review of QR. Let A ∈ ℝ^(m×n) with m ≥ n. Then A = QR, where Q ∈ ℝ^(m×n) is orthogonal (QᵀQ = I) and R ∈ ℝ^(n×n) is upper triangular. Using QR for regression: the solution of min ‖Ax − b‖ is given by solving Rx = Qᵀb. QR is block normalization: "normalize a vector" usually generalizes to computing Q in the QR factorization.
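The review above is easy to check numerically; a minimal NumPy sketch on synthetic data (not from the talk):

```python
import numpy as np

# A tall-and-skinny matrix: many more rows than columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 5))
b = rng.standard_normal(10000)

# Thin QR: Q has orthonormal columns, R is small and upper triangular.
Q, R = np.linalg.qr(A)

# Least squares via QR: minimize ||Ax - b|| by solving R x = Q^T b.
x = np.linalg.solve(R, Q.T @ b)
```

Here `x` agrees with `np.linalg.lstsq(A, b)`, which is the sense in which QR solves the regression problem.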
5. Quiz for ICME qual-takers: is R unique?
6. PCA of 80,000,000 images. A is 80,000,000 images by 1000 pixels, with zero-mean rows. In MapReduce: TSQR produces R; post-processing computes the SVD of R to obtain the top 100 singular values and the principal components V. The first 16 columns of V are shown as images.
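The R-then-SVD pipeline on this slide can be sketched in NumPy; a toy stand-in for the image matrix (sizes invented):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 50))
A -= A.mean(axis=0)                      # center the data for PCA

# TSQR yields R without ever forming A^T A; the SVD of the small R
# then gives the singular values and principal components V of A.
R = np.linalg.qr(A, mode='r')
_, S, Vt = np.linalg.svd(R)

# The singular values match those of A itself.
S_full = np.linalg.svd(A, compute_uv=False)
```

Since A = QR with Q orthonormal, A and R share singular values and right singular vectors, so all the PCA post-processing happens on the tiny 50-by-50 factor.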
7. The principal components look like a 2D cosine transform. (Thanks, Wikipedia!)
8. The database. Input: the parameters of each simulation and its time history, s_1 → f_1, s_2 → f_2, …, s_k → f_k. A single simulation is a vector, stacking one time step at a time: f(s) = [ q(x_1, t_1, s); …; q(x_n, t_1, s); q(x_1, t_2, s); …; q(x_n, t_2, s); …; q(x_n, t_k, s) ]. The database is a very tall-and-skinny matrix X = [ f(s_1) f(s_2) … f(s_p) ].
9. MapReduce
10. Intro to MapReduce. Originated at Google for indexing web pages and computing PageRank. The idea: bring the computations to the data, and express algorithms in data-local operations. Implement one type of communication, shuffle, which moves all data with the same key to the same reducer. Data scalable: maps feed reducers through the shuffle. Fault tolerance by design: input stored in triplicate, reduce input/output on disk, map output persisted to disk before the shuffle.
11. Mesh point variance in MapReduce. (Three runs, each with time steps T=1, T=2, T=3.)
12. Mesh point variance in MapReduce. 1. Each mapper outputs the mesh points with the same key. 2. Shuffle moves all values from the same mesh point to the same reducer. 3. Reducers just compute a numerical variance. Bring the computations to the data!
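The three steps can be simulated in plain Python; the record layout and names below are invented for illustration:

```python
from collections import defaultdict
import statistics

# Hypothetical records: (run, time, mesh_point, value).
records = [
    ('run1', 1, 'p0', 2.0), ('run1', 2, 'p0', 4.0),
    ('run2', 1, 'p0', 6.0), ('run1', 1, 'p1', 1.0),
    ('run2', 1, 'p1', 3.0), ('run3', 1, 'p1', 5.0),
]

# Map: key each value by its mesh point.
mapped = [(point, value) for (_run, _t, point, value) in records]

# Shuffle: group all values with the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: a plain numerical variance per mesh point.
variance = {k: statistics.pvariance(v) for k, v in groups.items()}
```

The real system distributes the map and reduce calls; the per-key grouping is exactly what Hadoop's shuffle provides.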
13. MapReduce vs. Hadoop. MapReduce: a computation model with Map (a local data transform), Shuffle (a grouping function), and Reduce (an aggregation). Hadoop: an implementation of MapReduce using the HDFS parallel file system. Others: Phoenix++, Twister, Google MapReduce, Spark, …
14. That is, we store the runs. Each multi-day HPC simulation on a supercomputer generates gigabytes of data. A data computing cluster can hold hundreds or thousands of old simulations, enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification, and to build the interpolant from the pre-computed data.
15. Tall-and-Skinny QR
16. Communication-avoiding TSQR (Demmel et al. 2008). First, do QR factorizations of each local block; second, compute a QR factorization of the stacked "R" factors. (Demmel et al., Communication-avoiding parallel and sequential QR, 2008.)
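The two-stage construction is easy to verify with NumPy; synthetic blocks, not the talk's data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4000, 20))
blocks = np.split(A, 4)            # four "local" row blocks

# First stage: an independent QR on each block; keep only the R factors.
Rs = [np.linalg.qr(B, mode='r') for B in blocks]

# Second stage: one QR of the stacked small R factors.
R = np.linalg.qr(np.vstack(Rs), mode='r')

# Up to row signs, this matches the R from a direct factorization of A.
R_direct = np.linalg.qr(A, mode='r')
```

Only the tiny n-by-n R factors ever move between tasks, which is where the communication savings come from.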
17. Fully serial TSQR (Demmel et al. 2008). Compute the QR of the first block, read the next block, update the QR factorization, and so on. (Demmel et al., Communication-avoiding parallel and sequential QR, 2008.)
18. Tall-and-skinny matrix storage in MapReduce. Key: an arbitrary row id. Value: the array for a row. Each submatrix A_i is an input split. Ask Austin about block row storage!
19. Algorithm. Data: rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows. Mapper 1 (serial TSQR) reads A_1 through A_4, repeatedly factoring and carrying the R forward, and emits R_4. Mapper 2 does the same for A_5 through A_8 and emits R_8. Reducer 1 (serial TSQR) factors the stacked R_4 and R_8 and emits the final R.
20. Key limitations. Computes only R and not Q. Can get Q via Q = AR⁺ with another MapReduce iteration (we currently use this for computing the SVD), but it has dubious numerical stability; iterative refinement helps.
21. Why MapReduce?
22. Full code in hadoopy:

    ```python
    import random, numpy, hadoopy

    class SerialTSQR:
        def __init__(self, blocksize, isreducer):
            self.bsize = blocksize
            self.data = []
            if isreducer:
                self.__call__ = self.reducer
            else:
                self.__call__ = self.mapper

        def compress(self):
            R = numpy.linalg.qr(numpy.array(self.data), 'r')
            # reset data and re-initialize to R
            self.data = []
            for row in R:
                self.data.append([float(v) for v in row])

        def collect(self, key, value):
            self.data.append(value)
            if len(self.data) > self.bsize * len(self.data[0]):
                self.compress()

        def close(self):
            self.compress()
            for row in self.data:
                key = random.randint(0, 2000000000)
                yield key, row

        def mapper(self, key, value):
            self.collect(key, value)

        def reducer(self, key, values):
            for value in values:
                self.mapper(key, value)

    if __name__ == '__main__':
        mapper = SerialTSQR(blocksize=3, isreducer=False)
        reducer = SerialTSQR(blocksize=3, isreducer=True)
        hadoopy.run(mapper, reducer)
    ```
23. Fault injection. With 1/5 of tasks failing, the job only takes twice as long. (Figure: time to completion vs. 1/Prob(failure), the mean number of successes per failure, from 10 to 1000, for 200M-by-200 and 800M-by-10 matrices, with and without faults.)
24. How to get Q?
25. Idea 1 (unstable). TSQR computes R; distribute R, and each mapper forms its piece of Q as Q_i = A_i R⁺.
26. 26. There’s a famous quote that “two iterationsIdea 2 (better) of iterative reﬁnement are enough” attributed to Parlett Mapper 1 Mapper 2 R+ S+ A1 Q1 A1 Q1 R+ S+ A2 Q2 A2 Q2 R T TSQR TSQR R+ S+ Dist A3 Q3 A3 Q3 Dist RT S= ribu ribu R+ S+ te te " A4 Q4 A4 Q4 R 26 David Gleich Purdue ICME Seminar
27. Communication-avoiding TSQR (Demmel et al. 2008), revisited. First, do QR factorizations of each local block; second, compute a QR factorization of the stacked "R" factors. (Demmel et al., Communication-avoiding parallel and sequential QR, 2008.)
28. Idea 3 (best!). 1. Each mapper outputs its local Q and R in separate files. 2. Collect all the R_i on one node (Task 2); compute the final R and the small pieces Q_{i1} for each R_i. 3. Distribute the pieces Q_{i1} and form the true Q on each mapper as Q_i Q_{i1}.
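Idea 3 can be checked end-to-end in NumPy; synthetic data, variable names mine:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4000, 10))
blocks = np.split(A, 4)

# Step 1: local QR on each block; keep both factors.
local = [np.linalg.qr(B) for B in blocks]            # (Q_i, R_i) pairs

# Step 2: gather the small R_i on one task and factor the stack.
Rs = np.vstack([Ri for (_Qi, Ri) in local])
Q_small, R = np.linalg.qr(Rs)

# Step 3: split Q_small into per-block pieces and form Q_i @ Q_small_i.
pieces = np.split(Q_small, 4)
Q = np.vstack([Qi @ Pi for (Qi, _Ri), Pi in zip(local, pieces)])
```

Each block of the final Q is a product of two locally held factors, so no tall matrix is ever gathered in one place.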
29. Achieving numerical stability. (Figure: ‖QᵀQ − I‖ versus condition number, 10² to 10⁸, for AR⁺, AR⁺ with iterative refinement, and Full TSQR.)
30. The price is right! Full TSQR is faster than refinement for few columns, and not any slower for many columns. (Figure: runtimes in seconds, roughly 500 to 2500.)
31. Improving performance
32. Lots of data? Too many maps? Add an iteration! Iteration 1: mappers 1-1 through 1-4 run serial TSQR on the input splits and emit R_1, …, R_4; after a shuffle, reducers 1-1 through 1-3 run serial TSQR and emit intermediate factors S⁽¹⁾. Iteration 2: an identity map, then reducer 2-1 runs serial TSQR on S⁽¹⁾ and emits S⁽²⁾, the final R.
33. mrtsqr: summary of parameters. Blocksize: how many rows to read before computing a QR factorization, expressed as a multiple of the number of columns. Splitsize: the size of each local matrix. Reduction tree: the number of reducers and iterations to use. (See paper.)
34. Varying splitsize and the reduction tree (synthetic data).

    | Cols. | Iters. | Split (MB) | Maps | Secs. |
    |------:|-------:|-----------:|-----:|------:|
    | 50    | 1      | 64         | 8000 | 388   |
    | –     | –      | 256        | 2000 | 184   |
    | –     | –      | 512        | 1000 | 149   |
    | –     | 2      | 64         | 8000 | 425   |
    | –     | –      | 256        | 2000 | 220   |
    | –     | –      | 512        | 1000 | 191   |
    | 1000  | 1      | 512        | 1000 | 666   |
    | –     | 2      | 64         | 6000 | 590   |
    | –     | –      | 256        | 2000 | 432   |
    | –     | –      | 512        | 1000 | 337   |

    Increasing split size improves performance (it accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with a 64-MB split size overloaded the single reducer.)
35. Currently investigating how to do multiple iterations with Full TSQR.
36. MapReduce TSQR summary. MapReduce is great for TSQR! Data: a tall-and-skinny matrix by rows. Map: QR factorization of local rows. Reduce: QR factorization of local rows. Demmel et al. showed that this construction computes a QR factorization with minimal communication. Input: a 500,000,000-by-100 matrix; each record a 1-by-100 row; HDFS size 423.3 GB. Time to compute the norm of each column: 161 sec. Time to compute R in qr(A): 387 sec. On a 64-node Hadoop cluster (one Core i7-920, 12 GB RAM, 4×2 TB disks per node).
37. Do I have to write in Java?
38. Hadoop streaming performance. Synthetic data test: a 100,000,000-by-500 matrix (~500 GB), stored as TypedBytes lists of doubles. Codes implemented in MapReduce streaming; the Python frameworks use NumPy+ATLAS; a custom C++ TypedBytes reader/writer with ATLAS; plus a new non-streaming Java implementation.

    |         | Iter 1 QR (secs.) | Iter 1 total (secs.) | Iter 2 total (secs.) | Overall total (secs.) |
    |---------|------------------:|---------------------:|---------------------:|----------------------:|
    | Dumbo   | 67725             | 960                  | 217                  | 1177                  |
    | Hadoopy | 70909             | 612                  | 118                  | 730                   |
    | C++     | 15809             | 350                  | 37                   | 387                   |
    | Java    | –                 | 436                  | 66                   | 502                   |

    C++ in streaming beats a native Java implementation. (All timing results from the Hadoop job tracker.)
39. What's next? Investigate randomized algorithms for computing SVDs of fatter matrices (Halko, Martinsson, Tropp. SIREV 2011). Algorithm: Randomized PCA. Given an m × n matrix A, the number k of principal components, and an exponent q, this procedure computes an approximate rank-2k factorization UΣV*. The columns of U estimate the first 2k principal components of A.

    Stage A: (1) generate an n × 2k Gaussian test matrix Ω; (2) form Y = (AA*)^q AΩ by multiplying alternately with A and A*; (3) construct a matrix Q whose columns form an orthonormal basis for the range of Y.

    Stage B: (1) form B = Q*A; (2) compute an SVD of the small matrix: B = ÛΣV*; (3) set U = QÛ.
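The two-stage algorithm transcribes almost line-for-line into NumPy; a sketch, where the sizes and the rank-5 test matrix are invented:

```python
import numpy as np

def randomized_pca(A, k, q=2, seed=0):
    """Randomized PCA (Halko, Martinsson, Tropp): approximate rank-2k SVD."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, 2 * k))    # Stage A1: Gaussian test matrix
    Y = A @ Omega                              # Stage A2: Y = (A A*)^q A Omega,
    for _ in range(q):                         # multiplying alternately by A and A*
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                     # Stage A3: orthonormal basis for range(Y)
    B = Q.T @ A                                # Stage B1: small 2k-by-n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)  # Stage B2
    return Q @ Ub, s, Vt                       # Stage B3: U = Q Ub

# On an exactly rank-5 matrix, the leading singular values are recovered.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 200))
U, s, Vt = randomized_pca(A, k=5)
```

In practice a re-orthonormalization between power-iteration steps improves accuracy when the spectrum decays slowly; this sketch omits it for brevity.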
40. Questions? Most recent code at http://github.com/arbenson/mrtsqr
41. Varying blocksize.