MapReduce Tall-and-skinny QR and applications

653 views
535 views

Published on

A talk at the SIMONS workshop on Parallel and Distributed Algorithms for Inference and Optimization on how to do tall-and-skinny QR factorizations on MapReduce using a communication avoiding algorithm.

Published in: Education, Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
653
On SlideShare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
13
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

MapReduce Tall-and-skinny QR and applications

  1. 1. A1 A2 A3 A4 Tall-and-skinny ! QRs and SVDs in MapReduce David F. Gleich! Computer Science! Purdue University! David Gleich · Purdue Simons PDAIO 1 Yangyang Hou " Joe Nichols – U. of Minn Purdue, CS James Demmel " Austin Benson " UC Berkeley Stanford University Joe Ruthruff " Paul G. Constantine Jeremy Templeton" Col. School. Mines " Sandia CA Mainly funded by Sandia National Labs CSAR project, recently by NSF CAREER," and Purdue Research Foundation.
  2. 2. David Gleich · Purdue Simons PDAIO 2 Big " simulation data
  3. 3. Nonlinear heat transfer model in random media Apply heat We did 8192 runs (128 samples of bubble locations, 64 bubble radii) 4.5 TB of data in Exodus II (NetCDF) https://www.opensciencedatacloud.org/ publicdata/heat-transfer/ David Gleich · Purdue Simons PDAIO 3 Look at temperature Each run takes 5 hours on 8 processors, outputs 4M (node) by 9 (time-step) simulation
  4. 4. Non-insulator regime 1 True ROM RS 0.8 0.7 1 0.6 0.5 0.5 0.4 0 15 0.3 20 25 0.2 0.1 0 0 Insulator regime 10 20 30 40 Bubble radius 50 60 David Gleich · Purdue Simons PDAIO 4 Proportion of temp. > 475 K Proportion of temp. > 475 K 0.9
  5. 5. S VT Extract 128 x 128 face to laptop A U UF S VT 5B-by-64 matrix 2.2TB David Gleich · Purdue Simons PDAIO 5 Each simulation is a column SVD
  6. 6. Non-insulator regime 1 True ROM RS 0.8 0.7 1 0.6 0.5 0.5 0.4 0 15 0.3 20 25 0.2 0.1 0 0 Insulator regime 10 20 30 40 Bubble radius 50 60 David Gleich · Purdue Simons PDAIO 6 Proportion of temp. > 475 K 0.9
  7. 7. (a) Error, s = 0.39 cm . GLEICH, Y. HOU, AND J. TEMPLETON 20 (b) Std, s = 0.39 cm P. G. CONSTANTINE, D. F. GLEICH, Y. HOU, AND J. TEMPLETON imately 8-15 to the We collected precise timing information, but we do not report model compared hours. prediction standard deubbleit as these at the finalfrom afor two values of locations times are time multi-tenant, unoptimized Hadoop cluster Simons other David Gleich · Purdue where PDAIO 7 s s R(s, ⌧ ) ¯ E(s, ⌧ ) ¯ 0.08 16 1.00e-04 0.23 15 2.00e-04 0.39 14 4.00e-04 59 error std std 202 error 0.55 13 206.00e-04 55 2 −1 10 51 1 1 10 0.70 13 8.00e-04 47 00 0 0 −1.5 0.86 12 1.10e-03 43 39 1.01 11 1.50e-03 (c) Error, s = 1.95 cm (d) Std, s = 1.95 cm (b) Std, s = 0.39 cm −2 35 1.17 10 2.10e-03 31 Fig. 4.5: Error in the reduce order model compared to the prediction standard de27 1.33 9 3.10e-03 −2.5 viation for one realization of the bubble locations at the final time for two values of 23 1.48 8 4.50e-03 19 the bubble radius, s = 0.39 and s = 1.95−3 cm. (Colors are visible in the electronic 1.64 8 6.50e-03 15 version.) 11 1.79 7 8.20e-03 −3.5 7 1.95 7 1.07e-02 3 0 0.2 0.4 0.6 0.8 2.11 6 1.23e-02 τ ¯ error std 20 the varying conductivity fields took approximately twenty minutes to construct using 20 2.26 6 1.39e-02 10 Cubit after substantial optimizations. Fig. 4.4: The log of the relative10error in 0 Working with the simulation data involved Table 4.1: post-processing steps: the mean prediction of the ROM0 as a func-a few pre- and The split and the correspond interpret 4TB of Exodus II files from Aria, globally transpose the data, compute the Constantine, Gleich,! (d) the threshold ¯ tion of s and Std, s = 1.95 cm ⌧ . (Colors are ing ROM error for ⌧ = 0.55 and di↵eren ¯ TSSVD, and compute predictions and errors. The preprocessing steps took approxvisible in the electronic version.) values of Hou & Templeton arXiv 2013. s.
  8. 8. Dynamic Mode Decomposition One simulation, ~10TB of data, compute the SVD of a space-by-time matrix. David Gleich · Purdue Simons PDAIO 8 DMD video
  9. 9. Is this BIG Data? BIG Data has two properties - too big for one hard drive A - ‘skewed’ distribution BIG Data = “Big Internet Giant” Data BIG Data = “Big In’Gineering” Data David Gleich · Purdue Simons PDAIO 9 “Engineering”
  10. 10. A matrix A : m × n, m ≥ n! is tall and skinny when O(n2) ! work and storage is “cheap” compared to m. David Gleich · Purdue Simons PDAIO 10 -- Austin Benson
  11. 11.    is given by the solution of    Quick review of QR Let A : m × n, (   ≥ n, real orthogonal m ) A = QR Q is m × n orthogonal (QT Q = I ) upper n upper triangular R is n × triangular. A = Q QR is block normalization “normalize” a vector usually generalizes to computing    in the QR 0 R MapReduce 2011 David Gleich · Purdue Simons PDAIO 11 andia) , real
  12. 12. Tall-and-skinny SVD and RSVD R    SVD Let A : m × n, m ≥ n, real A = U𝞢VT TSQR U is m × n orthogonal Q 𝞢 is m × n nonneg, diag. V is n × n orthogonal David Gleich · Purdue Simons PDAIO 12 A V
  13. 13. There are good MPI implementations. David Gleich · Purdue Simons PDAIO 13 What’s left to do?
  14. 14. David Gleich · Purdue Simons PDAIO 14 Moving data to an MPI cluster may be infeasible or costly
  15. 15. How to store tall-and-skinny matrices in Hadoop A1 A2 A3 A4 David Gleich · Purdue Simons PDAIO 15 A : m x n, m ≫ n Key is an arbitrary row-id Value is the 1 x n array " for a row (or b x n block) Each submatrix Ai is an " the input to a map task.
  16. 16. Still, isn’t this easy to do? Current MapReduce algs use the normal equations T A = QR A A A1 A2 A3 A4 Cholesky T !R R Map! Aii to AiTAi Reduce! RTR = Sum(AiTAi) Map 2! Aii to Ai R-1 Q = AR 1 Two problems! R inaccurate if A illconditioned Q not numerically orthogonal (House-" holder assures this) David Gleich · Purdue Simons PDAIO 16
  17. 17. Numerical stability was a problem for prior approaches 0 10 −5 AR-1 10 −10 10 −15 10 0 10 5 10 10 10 15 10 20 10 Condition number David Gleich · Purdue Simons PDAIO 17 Previous methods couldn’t ensure that the matrix Q was orthogonal norm ( QTQ – I ) Prior work
  18. 18. Four things that are better 1.  A simple algorithm to compute R accurately. (but doesn’t help get Q orthogonal). 2.  “Fast algorithm” to get Q numerically orthogonal in most cases. 3.  Multi-pass algorithm to get Q numerically orthogonal in virtually all cases. Constantine & Gleich MapReduce 2011 Benson, Gleich & Demmel IEEE BigData 2013 David Gleich · Purdue Simons PDAIO 18 4.  A direct algorithm for a numerically orthogonal Q in all cases.
  19. 19. Numerical stability was a problem for prior approaches 5 1. Constantine & Gleich, MapReduce 2011 0 Prior work 10 AR-1 −5 10 2. Benson, Gleich, Demmel, BigData’13 -1 AR + " nt refineme iterative −10 10 4. Direct TSQR Benson, Gleich, " Demmel, BigData’13 −15 10 0 10 5 10 10 10 15 10 Condition number 20 3. Benson, Gleich, 10 Demmel, BigData’13 David Gleich · Purdue Simons PDAIO 19 Previous methods couldn’t ensure that the matrix Q was orthogonal norm ( QTQ – I ) 10
  20. 20. MapReduce is great for TSQR!! You don’t need  ATA Data A tall and skinny (TS) matrix by rows Input 500,000,000-by-50 matrix" Each record 1-by-50 row" HDFS Size 183.6 GB Time to compute read A 253 sec. write A 848 sec.! Time to compute  R in qr(A) 526 sec. w/ Q=AR-1 1618 sec. " Time to compute Q in qr(A) 3090 sec. (numerically stable)! David Gleich · Purdue Simons PDAIO 20
  21. 21. Communication avoiding QR Communication avoiding TSQR (Demmel et al. 2008) Second, compute a QR factorization of the new “R” 21 First, do QR factorizations of each local matrix    Demmel et al.David Communicating avoiding Simons PDAIO QR. 2008. Gleich · Purdue parallel and sequential
  22. 22. Serial QR factorizations! Fully serialet al. 2008) (Demmel TSQR David Gleich · Purdue Simons PDAIO Demmel et al. 2008. Communicating avoiding parallel and sequential QR. 22 Compute QR of    , read    , update QR, …
  23. 23. Communication avoiding QR (Demmel et al. 2008) ! on MapReduce (Constantine and Gleich, 2011) Algorithm Data Rows of a matrix A1 A2 A2 qr Q2 R2 A3 A3 Map QR factorization of rows Reduce QR factorization of rows qr Q3 A4 A4 A5 Serial TSQR A6 qr Q6 Serial TSQR Q4 R4 emit qr Q8 R8 emit R6 A7 A7 qr Q7 R7 A8 A8 Reducer 1 qr A5 A6 Mapper 2 R3 R4 R4 R8 R8 qr Q R emit David Gleich · Purdue Simons PDAIO 23 Mapper 1 Serial TSQR A1
  24. 24. Too many maps cause too much data to one reducer! map emit R4 R2,2 emit Reducer 1-2 Serial TSQR reduce S3 A2 S(2) A2 R2,3 shuffle R3 S A2 reduce emit identity map S(1) R2,1 Reducer 1-1 Serial TSQR reduce Mapper 1-3 Serial TSQR map A3 4 emit Mapper 1-2 Serial TSQR map A3 R2 shuffle A S1 Mapper 1-1 Serial TSQR map A2 reduce emit R emit Reducer 2-1 Serial TSQR emit Reducer 1-3 Serial TSQR emit Mapper 1-4 Serial TSQR Iteration 1 Iteration 2 David Gleich · Purdue Simons PDAIO 24 A1 R1
  25. 25. David Gleich · Purdue Simons PDAIO 25 Getting Q
  26. 26. Numerical stability was a problem for prior approaches 5 1. Constantine & Gleich, MapReduce 2011! 0 Prior work 10 AR-1 AR-1 −5 10 2. Benson, Gleich, Demmel, BigData’13 -1 AR + " nt refineme iterative −10 10 4. Direct TSQR Benson, Gleich, " Demmel, BigData’13 −15 10 0 10 5 10 10 10 15 10 Condition number 20 3. Benson, Gleich, 10 Demmel, BigData’13 David Gleich · Purdue Simons PDAIO 26 Previous methods couldn’t ensure that the matrix Q was orthogonal norm ( QTQ – I ) 10
  27. 27. Iterative refinement helps Mapper 1 R-1 A1 A2 R TSQR R-1 Q2 Q3 Q4 T TSQR T R A4 R-1 Q1 te ribu Dist te ribu Dist A3 R-1 Mapper 2 T-1 Q1 Q2 Q3 Q4 A4 T-1 T-1 T-1 Q1 Q2 Q3 Q4 David Gleich · Purdue Simons PDAIO 27 Iterative refinement is like using Newton’s method to solve Ax = b. It’s folklore that “two iterations of iterative refinement are enough”
  28. 28. What if iterative refinement is too slow? Sample Mapper 1 R-1 A1 S A2 R-1 Q2 Q3 Q4 T TSQR TR A4 R-1 Q1 te ribu Dist " QR, pute Com ibute R distr A3 R-1 Mapper 2 R-1 T-1 A1 Q1 A2 A3 A4 R-1 T-1 R-1 T-1 R-1 T-1 Q2 Q3 Q4 Based on recent work by “random matrix” community on approximating QR with a random subset of rows. Also assumes that you can get a subset of rows “cheaply” – possible, but nontrivial in Hadoop. David Gleich · Purdue Simons PDAIO 28 Estimate the “norm” by S
  29. 29. Numerical stability was a problem for prior approaches 5 1. Constantine & Gleich, MapReduce 2011 0 Prior work 10 AR-1 AR-1 −5 10 2. Benson, Gleich, Demmel, BigData’13! -1 AR + " nt refineme iterative −10 10 4. Direct TSQR Benson, Gleich, " Demmel, BigData’13 −15 10 0 10 5 10 10 10 15 10 Condition number 20 3. Benson, Gleich, 10 Demmel, BigData’13! David Gleich · Purdue Simons PDAIO 29 Previous methods couldn’t ensure that the matrix Q was orthogonal norm ( QTQ – I ) 10
  30. 30. Mapper 1 Q2 A3 Q3 A4 Q4 R1 R2 R3 R4 R2 Q21 R3 Q31 R4 Q41 Mapper 3 Q11 Q1 Task 2 Q11 R 2. Collect R on one node, compute Qs for each piece Q output A2 Q1 R output A1 R1 3. Distribute the pieces of Q*1 and form the true Q Q2 Q3 Q4 Q21 Q31 Q41 Q1 Q2 Q3 Q4 1. Output local Q and R in separate files David Gleich · Purdue Simons PDAIO 30 Recreate Q by storing the history of the factorization
  31. 31. Theoretical lower bound on runtime for a few cases on our small cluster R-only R-only + no IR + PIR R-only + IR Direct TSQR 4.0B 4 1803 1821 1821 2343 2525 2.5B 10 1645 1655 1655 2062 2464 0.6B 25 804 812 812 1000 1237 50 1240 1250 1250 1517 2103 Cols Old R-only R-only + no IR + PIR R-only + IR Direct TSQR 4.0B 4 2931 3460 3620 4741 6128 2.5B 10 2508 2509 3354 4034 4035 0.6B 25 1098 1104 1476 2006 1910 0.5B 50 921 1618 1960 2655 3090 All values in seconds Only two params needed – read and write bandwidth for the cluster – in order to derive a performance model of the algorithm. This simple model is almost within a factor of two of the true runtime. " (10-node cluster, 60 disks) David Gleich · Purdue Simons PDAIO 31 Old Rows Actual Cols 0.5B Model Rows
  32. 32. Papers Constantine & Gleich, MapReduce 2011 Benson, Gleich & Demmel, BigData’13 Constantine & Gleich, ICASSP 2012 Constantine, Gleich, Hou & Templeton, " arXiv 2013 Questions? BIG Bloody Imposing Graphs Building Impressions of Groundtruth Blockwise Independent Guesses https://github.com/arbenson/mrtsqr Best Implemented at Google " https://github.com/dgleich/simform David Gleich · Purdue Simons PDAIO 32 Code

×