# Simulation Informatics; Analyzing Large Scientific Datasets

A talk I gave at the Purdue CS&E Seminar Series.

Published in: Technology, Education
### Transcript

• 1. Simulation Informatics: Analyzing Large Datasets from Scientific Simulations. David F. Gleich (Purdue University, Computer Science Department); Paul G. Constantine (Stanford University); Joe Ruthruff & Jeremy Templeton (Sandia National Labs). David Gleich · Purdue CS&E Seminar
• 2. This talk is a story …
• 3. How I learned to stop worrying and love the simulation!
• 4. I asked … Can we do UQ on PageRank?
• 5. Google’s PageRank. The model: (1) follow edges uniformly with probability $\alpha$, and (2) randomly jump with probability $1 - \alpha$; we’ll assume everywhere is equally likely. The places we find the surfer most often are important pages.
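As a rough illustration of the random-surfer model on this slide, the sketch below iterates "follow an edge with probability α, otherwise jump uniformly" until the visit frequencies settle. The function name `pagerank`, the value α = 0.85, and the three-page graph are assumptions made for the example; they do not come from the talk.

```python
import numpy as np

def pagerank(P, alpha=0.85, v=None, tol=1e-12, max_iter=1000):
    """Fixed-point iteration for (I - alpha*P) x = (1 - alpha) v.

    P is assumed column-stochastic: P[i, j] is the probability of
    following an edge from page j to page i.
    """
    n = P.shape[0]
    v = np.full(n, 1.0 / n) if v is None else v   # uniform jump by default
    x = v.copy()
    for _ in range(max_iter):
        # Follow edges with probability alpha, jump with probability 1 - alpha.
        x_next = alpha * (P @ x) + (1.0 - alpha) * v
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

# Hypothetical 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
P = np.array([[0.0, 0.0, 0.5],
              [1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
x = pagerank(P)
```

The entries of `x` sum to one and can be read as the fraction of time the surfer spends on each page.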
• 6. PageRank sensitivity? Sensitivity to the links and to the jump has been examined and understood. Random alpha PageRank (RAPr), or PageRank meets UQ: the PageRank model is $(I - \alpha P)x = (1 - \alpha)v$. Model $\alpha$ as a random variable $A$, so that PageRank becomes the random vector $x(A)$, and look at $E[x(A)]$ and $\text{Std}[x(A)]$. Explored in Constantine and Gleich, WAW 2007; and Constantine and Gleich, J. Internet Mathematics 2011.
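One simple way to estimate $E[x(A)]$ and $\text{Std}[x(A)]$ is plain Monte Carlo: draw values of α, solve one PageRank system per draw, and take sample statistics. The sketch below does this under assumptions that are mine, not the talk's: α is drawn from a Beta(17, 3) distribution (mean 0.85), the graph is a made-up three-page example, and the helper names are hypothetical.

```python
import numpy as np

def pagerank_solve(P, alpha, v):
    """Direct solve of (I - alpha*P) x = (1 - alpha) v for one alpha."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * P, (1.0 - alpha) * v)

def random_alpha_pagerank(P, v, n_samples=2000, a=17.0, b=3.0, seed=0):
    """Monte Carlo estimate of E[x(A)] and Std[x(A)], A ~ Beta(a, b) (assumed)."""
    rng = np.random.default_rng(seed)
    alphas = rng.beta(a, b, size=n_samples)
    samples = np.array([pagerank_solve(P, alpha, v) for alpha in alphas])
    return samples.mean(axis=0), samples.std(axis=0)

# Hypothetical column-stochastic link matrix for three pages.
P = np.array([[0.0, 0.0, 0.5],
              [1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
v = np.full(3, 1.0 / 3)
mean_x, std_x = random_alpha_pagerank(P, v)
```

Pages with a large standard deviation relative to their mean are the ones whose rank depends most on the choice of α.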
• 7. Random alpha PageRank has a rigorous convergence theory.

| Method | Conv. | Work Required | What is N? |
| --- | --- | --- | --- |
| Monte Carlo | $1/\sqrt{N}$ | $N$ PageRank systems | number of samples from $A$ |
| Path Damping (without $\text{Std}[x(A)]$) | $r^{N+2}/N^{1+a}$ | $N + 1$ matrix vector products | terms of Neumann series |
| Gaussian Quadrature | $r^{2N}$ | $N$ PageRank systems | number of quadrature points |

$a$ and $r$ are parameters from $\text{Beta}(a, b, l, r)$.
• 8. Working with PageRank showed us how to treat UQ more generally …
• 9. We studied parameterized matrices: $A(s)x(s) = b(s)$, for example a discretized PDE with explicit parameters. The parameterized solution leads to either $A(J_1)x(J_1) = b(J_1)$, …, $A(J_N)x(J_N) = b(J_N)$, or $A_N(J_1)x_N(J_1) = b_N(J_1)$. Constantine, Gleich, and Iaccarino. Spectral Methods for Parameterized Matrix Equations, SIMAX, 2010. Constantine, Gleich, and Iaccarino. A factorization of the spectral Galerkin system for parameterized matrix equations: derivation and applications, SISC 2011: how to compute the Galerkin solution in a weakly intrusive manner.
• 10. Simulation: The Third Pillar of Science. 21st Century Science in a nutshell: Experiments are not practical or feasible. Simulate things instead. But do we trust the simulations? We’re trying! Model Fidelity; Verification & Validation (V&V); Uncertainty Quantification (UQ).
• 11. The message: Insight and confidence require multiple runs.
• 12. The problem: A simulation run ain’t cheap!
• 13. Another problem: It’s very hard to “modify” current codes.
• 14. Large scale nonlinear, time dependent heat transfer problem: $10^5$ nodes, $10^3$ time steps, 30 minutes on 16 cores. Questions: What is the probability of failure? Which input values cause failure?
• 15. It’s time to ask “What can science learn from Google?” - Wired Magazine (2008)
• 16. “We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.” - Wired (again). 21.1st Century Science in a nutshell? Simulations are too expensive. Let data provide a surrogate.
• 17. Our approach: Construct an interpolating reduced order model from a budget-constrained ensemble of runs for uncertainty and optimization studies.
• 18. That is, we store the runs (supercomputer, then data computing cluster, then engineer). Each multi-day HPC simulation generates gigabytes of data. A data cluster can hold hundreds or thousands of old simulations … enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification … and build the interpolant from the pre-computed data.
• 19. The Database. Input parameters $s$ map to the time history of a simulation $f$: $s_1 \to f_1$, $s_2 \to f_2$, …, $s_k \to f_k$. The simulation as a vector: $f(s) = [\,q(x_1, t_1, s), \ldots, q(x_n, t_1, s),\; q(x_1, t_2, s), \ldots, q(x_n, t_2, s),\; \ldots,\; q(x_n, t_k, s)\,]^T$, where each block of $n$ entries is a single simulation at one time step. The database as a matrix: $X = [\,f(s_1)\;\; f(s_2)\;\; \cdots\;\; f(s_p)\,]$.
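To make the "simulation as a vector, database as a matrix" layout concrete, here is a small sketch that stacks the time steps of each run into one long column and places the runs side by side. The function `simulate` is a cheap, made-up stand-in for an expensive solver, and the grid sizes are arbitrary assumptions.

```python
import numpy as np

def simulate(s, x, t):
    """Hypothetical stand-in for q(x, t, s); returns an array of shape (k, n)."""
    return np.exp(-s * t[:, None]) * np.sin(np.pi * x[None, :])

x = np.linspace(0.0, 1.0, 50)       # n = 50 mesh points
t = np.linspace(0.0, 1.0, 20)       # k = 20 time steps
params = np.linspace(0.5, 2.0, 8)   # p = 8 input parameter values

def f(s):
    # Row-major ravel lists all n mesh values at t_1, then all at t_2, and so on.
    return simulate(s, x, t).ravel()

# One column per run: X = [f(s_1) f(s_2) ... f(s_p)], shape (n*k, p).
X = np.column_stack([f(s) for s in params])
```

With this layout the matrix is tall and skinny: one row per (mesh point, time step) pair and one column per run.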
• 20. The interpolant. Let the data give you the basis: $X = [\,f(s_1)\;\; f(s_2)\;\; \cdots\;\; f(s_p)\,]$. Then find the right combination: $f(s) \approx \sum_{j=1}^{r} u_j \alpha_j(s)$. These $u_j$ are the left singular vectors from $X$! Motivation: This idea was inspired by the success of other reduced order models like POD, and by Paul’s residual minimizing idea.
• 21. Why the SVD? Let’s study a simple case. Let $X$ be the $m \times p$ matrix with entries $X_{ij} = g(x_i, s_j)$, and $X = U \Sigma V^T$. Then $g(x_i, s_j) = \sum_{\ell=1}^{r} U_{i,\ell}\, \sigma_\ell\, V_{j,\ell} = \sum_{\ell=1}^{r} u_\ell(x_i)\, \sigma_\ell\, v_\ell(s_j)$, which splits $x$ and $s$. For a general parameter $s$: $g(x_i, s) = \sum_{\ell=1}^{r} u_\ell(x_i)\, \sigma_\ell\, v_\ell(s)$ with $v_\ell(s) \approx \sum_{j=1}^{p} v_\ell(s_j)\, \phi_j^{(\ell)}(s)$. Treat each right singular vector as samples of the unknown basis functions. Interpolate $v$ any way you wish.
• 22. Method summary: Compute SVD of $X$. Compute interpolant of right singular vectors. Approximate a new value of $f(s)$!
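The three steps of the method summary can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions: a single scalar parameter, piecewise-linear interpolation via `np.interp` (the talk leaves the interpolation scheme open), and a toy function $g(x, s) = e^{-sx}$ in place of a real simulation. The names `build_model` and `predict` are hypothetical.

```python
import numpy as np

def build_model(X, params):
    """Step 1: SVD of the database. params must be increasing for np.interp."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    return U, sigma, Vt, np.asarray(params, dtype=float)

def predict(model, s):
    U, sigma, Vt, params = model
    # Step 2: each row of Vt holds samples v_l(s_1), ..., v_l(s_p); interpolate at s.
    v_s = np.array([np.interp(s, params, v_l) for v_l in Vt])
    # Step 3: f(s) is approximated by sum_l u_l * sigma_l * v_l(s).
    return U @ (sigma * v_s)

x = np.linspace(0.0, 1.0, 40)
params = np.linspace(0.5, 2.0, 9)
X = np.column_stack([np.exp(-s * x) for s in params])   # toy database
model = build_model(X, params)
approx = predict(model, 1.1)                             # a parameter not in the database
```

At a stored parameter value the prediction reproduces the stored column; between stored values its accuracy depends on the interpolation scheme and on how smooth the singular vectors are.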
• 23. A quiz! Which section would you rather try and interpolate, A or B?
• 24. How predictable is a singular vector? Folk Theorem (O’Leary 2011): The singular vectors of a matrix of “smooth” data become more oscillatory as the index increases. Implication: The gradient of the singular vectors increases as the index increases. Predictable: $v_1(s), v_2(s), \ldots, v_t(s)$. Unpredictable: $v_{t+1}(s), \ldots, v_r(s)$.
• 25. A refined method with an error model. Don’t even try to interpolate the unpredictable modes. $f(s) \approx \sum_{j=1}^{t(s)} u_j \alpha_j(s) + \sum_{j=t(s)+1}^{r} u_j \sigma_j \eta_j$, with $\eta_j \sim N(0, 1)$; the first sum is the predictable part and the second is the unpredictable part. $\text{Variance}[f] = \operatorname{diag}\left( \sum_{j=t(s)+1}^{r} \sigma_j^2\, u_j u_j^T \right)$. But now, how to choose $t(s)$?
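The error model treats the modes after index $t$ as independent Gaussian noise scaled by their singular values, so the pointwise variance is the diagonal of $\sum_{j>t} \sigma_j^2 u_j u_j^T$. The sketch below computes that diagonal without forming the full matrix. The function name, the random test matrix, and the fixed choice `t=3` are assumptions for illustration only.

```python
import numpy as np

def predict_with_variance(U, sigma, alpha_s, t):
    """Mean from the first t modes; variance from the remaining modes.

    alpha_s holds the coefficients alpha_j(s); only the first t are used.
    """
    mean = U[:, :t] @ alpha_s[:t]
    # diag(sum_j sigma_j^2 u_j u_j^T)_i = sum_j sigma_j^2 * U[i, j]^2
    variance = (U[:, t:] ** 2) @ (sigma[t:] ** 2)
    return mean, variance

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))                 # made-up database
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
alpha_s = sigma * Vt[:, 0]                       # coefficients that reproduce column 0
mean, variance = predict_with_variance(U, sigma, alpha_s, t=3)
```

Taking the square root of `variance` gives a per-entry standard deviation that can be drawn as an error bar around `mean`.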
• 26. Our current approach to choosing the predictability: $t(s)$ is the largest $\tau$ such that $\frac{1}{\sigma_1} \sum_{i=1}^{\tau} \sigma_i \left\| \frac{\partial v_i}{\partial s} \right\| < \text{threshold}$.
• 27. An experimental test case: A heat equation problem. Two parameters that control the material properties.
• 28. Experiments: 20 point, Latin hypercube sample.
• 29. Figure panels: Our Reduced Order Model; The Truth; Where the error is the worst.
• 30. A Large Scale Example: Nonlinear heat transfer model. 80k nodes, 300 time-steps. 104 basis runs. SVD of 24m x 104 data matrix. 500x reduction in wall clock time (100x including the SVD).
• 31. PART 2: Tall-and-skinny QR (and SVD) on MapReduce
• 32. Quick review of QR. QR Factorization: Let $A$ be $m \times n$, $m \ge n$, real. $A = QR$, where $Q$ is $m \times n$ orthogonal ($Q^T Q = I$) and $R$ is $n \times n$ upper triangular. Using QR for regression: the solution of $\min_x \|Ax - b\|$ is given by the solution of $Rx = Q^T b$. QR is block normalization: “normalize” a vector usually generalizes to computing $Q$ in the QR.
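The regression use of QR can be checked numerically in a few lines. The sketch below uses random data (an assumption for the example) and NumPy's reduced QR, then solves $Rx = Q^T b$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))   # tall matrix, m >= n
b = rng.standard_normal(200)

# Reduced QR: Q is 200-by-5 with orthonormal columns, R is 5-by-5 upper triangular.
Q, R = np.linalg.qr(A)

# Least squares solution of min ||Ax - b||: solve R x = Q^T b.
# (A triangular solver would exploit the structure of R; a general solve keeps this short.)
x = np.linalg.solve(R, Q.T @ b)
```

The result agrees with `np.linalg.lstsq(A, b, rcond=None)[0]` up to rounding.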
• 33. Intro to MapReduce. Originated at Google for indexing web pages and computing PageRank. The idea: Bring the computations to the data. Express algorithms in data-local operations. Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer. Data scalable. Fault-tolerance by design: input stored in triplicate; map output persisted to disk before shuffle; reduce input/output on disk.
• 34. Mesh point variance in MapReduce. (Figure: Run 1, Run 2, and Run 3, each with time steps T=1, T=2, T=3.)
• 35. Mesh point variance in MapReduce. 1. Each mapper outputs the mesh points with the same key. 2. Shuffle moves all values from the same mesh point to the same reducer. 3. Reducers just compute a numerical variance. Bring the computations to the data!
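The three steps can be mimicked in plain Python without a cluster. This is only a toy, in-memory imitation of map, shuffle, and reduce: the data layout (a dict keyed by mesh point and time step), the tiny data set, and the choice to pool values across runs and time steps with a population variance are all assumptions for the example.

```python
from collections import defaultdict
from statistics import pvariance

def mapper(run_id, run):
    """Emit (mesh_point, value) so that the mesh point is the shuffle key."""
    for (mesh_point, time_step), value in run.items():
        yield mesh_point, value

def shuffle(pairs):
    """Group all values with the same key, as the MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(mesh_point, values):
    """Compute the (population) variance of everything seen at one mesh point."""
    yield mesh_point, pvariance(values)

# Two hypothetical runs, two mesh points, two time steps.
runs = {
    1: {(0, 1): 1.0, (0, 2): 2.0, (1, 1): 5.0, (1, 2): 5.0},
    2: {(0, 1): 3.0, (0, 2): 4.0, (1, 1): 5.0, (1, 2): 5.0},
}
mapped = [kv for run_id, run in runs.items() for kv in mapper(run_id, run)]
variances = dict(kv for key, values in shuffle(mapped).items()
                 for kv in reducer(key, values))
```

On a real cluster the mapper would read each run where it is stored, which is the point of bringing the computations to the data.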
• 36. Communication avoiding QR: communication avoiding TSQR (Demmel et al. 2008). First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new “R”. Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
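The two-stage idea can be tried on a single machine with NumPy: factor each row block, stack the small R factors, and factor the stack. The function name `tsqr_r`, the block count, and the random test matrix are assumptions; this sketch computes only R, not Q.

```python
import numpy as np

def tsqr_r(A, n_blocks=4):
    """Two-stage TSQR sketch: local QRs, then one QR of the stacked R factors."""
    blocks = np.array_split(A, n_blocks, axis=0)
    # First: an independent QR of each local block, keeping only R.
    local_Rs = [np.linalg.qr(block, mode='r') for block in blocks]
    # Second: a QR of the stacked (n_blocks * n)-by-n matrix of R factors.
    return np.linalg.qr(np.vstack(local_Rs), mode='r')

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))
R = tsqr_r(A)
```

The result matches the R from a direct QR of `A` up to the signs of its rows, and only the small R factors need to be communicated between stages.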
• 37. Serial QR factorizations: fully serial TSQR (Demmel et al. 2008). Compute QR of $A_1$, read $A_2$, update QR, … Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
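The fully serial variant keeps one running R factor and folds in each new block of rows as it is read. Below is a minimal sketch; the function name `serial_tsqr_r`, the block size of 50 rows, and the random data are assumptions for the example.

```python
import numpy as np

def serial_tsqr_r(row_blocks):
    """Streaming TSQR sketch: keep a running R and update it block by block."""
    R = None
    for block in row_blocks:
        # Stack the current R on top of the newly read rows and refactor.
        stacked = block if R is None else np.vstack([R, block])
        R = np.linalg.qr(stacked, mode='r')
    return R

rng = np.random.default_rng(1)
A = rng.standard_normal((600, 4))
# Read the matrix 50 rows at a time, as if streaming it from disk.
R = serial_tsqr_r(A[i:i + 50] for i in range(0, 600, 50))
```

Only the current R and one block of rows are held in memory at any time.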
• 38. Tall-and-skinny matrix storage in MapReduce. Key is an arbitrary row-id. Value is the $1 \times n$ array for a row. Each submatrix $A_i$ is an input split.
• 39. Algorithm. Data: Rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows. Mapper 1 (Serial TSQR) reads $A_1, \ldots, A_4$, applies qr as each block arrives, and emits $R_4$. Mapper 2 (Serial TSQR) reads $A_5, \ldots, A_8$ in the same way and emits $R_8$. Reducer 1 (Serial TSQR) stacks $R_4$ and $R_8$, applies qr, and emits $R$.
• 40. Key Limitations. Computes only $R$ and not $Q$. Can get $Q$ via $Q = AR^{+}$ with another MR iteration (we currently use this for computing the SVD). Dubious numerical stability; iterative refinement helps. Working on better ways to compute $Q$ (with Austin Benson, Jim Demmel).
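The stability concern can be seen on a small ill-conditioned example. The sketch below forms $Q = AR^{-1}$ (the pseudoinverse reduces to the inverse when $A$ has full column rank, which is assumed here) and then applies one refinement pass by repeating the same procedure on the computed $Q$. Reading "iterative refinement" this way is my interpretation, and the function names and test matrix are hypothetical.

```python
import numpy as np

def q_from_r(A, R):
    """Q = A R^{-1}, computed by solving R^T Q^T = A^T."""
    return np.linalg.solve(R.T, A.T).T

def refine(A, R):
    """One refinement pass: refactor the computed Q and fold the correction into R."""
    Q1 = q_from_r(A, R)
    R2 = np.linalg.qr(Q1, mode='r')
    return q_from_r(Q1, R2), R2 @ R

rng = np.random.default_rng(0)
# Columns scaled from 1 down to 1e-5, so A is ill-conditioned.
A = rng.standard_normal((500, 6)) @ np.diag(10.0 ** -np.arange(6))
R = np.linalg.qr(A, mode='r')

Q1 = q_from_r(A, R)
Q2, R_new = refine(A, R)
err1 = np.linalg.norm(Q1.T @ Q1 - np.eye(6))   # loss of orthogonality before refinement
err2 = np.linalg.norm(Q2.T @ Q2 - np.eye(6))   # and after
```

Comparing `err1` and `err2` shows how far each computed $Q$ is from having orthonormal columns; the refined factor stays much closer to orthogonal on this kind of matrix.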
• 41. Full code in hadoopy

    import random, numpy, hadoopy

    class SerialTSQR:
        def __init__(self, blocksize, isreducer):
            self.bsize = blocksize
            self.data = []
            # act as a reducer or as a mapper when the object is called
            if isreducer: self.__call__ = self.reducer
            else: self.__call__ = self.mapper

        def compress(self):
            # replace the buffered rows by the R factor of their QR factorization
            R = numpy.linalg.qr(numpy.array(self.data), 'r')
            # reset data and re-initialize to R
            self.data = []
            for row in R:
                self.data.append([float(v) for v in row])

        def collect(self, key, value):
            self.data.append(value)
            # compress once more than bsize * (number of columns) rows are buffered
            if len(self.data) > self.bsize * len(self.data[0]):
                self.compress()

        def close(self):
            # final compression, then emit the rows of R under random keys
            self.compress()
            for row in self.data:
                key = random.randint(0, 2000000000)
                yield key, row

        def mapper(self, key, value):
            self.collect(key, value)

        def reducer(self, key, values):
            for value in values: self.mapper(key, value)

    if __name__ == '__main__':
        mapper = SerialTSQR(blocksize=3, isreducer=False)
        reducer = SerialTSQR(blocksize=3, isreducer=True)
        hadoopy.run(mapper, reducer)
• 42. Too many maps? Add an iteration! (Diagram: in Iteration 1, Mappers 1-1 through 1-4 each run Serial TSQR on $A_1, \ldots, A_4$ and emit $R_1, \ldots, R_4$; after a shuffle, Reducers 1-1 through 1-3 run Serial TSQR and emit $R_{2,1}$, $R_{2,2}$, $R_{2,3}$. In Iteration 2, an identity map and a shuffle send these to Reducer 2-1, which runs Serial TSQR and emits $R$.)
• 43. mrtsqr – summary of parameters. Blocksize: How many rows to read before computing a QR factorization, expressed as a multiple of the number of columns (see paper). Splitsize: The size of each local matrix. Reduction tree: The number of reducers and iterations to use.
• 44. Varying splitsize and the tree. Data: Synthetic.

| Cols. | Iters. | Split (MB) | Maps | Secs. |
| --- | --- | --- | --- | --- |
| 50 | 1 | 64 | 8000 | 388 |
| – | – | 256 | 2000 | 184 |
| – | – | 512 | 1000 | 149 |
| – | 2 | 64 | 8000 | 425 |
| – | – | 256 | 2000 | 220 |
| – | – | 512 | 1000 | 191 |
| 1000 | 1 | 512 | 1000 | 666 |
| – | 2 | 64 | 6000 | 590 |
| – | – | 256 | 2000 | 432 |
| – | – | 512 | 1000 | 337 |

Increasing split size improves performance (accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with 64-MB split size overloaded the single reducer.)
• 45. MapReduce TSQR summary. MapReduce is great for TSQR! Data: A tall and skinny (TS) matrix by rows. Map: QR factorization of local rows. Reduce: QR factorization of local rows. Demmel et al. showed that this construction works to compute a QR factorization with minimal communication. Input: 500,000,000-by-100 matrix. Each record: 1-by-100 row. HDFS size: 423.3 GB. Time to compute the norm of each column: 161 sec. Time to compute $R$ in qr($A$): 387 sec. On a 64-node Hadoop cluster with 4x2TB, one Core i7-920, 12GB RAM/node.
• 46. Our vision: To enable analysts and engineers to hypothesize from data computations instead of expensive HPC computations. Paul G. Constantine; Sandia: Jeremy Templeton, Joe Ruthruff; … and you? …