# Simulation Informatics; Analyzing Large Scientific Datasets

A talk I gave at the Purdue CS&E Seminar Series.



1. Simulation Informatics: Analyzing Large Datasets from Scientific Simulations. David F. Gleich, Purdue University, Computer Science Department; Paul G. Constantine, Stanford University; with Joe Ruthruff and Jeremy Templeton, Sandia National Labs. David Gleich · Purdue CS&E Seminar
2. This talk is a story …
3. How I learned to stop worrying and love the simulation!
4. I asked … Can we do UQ on PageRank?
5. PageRank by Google. The model: (1) follow edges uniformly with probability α, and (2) randomly jump with probability 1 − α; we'll assume everywhere is equally likely. The places we find the surfer most often are important pages.
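The surfer model on this slide fits in a few lines; below is a minimal power-iteration sketch, assuming a tiny hand-built 3-node cycle (the graph is illustrative, not the one pictured on the slide):

```python
import numpy as np

def pagerank(P, alpha=0.85, v=None, tol=1e-10):
    """Power iteration for x = alpha*P*x + (1-alpha)*v.

    P is column-stochastic: P[i, j] is the probability of following
    an edge from node j to node i.
    """
    n = P.shape[0]
    v = np.ones(n) / n if v is None else v   # uniform jump: everywhere equally likely
    x = v.copy()
    while True:
        x_next = alpha * (P @ x) + (1 - alpha) * v
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next

# A 3-node cycle: each node links to the next one.
P = np.array([[0., 0., 1.],
              [1., 0., 0.],
              [0., 1., 0.]])
x = pagerank(P, alpha=0.85)  # by symmetry, every node scores 1/3
```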
6. Random alpha PageRank, or PageRank meets UQ. The sensitivity of PageRank to the links is well examined and understood; the sensitivity to the jump parameter is not. So model PageRank with the damping parameter as a random variable A, solve (I − A P) x(A) = (1 − A) v, and look at E[x(A)] and Std[x(A)]. Explored in Constantine and Gleich, WAW 2007; and Constantine and Gleich, J. Internet Mathematics, 2011.
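E[x(A)] and Std[x(A)] can be estimated with plain Monte Carlo over the random damping parameter; a self-contained sketch, solving the PageRank linear system directly, with an illustrative graph and an illustrative Beta distribution for A:

```python
import numpy as np

def pagerank_x(P, alpha, v):
    """Solve (I - alpha*P) x = (1 - alpha) v directly."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * P, (1 - alpha) * v)

# Column-stochastic P: node 0 links to 1 and 2, node 1 to 2, node 2 to 0.
P = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
v = np.ones(3) / 3

rng = np.random.default_rng(0)
alphas = rng.beta(2.0, 16.0 / 3.0, size=2000)   # mean 0.6; parameters made up
samples = np.array([pagerank_x(P, a, v) for a in alphas])

Ex = samples.mean(axis=0)    # estimates E[x(A)]
Stdx = samples.std(axis=0)   # estimates Std[x(A)]
```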
7. Random alpha PageRank has a rigorous convergence theory. Monte Carlo: convergence 1/√N, work N PageRank systems, where N is the number of samples from A. Path damping: convergence r^(N+2)/(N+1), work N + 1 matrix-vector products (without Std[x(A)]), where N is the number of terms of the Neumann series. Gaussian quadrature: convergence r^(2N), work N PageRank systems, where N is the number of quadrature points. The rate r comes from the parameters of the Beta(a, b, l, r) distribution for A.
8. Working with PageRank showed us how to treat UQ more generally …
9. We studied parameterized matrices: A(s)x(s) = b(s), i.e., a discretized PDE with explicit parameters. Collocation solves A(λ_1)x(λ_1) = b(λ_1), …, A(λ_N)x(λ_N) = b(λ_N); the spectral Galerkin approach yields a parameterized solution x_N(s). Constantine, Gleich, and Iaccarino. Spectral Methods for Parameterized Matrix Equations, SIMAX, 2010. Constantine, Gleich, and Iaccarino. A factorization of the spectral Galerkin system for parameterized matrix equations: derivation and applications, SISC, 2011: how to compute the Galerkin solution in a weakly intrusive manner.
10. Simulation: the third pillar of science. 21st-century science in a nutshell: experiments are not practical or feasible, so simulate things instead. But do we trust the simulations? We're trying: model fidelity, verification & validation (V&V), uncertainty quantification (UQ).
11. The message: insight and confidence require multiple runs.
12. The problem: a simulation run ain't cheap!
13. Another problem: it's very hard to “modify” current codes.
14. A large-scale, nonlinear, time-dependent heat transfer problem: 10^5 nodes, 10^3 time steps, 30 minutes on 16 cores. Questions: What is the probability of failure? Which input values cause failure?
15. It's time to ask “What can science learn from Google?” - Wired Magazine (2008)
16. “We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.” - Wired (again). 21st-century science in a nutshell? Simulations are too expensive; let data provide a surrogate.
17. Our approach: construct an interpolating reduced order model from a budget-constrained ensemble of runs for uncertainty and optimization studies.
18. That is, we store the runs. Each multi-day HPC simulation generates gigabytes of data. A data computing cluster can hold hundreds or thousands of old simulations, enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification, and to build the interpolant from the pre-computed data.
19. The database. Input parameters map to the time history of a simulation: s_1 → f_1, s_2 → f_2, …, s_k → f_k. A single simulation becomes a vector by stacking every mesh value at every time step: f(s) = [q(x_1, t_1, s); …; q(x_n, t_1, s); q(x_1, t_2, s); …; q(x_n, t_2, s); …; q(x_n, t_k, s)]. The database is then a matrix: X = [f(s_1) f(s_2) … f(s_p)].
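The stacking on this slide is mechanical; a sketch with a toy stand-in solver (`run_simulation`, the sizes, and the parameter values are all illustrative):

```python
import numpy as np

n, k, p = 4, 3, 5   # mesh nodes, time steps, parameter samples (illustrative)

def run_simulation(s):
    """Stand-in for an expensive solver: returns q(x_i, t_j, s) as an n-by-k array."""
    x = np.linspace(0.0, 1.0, n)[:, None]
    t = np.linspace(0.0, 1.0, k)[None, :]
    return np.exp(-s * t) * np.sin(np.pi * x)

def as_vector(q):
    """Stack the time steps on top of each other: f(s) in R^(n*k)."""
    return q.reshape(-1, order='F')   # block j of the vector is time step j

params = np.linspace(0.5, 2.0, p)
X = np.column_stack([as_vector(run_simulation(s)) for s in params])
# X is (n*k)-by-p: one column per run, the database matrix on the slide.
```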
20. The interpolant. Let the data give you the basis: X = [f(s_1) f(s_2) … f(s_p)]. Then find the right combination: f(s) ≈ Σ_{j=1}^{r} u_j α_j(s), where the u_j are the left singular vectors of X. Motivation: this idea was inspired by the success of other reduced order models, like POD, and Paul's residual minimizing idea.
21. Why the SVD? Let's study a simple case: X = [g(x_i, s_j)] = U Σ V^T, so g(x_i, s_j) = Σ_{ℓ=1}^{r} U_{i,ℓ} σ_ℓ V_{j,ℓ} = Σ_{ℓ=1}^{r} u_ℓ(x_i) σ_ℓ v_ℓ(s_j): the SVD splits x and s. Treat each right singular vector as samples of an unknown basis function; then for a general parameter s, g(x_i, s) = Σ_{ℓ=1}^{r} u_ℓ(x_i) σ_ℓ v_ℓ(s), with v_ℓ(s) ≈ Σ_{j=1}^{p} v_ℓ(s_j) φ_j^{(ℓ)}(s). Interpolate v any way you wish.
22. Method summary: compute the SVD of X; compute an interpolant of the right singular vectors; approximate a new value of f(s).
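The three steps fit in a few lines of NumPy; a self-contained sketch with a toy solver, where linear interpolation of the right singular vectors stands in for "interpolate v any way you wish" (the solver, parameter grid, and rank r are illustrative):

```python
import numpy as np

n, k, p = 4, 3, 9
params = np.linspace(0.5, 2.0, p)

def f(s):
    """Toy simulation snapshot, stacked into one long vector."""
    x = np.linspace(0.0, 1.0, n)[:, None]
    t = np.linspace(0.0, 1.0, k)[None, :]
    return (np.exp(-s * t) * np.sin(np.pi * x)).reshape(-1, order='F')

X = np.column_stack([f(s) for s in params])

# Step 1: SVD of the database.
U, sig, Vt = np.linalg.svd(X, full_matrices=False)

# Steps 2 and 3: interpolate each right singular vector, then recombine.
def predict(s, r=3):
    coeffs = np.array([sig[l] * np.interp(s, params, Vt[l]) for l in range(r)])
    return U[:, :r] @ coeffs

s_new = 1.1   # a parameter value not in the ensemble
rel_err = np.linalg.norm(predict(s_new) - f(s_new)) / np.linalg.norm(f(s_new))
```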
23. A quiz! Which section would you rather try and interpolate, A or B?
24. How predictable is a singular vector? Folk theorem (O'Leary 2011): the singular vectors of a matrix of “smooth” data become more oscillatory as the index increases. Implication: the gradient of the singular vectors increases as the index increases. So v_1(s), v_2(s), …, v_t(s) are predictable; v_{t+1}(s), …, v_r(s) are unpredictable.
25. A refined method with an error model. Don't even try to interpolate the unpredictable modes: f(s) ≈ Σ_{j=1}^{t(s)} u_j α_j(s) (predictable) + Σ_{j=t(s)+1}^{r} u_j σ_j η_j (unpredictable), where η_j ~ N(0, 1), so that Variance[f] = diag(Σ_{j=t(s)+1}^{r} σ_j² u_j u_jᵀ). But now, how to choose t(s)?
26. Our current approach to choosing the predictability: t(s) is the largest τ such that Σ_{i=1}^{τ} (σ_i / σ_1) ‖∂v_i/∂s‖ ≤ threshold.
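The rule can be sketched with finite differences standing in for ∂v_i/∂s; the synthetic singular data and the threshold value below are illustrative:

```python
import numpy as np

def choose_t(sig, Vt, params, threshold):
    """Largest t with sum_{i<=t} (sig_i/sig_1) * max|dv_i/ds| below the threshold."""
    total, t = 0.0, 0
    for i in range(len(sig)):
        dv = np.gradient(Vt[i], params)          # finite-difference estimate of dv_i/ds
        total += (sig[i] / sig[0]) * np.abs(dv).max()
        if total > threshold:
            break
        t = i + 1
    return t

# Synthetic modes: higher indices oscillate faster (the folk theorem in action).
params = np.linspace(0.0, 1.0, 50)
Vt = np.array([np.cos((i + 1) * np.pi * params) for i in range(6)])
sig = np.array([1.0, 0.5, 0.25, 0.12, 0.06, 0.03])
t = choose_t(sig, Vt, params, threshold=6.0)
# Mode 1 contributes about pi to the running sum; mode 2 pushes it past 6.
```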
27. An experimental test case: a heat equation problem with two parameters that control the material properties.
28. Experiments: a 20-point Latin hypercube sample.
29. Our reduced order model, shown where the error is the worst, compared against the truth.
30. A large-scale example: a nonlinear heat transfer model with 80k nodes and 300 time steps; 104 basis runs; an SVD of a 24M × 104 data matrix. A 500× reduction in wall clock time (100× including the SVD).
31. PART 2! Tall-and-skinny QR (and SVD) on MapReduce.
32. Quick review of QR. The QR factorization: let A be an m × n real matrix; then A = QR, where Q is m × n orthogonal (QᵀQ = I) and R is n × n upper triangular. Using QR for regression: the solution of min ‖Ax − b‖ is given by Rx = Qᵀb. QR is block normalization: “normalize” a vector usually generalizes to computing Q in the QR.
33. Intro to MapReduce. Originated at Google for indexing web pages and computing PageRank. The idea: bring the computations to the data; express algorithms in data-local operations; implement one type of communication, the shuffle. The shuffle moves all data with the same key to the same reducer. Data scalable. Fault tolerance by design: input is stored in triplicate, reduce input/output is on disk, and map output is persisted to disk before the shuffle.
34. Mesh point variance in MapReduce: the data are Run 1, Run 2, and Run 3, each with time steps T = 1, 2, 3.
35. Mesh point variance in MapReduce: (1) each mapper outputs the mesh points with the same key; (2) the shuffle moves all values from the same mesh point to the same reducer; (3) the reducers just compute a numerical variance. Bring the computations to the data!
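The three steps can be mimicked in-process, with a dictionary standing in for the shuffle; the run data below are made up:

```python
from collections import defaultdict

def mapper(run):
    """Step 1: emit (mesh_point_id, value) for every value in a run's time history."""
    for point_id, values in run.items():
        for v in values:
            yield point_id, v

def reducer(point_id, values):
    """Step 3: compute the numerical variance of all values at one mesh point."""
    n = len(values)
    mean = sum(values) / n
    return point_id, sum((v - mean) ** 2 for v in values) / n

# Three runs, each with two mesh points observed at three time steps.
runs = [
    {'p1': [1.0, 2.0, 3.0], 'p2': [4.0, 4.0, 4.0]},
    {'p1': [2.0, 2.0, 2.0], 'p2': [4.0, 5.0, 6.0]},
    {'p1': [3.0, 2.0, 1.0], 'p2': [6.0, 5.0, 4.0]},
]

shuffle = defaultdict(list)   # step 2: group values by mesh point
for run in runs:
    for key, value in mapper(run):
        shuffle[key].append(value)

variances = dict(reducer(k, vs) for k, vs in shuffle.items())
```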
36. Communication-avoiding TSQR (Demmel et al. 2008): first, do QR factorizations of each local matrix; second, compute a QR factorization of the new “R” matrix. Demmel et al. Communication-avoiding parallel and sequential QR, 2008.
37. Fully serial TSQR (Demmel et al. 2008): compute the QR of the first block, read the next block, update the QR, … Demmel et al. Communication-avoiding parallel and sequential QR, 2008.
38. Tall-and-skinny matrix storage in MapReduce: the key is an arbitrary row-id and the value is the array for a row. Each submatrix A_i is an input split.
39. The algorithm. Data: rows of a matrix. Map: QR factorization of local rows. Reduce: QR factorization of local rows. Mapper 1 runs serial TSQR on its rows (A_1 through A_4) and emits R_4; Mapper 2 does the same on A_5 through A_8 and emits R_8; Reducer 1 runs serial TSQR on R_4 and R_8 and emits Q and R.
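The whole pipeline reduces to repeated local QR factorizations of stacked blocks; a serial NumPy sketch that checks the blocked R against a direct factorization (block size and test matrix are illustrative):

```python
import numpy as np

def tsqr_r(A, blocksize=4):
    """R factor of a tall-and-skinny A via serial TSQR: stack the running R
    on top of the next block of rows and re-factor."""
    n = A.shape[1]
    R = np.zeros((0, n))
    for i in range(0, A.shape[0], blocksize):
        stacked = np.vstack([R, A[i:i + blocksize]])
        R = np.linalg.qr(stacked, mode='r')   # keep only the triangular factor
    return R

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 3))
R_tsqr = tsqr_r(A)
R_full = np.linalg.qr(A, mode='r')
# R is unique up to the signs of its rows, so compare absolute values.
```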
40. Key limitations: the method computes only R and not Q. We can get Q via Q = AR⁺ with another MapReduce iteration (we currently use this for computing the SVD), but it has dubious numerical stability; iterative refinement helps. We are working on better ways to compute Q (with Austin Benson and Jim Demmel).
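The Q = AR⁺ step is one more pass over the data; a sketch (R is computed directly here rather than by TSQR, and the matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 5))
R = np.linalg.qr(A, mode='r')        # stand-in for the R from a TSQR pass

# Q = A R^{-1}; since R is square and full rank, R^{-1} equals R^{+}.
Q = A @ np.linalg.inv(R)

orth_err = np.linalg.norm(Q.T @ Q - np.eye(5))
# Near machine precision for this well-conditioned A; it degrades for
# ill-conditioned A, which is the stability caveat on the slide.
```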
41. Full code in hadoopy:

```python
import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
```
42. Too much data? Too many maps? Add an iteration! Iteration 1: each mapper runs serial TSQR on its blocks of A and emits an R factor; a shuffle collects groups of R factors at reducers, which run serial TSQR again and emit new R factors. Iteration 2: an identity map and a final reducer combine the remaining R factors with one last serial TSQR.
43. mrtsqr: summary of parameters. Blocksize: how many rows to read before computing a QR factorization, expressed as a multiple of the number of columns. Splitsize: the size of each local matrix. Reduction tree: the number of reducers and iterations to use. (See paper.)
44. Varying splitsize and the tree (synthetic data). Increasing the split size improves performance (it accounts for Hadoop data movement); increasing the number of iterations helps for problems with many columns (1000 columns with a 64 MB split size overloaded the single reducer).

| Cols. | Iters. | Split (MB) | Maps | Secs. |
|---|---|---|---|---|
| 50 | 1 | 64 | 8000 | 388 |
| – | – | 256 | 2000 | 184 |
| – | – | 512 | 1000 | 149 |
| – | 2 | 64 | 8000 | 425 |
| – | – | 256 | 2000 | 220 |
| – | – | 512 | 1000 | 191 |
| 1000 | 1 | 512 | 1000 | 666 |
| – | 2 | 64 | 6000 | 590 |
| – | – | 256 | 2000 | 432 |
| – | – | 512 | 1000 | 337 |
45. MapReduce TSQR summary: MapReduce is great for TSQR! Data: a tall-and-skinny (TS) matrix by rows. Map: QR factorization of local rows. Reduce: QR factorization of local rows. Demmel et al. showed that this construction works to compute a QR factorization with minimal communication. Input: a 500,000,000-by-100 matrix; each record is a 1-by-100 row; HDFS size 423.3 GB. Time to compute the norm of each column: 161 sec. Time to compute R in qr(A): 387 sec. On a 64-node Hadoop cluster with 4×2 TB disks, one Core i7-920, and 12 GB RAM per node.
46. Our vision: to enable analysts and engineers to hypothesize from data computations instead of expensive HPC computations. Paul G. Constantine; at Sandia, Jeremy Templeton and Joe Ruthruff; … and you?