Simulation Informatics; Analyzing Large Scientific Datasets

A talk I gave at the Purdue CS&E Seminar Series.

  1. Simulation Informatics! Analyzing Large Datasets from Scientific Simulations. David F. Gleich, Purdue University, Computer Science Department; Paul G. Constantine, Stanford University; Joe Ruthruff & Jeremy Templeton, Sandia National Labs. David Gleich · Purdue CS&E Seminar
  2. This talk is a story …
  3. How I learned to stop worrying and love the simulation!
  4. I asked …! Can we do UQ on PageRank?
  5. PageRank by Google. The Model: 1. follow edges uniformly with probability α, and 2. randomly jump with probability 1 − α; we'll assume everywhere is equally likely. The places we find the surfer most often are important pages. [Figure: a six-node example graph.]
  6. RAPr, or PageRank meets UQ. What is the sensitivity to the jump parameter alpha? Model alpha as a random variable A and study the random solution x(A) of (I − A·P) x(A) = (1 − A) v; then look at E[x(A)] and Std[x(A)]. Explored in Constantine and Gleich, WAW 2007; and Constantine and Gleich, J. Internet Mathematics 2011.
  7. Random alpha PageRank has a rigorous convergence theory.
     Method                Conv.       Work                          What is N?
     Monte Carlo           1/sqrt(N)   N PageRank systems            number of samples from A
     Path damping          r^(N+2)     N + 1 matrix-vector products  terms of the Neumann series
       (without Std[x(A)])
     Gaussian quadrature   r^(2N)      N PageRank systems            number of quadrature points
     Here r is a parameter of the Beta(a, b, l, r) distribution of A.
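The Monte Carlo row of the table above can be sketched in a few lines: sample the random teleportation parameter, solve one PageRank system per sample, and average. The tiny transition matrix and the uniform sampling interval below are hypothetical stand-ins for illustration (the talk's actual work uses a Beta-distributed parameter).

```python
import numpy as np

def pagerank(P, alpha, v):
    """Solve (I - alpha * P) x = (1 - alpha) v for one fixed alpha."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * P, (1 - alpha) * v)

# A made-up 3-node column-stochastic transition matrix and uniform teleport vector.
P = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
v = np.ones(3) / 3

# Monte Carlo: sample alpha, solve a PageRank system per sample, then average.
rng = np.random.default_rng(0)
samples = np.array([pagerank(P, a, v) for a in rng.uniform(0.6, 0.9, size=2000)])
Ex, Stdx = samples.mean(axis=0), samples.std(axis=0)
```

Each solve produces a probability vector, so E[x(A)] still sums to one, while Std[x(A)] quantifies how sensitive each page's score is to the jump parameter.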
  8. Working with PageRank showed us how to treat UQ more generally …
  9. We studied parameterized matrices, A(s)x(s) = b(s): either solve at collocation points, A(J_1)x(J_1) = b(J_1), …, A(J_N)x(J_N) = b(J_N), or form a parameterized (spectral Galerkin) solution A_N(J)x_N(J) = b_N(J) — a discretized PDE with explicit parameters. The goal is to compute the Galerkin solution in a weakly intrusive manner. Constantine, Gleich, and Iaccarino. Spectral Methods for Parameterized Matrix Equations, SIMAX, 2010. Constantine, Gleich, and Iaccarino. A factorization of the spectral Galerkin system for parameterized matrix equations: derivation and applications, SISC, 2011.
  10. Simulation! The Third Pillar of Science. 21st Century Science in a nutshell: experiments are not practical or feasible, so simulate things instead. But do we trust the simulations? We're trying: Model Fidelity, Verification & Validation (V&V), Uncertainty Quantification (UQ).
  11. The message: Insight and confidence require multiple runs.
  12. The problem: A simulation run ain't cheap!
  13. Another problem: It's very hard to "modify" current codes.
  14. Large scale nonlinear, time-dependent heat transfer problem: 10^5 nodes, 10^3 time steps, 30 minutes on 16 cores. Questions: What is the probability of failure? Which input values cause failure?
  15. It's time to ask "What can science learn from Google?" – Wired Magazine (2008)
  16. "We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot." – Wired (again). 21st Century Science in a nutshell? Simulations are too expensive. Let data provide a surrogate.
  17. Our approach! Construct an interpolating reduced order model from a budget-constrained ensemble of runs for uncertainty and optimization studies.
  18. That is, we store the runs: supercomputer → data computing cluster → engineer. Each multi-day HPC simulation generates gigabytes of data. A data cluster can hold hundreds or thousands of old simulations and build the interpolant from the pre-computed data, enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification.
  19. The Database. Input parameters map to time histories of the simulation: s_1 -> f_1, s_2 -> f_2, …, s_k -> f_k. A single simulation is stacked as a vector over mesh points and time steps, f(s) = [q(x_1, t_1, s); …; q(x_n, t_1, s); q(x_1, t_2, s); …; q(x_n, t_2, s); …; q(x_n, t_k, s)], where q(x_i, t_j, s) is the state at one mesh point and time step. The database is the matrix X = [f(s_1) f(s_2) … f(s_p)].
  20. The interpolant. Motivation! Let the data give you the basis: X = [f(s_1) f(s_2) … f(s_p)]. Then find the right combination, f(s) ≈ sum_{j=1}^{r} u_j α_j(s), where the u_j are the left singular vectors from X. This idea was inspired by the success of other reduced order models like POD, and Paul's residual minimizing idea.
  21. Why the SVD?! Let's study a simple case. Let X be the matrix with entries g(x_i, s_j), and write X = U Σ V^T. Then g(x_i, s_j) = sum_{l=1}^{r} U_{i,l} σ_l V_{j,l} = sum_{l=1}^{r} u_l(x_i) σ_l v_l(s_j): this splits x and s, and we treat each right singular vector as samples of an unknown basis function. For a general parameter s, g(x_i, s) = sum_{l=1}^{r} u_l(x_i) σ_l v_l(s), with v_l(s) ≈ sum_{j=1}^{p} v_l(s_j) φ_j^(l)(s). Interpolate v any way you wish.
  22. Method summary: Compute the SVD of X. Compute an interpolant of the right singular vectors. Approximate a new value of f(s).
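A minimal NumPy sketch of the three steps, using a synthetic snapshot family and piecewise-linear interpolation of the right singular vectors as a stand-in for whatever interpolation scheme is actually chosen (the names `s_train` and `f_approx` are invented for illustration):

```python
import numpy as np

# Hypothetical database: f(s) sampled on a 50-point mesh at 9 parameter values.
s_train = np.linspace(0.0, 1.0, 9)
mesh = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.sin(np.pi * mesh * (1 + s)) for s in s_train])  # 50-by-9

# Step 1: SVD of the database matrix X.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Step 2: interpolate each right singular vector v_l(s) (piecewise linear here).
def v_interp(l, s):
    return np.interp(s, s_train, Vt[l])

# Step 3: approximate f at a new parameter value from the first r modes.
def f_approx(s, r=5):
    return sum(U[:, l] * sigma[l] * v_interp(l, s) for l in range(r))

f_new = f_approx(0.37)
```

Because this snapshot family is smooth in s, a handful of modes plus linear interpolation already reproduces a held-out parameter value well.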
  23. A quiz! Which section would you rather try to interpolate, A or B?
  24. How predictable is a singular vector? Folk Theorem (O'Leary 2011): The singular vectors of a matrix of "smooth" data become more oscillatory as the index increases. Implication! The gradient of the singular vectors increases as the index increases. So v_1(s), v_2(s), …, v_t(s) are predictable; v_{t+1}(s), …, v_r(s) are unpredictable.
  25. A refined method with an error model. Don't even try to interpolate the unpredictable modes: f(s) ≈ sum_{j=1}^{t(s)} u_j α_j(s) + sum_{j=t(s)+1}^{r} u_j σ_j η_j, with η_j ~ N(0, 1); the first sum is the predictable part and the second is the unpredictable part. Then Variance[f] = diag( sum_{j=t(s)+1}^{r} σ_j^2 u_j u_j^T ). But now, how to choose t(s)?
  26. Our current approach to choosing the predictability: t(s) is the largest τ such that (1/σ_1) sum_{i=1}^{τ} σ_i ||∂v_i/∂s|| ≤ threshold.
  27. An experimental test case: a heat equation problem with two parameters that control the material properties.
  28. Experiments: a 20 point, Latin hypercube sample.
  29. Our reduced order model where the error is the worst, versus the truth. [Figure: side-by-side comparison.]
  30. A Large Scale Example. Nonlinear heat transfer model: 80k nodes, 300 time-steps, 104 basis runs; SVD of a 24M x 104 data matrix. 500x reduction in wall clock time (100x including the SVD).
  31. PART 2! Tall-and-skinny QR (and SVD) on MapReduce
  32. Quick review of QR. Let A be m-by-n, real, with m ≥ n. The QR factorization is A = QR, where Q is m-by-n orthogonal (Q^T Q = I) and R is n-by-n upper triangular. Using QR for regression: min ||Ax − b|| is given by the solution of Rx = Q^T b. QR is block normalization: "normalize" a vector usually generalizes to computing R in the QR.
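The regression recipe is easy to check with NumPy on synthetic data (this is an illustration, not code from the talk):

```python
import numpy as np

# Least squares via QR: min ||A x - b|| is solved by R x = Q^T b.
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                    # b lies in the range of A, so the fit is exact

Q, R = np.linalg.qr(A)            # reduced QR: Q is 100x3 with Q^T Q = I
x = np.linalg.solve(R, Q.T @ b)   # triangular solve recovers x
```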
  33. Intro to MapReduce. Originated at Google for indexing web pages and computing PageRank. The idea: bring the computations to the data; express algorithms in data-local operations; implement one type of communication, the shuffle. Shuffle moves all data with the same key to the same reducer. Data scalable: maps feed the shuffle, which feeds the reducers. Fault-tolerance by design: input is stored in triplicate, reduce input/output is on disk, and map output is persisted to disk before the shuffle.
  34. Mesh point variance in MapReduce. [Figure: three runs, each with time steps T=1, T=2, T=3.]
  35. Mesh point variance in MapReduce. 1. Each mapper outputs the mesh points with the same key. 2. Shuffle moves all values from the same mesh point to the same reducer. 3. Reducers just compute a numerical variance. Bring the computations to the data!
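The map/shuffle/reduce pattern on this slide can be mimicked with plain Python containers; the run data below is a made-up stand-in for simulation output, and the dict-of-lists plays the role of the shuffle:

```python
from collections import defaultdict
import statistics

# Each "run" maps mesh-point id -> value at that point (one time step, for brevity).
runs = [{"p1": 1.0, "p2": 4.0},
        {"p1": 3.0, "p2": 4.0},
        {"p1": 2.0, "p2": 4.0}]

# Map: emit (mesh-point id, value) pairs.
mapped = [(point, value) for run in runs for point, value in run.items()]

# Shuffle: group all values with the same key.
groups = defaultdict(list)
for point, value in mapped:
    groups[point].append(value)

# Reduce: one variance per mesh point.
variance = {point: statistics.pvariance(vals) for point, vals in groups.items()}
```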
  36. Communication avoiding TSQR (Demmel et al. 2008). First, do QR factorizations of each local matrix; second, compute a QR factorization of the new "R" matrix. Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
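The two-stage construction is easy to verify numerically: the R from a QR of the stacked local R factors matches, up to row signs, the R of a direct factorization of the full matrix. This NumPy sketch is illustrative, not the talk's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 4))

# First stage: QR of each local block.
blocks = np.split(A, 4)                        # four 250-by-4 local matrices
Rs = [np.linalg.qr(B, mode="r") for B in blocks]

# Second stage: QR of the stacked R factors.
R = np.linalg.qr(np.vstack(Rs), mode="r")

# Reference: R from a direct factorization of A (equal up to signs of rows).
R_direct = np.linalg.qr(A, mode="r")
```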
  37. Fully serial TSQR (Demmel et al. 2008). Compute QR of the first block, read the next block, update the QR, …
  38. Tall-and-skinny matrix storage in MapReduce. The key is an arbitrary row-id; the value is the array for a row. Each submatrix A_i is an input split.
  39. Algorithm. Data: rows of a matrix. Map: QR factorization of local rows. Reduce: QR factorization of local rows. [Figure: Mapper 1 runs serial TSQR on blocks A_1–A_4 and emits R_4; Mapper 2 runs serial TSQR on blocks A_5–A_8 and emits R_8; Reducer 1 runs serial TSQR on R_4 and R_8 and emits the final R.]
  40. Key Limitations. Computes only R and not Q. Can get Q via Q = AR^+ with another MapReduce iteration (we currently use this for computing the SVD), but this has dubious numerical stability; iterative refinement helps. Working on better ways to compute Q (with Austin Benson, Jim Demmel).
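A sketch of the Q = AR^+ recovery and its orthogonality check, on synthetic, well-conditioned data (with an ill-conditioned A this second pass is exactly where the "dubious numerical stability" shows up):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((500, 5))

# Suppose the TSQR pass produced only R.
R = np.linalg.qr(A, mode="r")

# Recover Q by a second pass over A (the extra MapReduce iteration on the slide).
Q = A @ np.linalg.pinv(R)

# For well-conditioned A, Q is orthogonal to near machine precision.
orthogonality_error = np.linalg.norm(Q.T @ Q - np.eye(5))
```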
  41. Full code in hadoopy:

      import random, numpy, hadoopy

      class SerialTSQR:
          def __init__(self, blocksize, isreducer):
              self.bsize = blocksize
              self.data = []
              if isreducer:
                  self.__call__ = self.reducer
              else:
                  self.__call__ = self.mapper

          def compress(self):
              R = numpy.linalg.qr(numpy.array(self.data), 'r')
              # reset data and re-initialize to R
              self.data = []
              for row in R:
                  self.data.append([float(v) for v in row])

          def collect(self, key, value):
              self.data.append(value)
              if len(self.data) > self.bsize * len(self.data[0]):
                  self.compress()

          def close(self):
              self.compress()
              for row in self.data:
                  key = random.randint(0, 2000000000)
                  yield key, row

          def mapper(self, key, value):
              self.collect(key, value)

          def reducer(self, key, values):
              for value in values:
                  self.mapper(key, value)

      if __name__ == '__main__':
          mapper = SerialTSQR(blocksize=3, isreducer=False)
          reducer = SerialTSQR(blocksize=3, isreducer=True)
          hadoopy.run(mapper, reducer)
  42. Too much data? Too many maps? Add an iteration! [Figure: Iteration 1 mappers run serial TSQR and emit local R factors; a shuffle (with an identity map) groups them; Iteration 2 reducers run serial TSQR on the grouped R factors to produce the final R.]
  43. mrtsqr – summary of parameters. Blocksize: how many rows to read before computing a QR factorization, expressed as a multiple of the number of columns (see paper). Splitsize: the size of each local matrix. Reduction tree: the number of reducers and iterations to use.
  44. Varying splitsize and the tree (synthetic data).
      Cols.  Iters.  Split (MB)  Maps  Secs.
      50     1       64          8000  388
      50     1       256         2000  184
      50     1       512         1000  149
      50     2       64          8000  425
      50     2       256         2000  220
      50     2       512         1000  191
      1000   1       512         1000  666
      1000   2       64          6000  590
      1000   2       256         2000  432
      1000   2       512         1000  337
      Increasing split size improves performance (it accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with a 64-MB split size overloaded the single reducer.)
  45. MapReduce TSQR summary: MapReduce is great for TSQR! Data: a tall and skinny (TS) matrix by rows. Map: QR factorization of local rows. Reduce: QR factorization of local rows. Demmel et al. showed that this construction computes a QR factorization with minimal communication. Input: a 500,000,000-by-100 matrix; each record a 1-by-100 row; HDFS size 423.3 GB. Time to compute the norm of each column: 161 sec. Time to compute R in qr(A): 387 sec. On a 64-node Hadoop cluster with 4x2TB disks, one Core i7-920, and 12 GB RAM per node.
  46. Our vision! To enable analysts and engineers to hypothesize from data computations instead of expensive HPC computations. Paul G. Constantine; Sandia: Jeremy Templeton, Joe Ruthruff; … and you? …
