SIMULATION INFORMATICS
A data-driven methodology for synthesis from simulation results

David F. Gleich
John von Neumann Postdoctoral Fellow
Sandia National Laboratories, Livermore, CA

A bit more background on me

Before
Undergrad: Harvey Mudd College (joint CS/Math)
Internships: Yahoo! and Microsoft
Graduate: Stanford (Computational and Mathematical Engineering)
  Thesis: Models and Algorithms for PageRank Sensitivity
Internships: Intel, joint Library of Congress/Stanford, Microsoft
Post-doc 1: Univ. of British Columbia

Now
Post-doc 2: 2010 John von Neumann fellow @ Sandia

In August
Tenure-track position in Purdue's CS department

It's time to ask:

"What can science learn from Google?"
-- Wired Magazine (2008)

http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

A tale of two computers ...

Cielo (#10 in the Top 500)                    SNL/CA Nebula Cluster
  Separate storage system                       2TB/core storage
  Awesome for simulating physical systems       Awesome for working with data
  Cost: ~$50 million                            Cost: $150k

Simulation: The Third Pillar of Science

21st Century Science in a nutshell
  Experiments are not practical or feasible.
  Simulate things instead.

But do we trust the simulations? We're trying!
  Model fidelity
  Verification & Validation (V&V)
  Uncertainty Quantification (UQ)

Insight and confidence require multiple runs.

The problem: Simulation ain't cheap!

"We can throw the numbers into the biggest computing clusters the
world has ever seen and let statistical algorithms find patterns
where science cannot."
-- Wired (again)

21.1st Century Science in a nutshell?
  Simulations are too expensive.
  Let data provide a surrogate.

Input parameters s --> time history of simulation f

The Database
  s1 -> f1
  s2 -> f2
  ...
  sk -> fk

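One minimal way to picture this database is a key-value map from an
input-parameter vector s to its simulated time history f. The Python
sketch below is illustrative only: the parameter values and
temperatures are invented, and the real storage format (e.g., files on
a distributed file system) is not specified on this slide.

# Toy "Database": parameter vector s -> time history f.
# All numbers here are invented for illustration.
database = {
    (0.1, 0.9): [300.0, 341.2, 367.8],  # s1 -> f1
    (0.5, 0.5): [300.0, 352.6, 380.1],  # s2 -> f2
    (0.9, 0.1): [300.0, 361.4, 399.7],  # sk -> fk
}

# Questions can now be answered from data instead of a new HPC run,
# e.g., which stored parameters produced the hottest final state:
hottest = max(database.items(), key=lambda kv: kv[1][-1])
print("hottest final state at s =", hottest[0])
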
Our CSRF project: Simulation Informatics

To enable analysts and engineers to hypothesize from inexpensive data
computations instead of expensive HPC computations.

The People
  Myself
  Paul G. Constantine (2009 von Neumann fellow)
  Stefan Domino
  Jeremy Templeton
  Joe Ruthruff

The idea and vision: Store the runs!

Supercomputer --> MapReduce cluster --> Engineer

Each multi-day HPC simulation generates gigabytes of data. A MapReduce
cluster can hold hundreds or thousands of old simulations, enabling
engineers to query and analyze months of simulation data for
statistical studies and uncertainty quantification.

MapReduce

Originated at Google for indexing web pages and computing PageRank.

The idea
  Bring the computations to the data.
  Express algorithms in data-local operations.
  Implement one type of communication: the shuffle.
  The shuffle moves all data with the same key to the same reducer.

Data scalable
  [Figure: many parallel map tasks feed a shuffle, which routes keyed
  data to reduce tasks.]

Fault tolerance by design
  Input stored in triplicate.
  Map output persisted to disk before the shuffle.
  Reduce input/output on disk.

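To make the map-shuffle-reduce pattern concrete, here is a minimal
single-process sketch in plain Python. The function names and the
word-count example are illustrative, not part of Hadoop's API; a real
cluster distributes each stage across machines.

# A minimal, single-process sketch of MapReduce:
# map -> shuffle by key -> reduce.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map: each record emits (key, value) pairs via data-local work.
    # Shuffle: group all values with the same key together.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: combine each key's values into a final result.
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: word count, the canonical MapReduce demonstration.
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1

def count_reducer(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog"]
print(map_reduce(lines, word_mapper, count_reducer))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
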
Mesh point variance in MapReduce

Three simulation runs with three time steps each.

[Figure: nine mesh snapshots, one per run (Run 1-3) and time step
(T=1, 2, 3).]

Mesh point variance in MapReduce

Three simulation runs with three time steps each.

1. Each mapper outputs the mesh points, using the mesh point as the
   key.
2. The shuffle moves all values from the same mesh point to the same
   reducer.
3. Reducers just compute a numerical variance.

Bring the computations to the data!

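A sketch of this job in the same single-process style. Here I assume
each input record is a (run, time step, mesh point, value) tuple; the
record layout and sample numbers are invented for illustration, not
the actual simulation output format.

# Mesh-point variance as a map/reduce pair.
from collections import defaultdict

def mapper(record):
    run_id, time_step, mesh_point, value = record
    # Key by mesh point, so the shuffle gathers every run's and
    # time step's value for that point at a single reducer.
    yield mesh_point, value

def reducer(mesh_point, values):
    # Population variance over all values seen for this mesh point.
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

records = [
    (1, 1, "p0", 2.0), (1, 2, "p0", 4.0),
    (2, 1, "p0", 6.0), (2, 1, "p1", 1.0),
]
groups = defaultdict(list)           # the "shuffle"
for rec in records:
    for key, value in mapper(rec):
        groups[key].append(value)
print({k: reducer(k, vs) for k, vs in groups.items()})
# {'p0': 2.666..., 'p1': 0.0}
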
Hadoop???

Hadoop: an open-source MapReduce system supported by Yahoo!

Hadoop was the name of Doug Cutting's son's stuffed elephant.

Photo credit: New York Times (2009)
http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html

MapReduce and Surrogate Models

A surrogate model is a function that reproduces the output of a
simulation and predicts its output at new parameter values.

The pipeline:
  The Database (s1 -> f1, s2 -> f2, ..., sk -> fk)
      on the MapReduce cluster
  --> Extraction -->
  The Surrogate
      on just one machine
  --> Interpolation -->
  New Samples (sa -> fa, sb -> fb, sc -> fc)
      on the MapReduce cluster

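As a sketch of the interpolation step, here is a toy surrogate built
from stored (s, f) pairs using inverse-distance weighting. The slides
do not fix a particular interpolation scheme, so the scheme, function
names, and data below are all illustrative.

# Toy surrogate: reproduces stored outputs exactly and interpolates
# between them with inverse-distance weights.
def build_surrogate(database):
    # database: list of (s, f), with s a parameter tuple, f a scalar.
    def surrogate(s_new):
        weights, total = 0.0, 0.0
        for s, f in database:
            d2 = sum((a - b) ** 2 for a, b in zip(s, s_new))
            if d2 == 0.0:
                return f  # exact reproduction at a stored sample
            w = 1.0 / d2
            weights += w
            total += w * f
        return total / weights
    return surrogate

db = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0)]
f_hat = build_surrogate(db)
print(f_hat((0.5, 0.5)))  # prediction at a new parameter value
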
MapReduce + Simulations

Thermal Race problem (Joe Ruthruff and Jeremy Templeton)
Heating an unclassified NW model to simulate a fire.

Goal: make sure the weak link fails before the strong link.
Objective: determine where parameters are interesting.
Parameters: material properties.

1000-sample parameter study
  30 min per run
  700GB of raw data
  ~64GB of data for MapReduce

Our results
  Extraction: 30 min
  Interpolation: 5 min

Data-driven Green's functions

The simulation varies most in the corners of the parameter space.

These analyses will help us understand which parameters matter, and
where to continue running the simulations.

What can science learn from Google?
The importance of data.

For more about this, see a talk and paper by Peter Norvig:

"Internet-Scale Data Analysis"
http://www.stanford.edu/group/mmds/slides2010/Norvig.pdf

Halevy et al. "The unreasonable effectiveness of data."
IEEE Intelligent Systems, 2009.

The future?

MapReduce computers embedded within supercomputers as the parallel
file system and data analysis platform.

The best of both worlds.

Our work
Laying the foundation for useful surrogate models that will make
data-based simulation a compelling addition to the simulation
revolution.

Any questions?
