This document summarizes Rajesh Gandham's PhD thesis defense on high-order numerical methods for ocean modeling applications. The thesis goals are to develop accurate PDE models, leverage many-core hardware architectures, and use efficient algorithm techniques. The document outlines work on two-dimensional shallow water modeling using discontinuous Galerkin methods, a pasiDG simulator implementation, and preliminary work on three-dimensional oceanic modeling. Performance results are shown for the pasiDG simulator running on GPUs and CPUs for the 2004 Indian Ocean tsunami test case.
Processing Reachability Queries with Realistic Constraints on Massive Network...BigMine
Massive graphs are ubiquitous in various application domains, such as social networks, road networks, communication networks, biological networks, RDF graphs, and so on. Such graphs are massive (for example, with hundreds of millions of nodes and edges or even more) and contain rich information (for example, node/edge weights, labels and textual contents). In such massive graphs, an important class of problems is to process various graph structure related queries. Graph reachability, as an example, asks whether a node can reach another in a graph. However, the large graph scale presents new challenges for efficient query processing.
In this talk, I will introduce two new yet important types of graph reachability queries: weight constraint reachability, which imposes an edge-weight constraint on the answer path, and k-hop reachability, which imposes a length constraint on the answer path. With such realistic constraints, we can find more meaningful and practically feasible answers. These two reachability queries have wide applications in many real-world problems, such as QoS routing and trip planning.
[Japanese]Obake-GAN (Perturbative GAN): GAN with Perturbation Layersyumakishi
Abstract
Obake-GAN (Perturbative GAN), which replaces convolution layers of existing convolutional GANs (DCGAN, WGAN-GP, BIGGAN, etc.) with perturbation layers that add a fixed noise mask, is proposed. Compared with the convolutional GANs, the number of parameters to be trained is smaller, the convergence of training is faster, the inception score of generated images is higher, and the overall training cost is reduced. Algorithmic generation of the noise masks is also proposed, with which the training, as well as the generation, can be boosted with hardware acceleration. Obake-GAN is evaluated using conventional datasets (CIFAR10, LSUN, ImageNet), both in the cases when a perturbation layer is adopted only for Generators and when it is introduced to both Generator and Discriminator.
Presentation slides for the master's thesis "Obake-GAN: GAN with Perturbation Layers".
Perturbation layers are introduced in place of the GAN's convolution layers, giving:
・52% fewer trainable Generator parameters
・87% fewer trainable Discriminator parameters
・45% improvement in Inception Score on ImageNet
・Faster training convergence
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...Leo Asselborn
This presentation proposes an algorithmic approach to
synthesize stabilizing control laws for discrete-time piecewise
affine probabilistic (PWAP) systems based on computations of
probabilistic reachable sets. The considered class of systems
contains probabilistic components (with Gaussian distribution)
modeling additive disturbances and state initialization. The
probabilistic reachable state sets contain all states that are
reachable with a given confidence level under the effect of
time-variant control laws. The control synthesis uses principles
of the ellipsoidal calculus, and it considers that the system
parametrization depends on the partition of the state space. The
proposed algorithm uses LMI-constrained semi-definite programming
(SDP) problems to compute stabilizing controllers,
while polytopic input constraints and transitions between regions
of the state space are considered. The formulation of
the SDP is adopted from a previous work in [1] for switched
systems, in which the switching of the continuous dynamics
is triggered by a discrete input variable. Here, as opposed
to [1], the switching occurs autonomously and an algorithmic
procedure is suggested to synthesize a stabilizing controller. An
example for illustration is included.
In this tutorial session we will discuss how dynamical modeling combined with time-series analysis and optimization can lead to an efficient management of complex water systems. We will introduce key performance indicators to evaluate the performance of the controlled system and formulate an economic model predictive control (EMPC) scheme to address the prescribed control objectives. We will also see how we can harness the computational power of graphics cards to accelerate complex computations involved in our control problems.
bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...Matt Moores
There are many approaches to Bayesian computation with intractable likelihoods, including the exchange algorithm, approximate Bayesian computation (ABC), thermodynamic integration, and composite likelihood. These approaches vary in accuracy as well as scalability for datasets of significant size. The Potts model is an example where such methods are required, due to its intractable normalising constant. This model is a type of Markov random field, which is commonly used for image segmentation. The dimension of its parameter space increases linearly with the number of pixels in the image, making this a challenging application for scalable Bayesian computation. My talk will introduce various algorithms in the context of the Potts model and describe their implementation in C++, using OpenMP for parallelism.
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...Matt Moores
There are many approaches to Bayesian computation with intractable likelihoods, including the exchange algorithm, approximate Bayesian computation (ABC), thermodynamic integration, and composite likelihood. These approaches vary in accuracy as well as scalability for datasets of significant size. The Potts model is an example where such methods are required, due to its intractable normalising constant. This model is a type of Markov random field, which is commonly used for image segmentation. The dimension of its parameter space increases linearly with the number of pixels in the image, making this a challenging application for scalable Bayesian computation. My talk will introduce various algorithms in the context of the Potts model and describe their implementation in C++, using OpenMP for parallelism. I will also discuss the process of releasing this software as an open source R package on the CRAN repository.
Multi-scalar multiplication: state of the art and new ideasGus Gutoski
A 90-minute online presentation for zkStudyClub, delivered 2020-06-01. I present a new idea with a demonstrated 5% speed-up for multi-scalar multiplication. When combined with precomputation, this method could yield upwards of 20% speed-up.
Bayesian Inference and Uncertainty Quantification for Inverse ProblemsMatt Moores
So-called “inverse” problems arise when the parameters of a physical system cannot be directly observed. The mapping between these latent parameters and the space of noisy observations is represented as a mathematical model, often involving a system of differential equations. We seek to infer the parameter values that best fit our observed data. However, it is also vital to obtain accurate quantification of the uncertainty involved with these parameters, particularly when the output of the model will be used for forecasting. Bayesian inference provides well-calibrated uncertainty estimates, represented by the posterior distribution over the parameters. In this talk, I will give a brief introduction to Markov chain Monte Carlo (MCMC) algorithms for sampling from the posterior distribution and describe how they can be combined with numerical solvers for the forward model. We apply these methods to two examples of ODE models: growth curves in ecology, and thermogravimetric analysis (TGA) in chemistry. This is joint work with Matthew Berry, Mark Nelson, Brian Monaghan and Raymond Longbottom.
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...Hansol Kang
* Ledig, Christian, et al. "Photo-realistic single image super-resolution using a generative adversarial network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
Numerical Approximation of Filtration Processes through Porous MediaRaheel Ahmed
In this MSc thesis, we studied numerical methods for the coupling of free fluid flow with porous medium flow. The free fluid flow is modelled by the Stokes equations while the flow in the porous medium is modelled by Darcy’s law. Appropriate conditions are imposed at the interface between the two regions. The weak formulation of the problem is based on mixed-formulation for Stokes and on a primal-mixed formulation for Darcy equation, incorporating in a natural way the interface conditions. The finite element discretization of the problem leads to large, sparse and ill-conditioned algebraic system to be solved for velocities in both domains, Stokes pressure and piezometric head in porous domain. The system is reduced to interface systems for the normal velocity and piezometric head by a Schur complement approach. We present numerical results for several solution methods based on different preconditioning techniques for the solution of the interface systems. We study the effectiveness of the preconditioners with respect to mesh refinement and physical parameters. An application to cross-flow membranes has been considered. Finally, we also assess the numerical accuracy of an uncoupled algorithm for transient problem, which uses different time steps in the Stokes and in the Darcy domains.
I review how to derive Newton's law of universal gravitation from the Weyl strut between two Chazy-Curzon particles. I also briefly review Causal Dynamical Triangulations (CDT), a method for evaluating the path integral from canonical quantum gravity using Regge calculus and restrictions of the class of simplicial manifolds evaluated to those with a defined time foliation, thus enforcing a causal structure. I then discuss how to apply this approach to Causal Dynamical Triangulations, in particular modifying the algorithm to keep two simplicial submanifolds with curvature (i.e. mass) a fixed distance from each other, modulo regularized deviations and across all time slices. I then discuss how to determine if CDT produces an equivalent Weyl strut, which can then be used to obtain the Newtonian limit. I wrap up with a brief discussion of computational methods and code development.
This is a detailed review of the ACM International Collegiate Programming Contest (ICPC) Northeastern European Regional Contest (NEERC) 2015 problems. It includes a summary of each problem, the names of the problem authors, and detailed run statistics for each problem. Video of the actual presentation that was recorded during NEERC is here https://www.youtube.com/watch?v=vn7v1MuWXdU (in Russian)
Note: only preliminary stats were available during the presentation, because the problem review took place before the closing ceremony. This published presentation has full stats.
Distributed solution of stochastic optimal control problem on GPUsPantelis Sopasakis
Stochastic optimal control problems arise in many
applications and are, in principle,
large-scale involving up to millions of decision variables. Their
applicability in control applications is often limited by the
availability of algorithms that can solve them efficiently and within
the sampling time of the controlled system.
In this paper we propose a dual accelerated proximal
gradient algorithm which is amenable to parallelization and
demonstrate that its GPU implementation affords high speed-up
values (with respect to a CPU implementation) and greatly outperforms
well-established commercial optimizers such as Gurobi.
Numerical and analytical studies of single and multiphase starting jets and p...Ruo-Qian (Roger) Wang
Multiphase starting jets and plumes are widely observed in nature and engineering systems. An environmental engineering example is open-water disposal of sediments. The present study numerically simulates such starting jets/plumes using Large Eddy Simulations. The numerical scheme is first validated for single phase plumes, and the relationship between buoyancy and penetration rate is revealed. Then, the trailing stem behind the main cloud is identified, and the formation number (critical ratio U[delta]t/D, where U, D and [delta]t are discharge velocity, diameter and duration) that determines its presence is determined as a function of plume buoyancy. A unified relationship for starting plumes is developed to describe behaviors from negative to positive buoyancy. In multiphase simulations, two-phase phenomena are clarified including phase separation and the effect of particle release conditions. The most popular similarity law to scale up from the lab to the field (Cloud number scaling) is validated by a series of simulations. Finally, an example of sediment disposal in the field is given based on the present study. In related theoretical analysis, an analytical model on the vortex ring is developed and found to agree well with the direct numerical simulation results.
We formulate the initial value problem to model the evolution of the interface between two fluids of different density in three spatial dimensions. The evolution equations account for the action of gravity on the fluids, surface tension in the fluids, and prescribed far-field conditions.
The flow in each fluid is incompressible and irrotational, so the classical potential theory applies and allows for a boundary integral of dipoles representation. This representation satisfies the kinematic condition of continuous normal velocity and the Laplace-Young condition for the pressure. The dipole strength is related to the jump in potential across the interface. The model of the exact nonlinear three-dimensional motion of the interface is formulated and includes expressions for integral invariants of the motion, the mean height of the interface and the total energy per wavelength.
We develop a numerical method that employs a special generalized isothermal interface parameterization. It enables the use of implicit non-stiff time-integration methods via a small-scale decomposition. Our method includes efficient algorithms for the generation of initial data with the generalized isothermal parameterization by evolving a flat interface toward a prescribed initial surface shape or by the appropriate choice of the tangential velocities.
The method is used to efficiently compute the nonlinear evolution of a doubly periodic interface separating two fluids in the Rayleigh-Taylor instability and internal waves with surface tension.
MapReduce Tall-and-skinny QR and applicationsDavid Gleich
A talk at the SIMONS workshop on Parallel and Distributed Algorithms for Inference and Optimization on how to do tall-and-skinny QR factorizations on MapReduce using a communication avoiding algorithm.
1. High Performance High-Order Numerical Methods:
Applications in Ocean Modeling
Ph.D. Thesis Defense
Rajesh Gandham
Advised by: Prof. Tim Warburton
2. http://nctr.pmel.noaa.gov/twebinfo/images/NOAAMethod.png
Tsunami: simulation time
Q: Is “accurate” faster-than-real-time simulation possible?
Q: Can the forward wave problem be solved fast enough for stochastic analysis?
http://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake_and_tsunami
http://www.sms-tsunami-warning.com/
http://www.thedailysheeple.com/nine-years-ago-today-the-indian-ocean-tsunami_122013
4. Kundu, 2007.
Tsunami models
A variety of models for tsunami propagation
Ray tracing
• “leading edge” of tsunami wave
• Compute travel times
• No amplitude information
Two-dimensional PDE
• Depth-averaged fluid flow
• Amplitude information
Three-dimensional PDE
• General conservation law
• Full volume information
[Figure: the three models are placed along two opposing axes, compute speed and accuracy, each running between low and high.]
6. Two-dimensional
Shallow Water Modeling
Théorie du mouvement non permanent des eaux, avec application aux crues des rivières et à l'introduction des marées dans leurs lits, A. J. C. Barré de Saint-Venant, 1871.
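7. Governing equations
The shallow water equations for depth-averaged flow
Free surface height:
$$\frac{\partial \eta}{\partial t} + \frac{\partial (hu)}{\partial x} + \frac{\partial (hv)}{\partial y} = 0$$
x-momentum:
$$\frac{\partial (hu)}{\partial t} + \frac{\partial}{\partial x}\left(hu^2 + \frac{1}{2} g h^2\right) + \frac{\partial}{\partial y}\left(huv\right) = f h v - \tau_{bx} - g h \frac{\partial B}{\partial x}$$
y-momentum:
$$\frac{\partial (hv)}{\partial t} + \frac{\partial}{\partial x}\left(huv\right) + \frac{\partial}{\partial y}\left(hv^2 + \frac{1}{2} g h^2\right) = -f h u - \tau_{by} - g h \frac{\partial B}{\partial y}$$
Here $h = \eta - B$, where $\eta$ is the free surface height and $B$ is the bathymetry. The terms $f h v$ and $-f h u$ are the Coriolis force, and $\tau_{bx}$, $\tau_{by}$ are the bottom friction.
[Figure: cross-section showing $\eta$, $h$, and $B$, with regions marked $B < 0$ and $B > 0$.]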
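The flux terms under the x- and y-derivatives in the equations above can be written out as a small sketch. It assumes the state is stored as (h, hu, hv) with h > 0; the function name `swe_fluxes` and the default value of g are illustrative choices, not taken from the slides.

```python
import numpy as np

G_ACCEL = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def swe_fluxes(h, hu, hv, g=G_ACCEL):
    """Return the flux vectors (F, G) of the shallow water equations.

    F holds the terms under d/dx and G the terms under d/dy for the
    mass, x-momentum, and y-momentum equations. Requires h > 0.
    """
    u = hu / h                      # depth-averaged x-velocity
    v = hv / h                      # depth-averaged y-velocity
    pressure = 0.5 * g * h * h      # hydrostatic term (1/2) g h^2
    F = np.array([hu, hu * u + pressure, hu * v])
    G = np.array([hv, hu * v, hv * v + pressure])
    return F, G


# Example state: depth 2 m moving at 1 m/s in x, with g rounded to 10.
F, G = swe_fluxes(2.0, 2.0, 0.0, g=10.0)
```

The division by h is one reason the positivity of the fluid height matters for stability.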
8. Reed and Hill, 1973.
Overview: Hesthaven and Warburton, 2008.
Spatial discretization
Nodal discontinuous Galerkin discretization
Represent the solution on each triangle with a high-order polynomial.
Solution is discontinuous between triangles.
Why high-order? Accuracy.
$$\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} = S, \quad \mathbf{x} \in \Omega$$
$$Q(\mathbf{x}) = \sum_i Q_i \, l_i(\mathbf{x})$$
[Figure: the domain $\Omega$, plotted on axes from −6 to 10 and −6 to 6.]
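The nodal expansion Q(x) = Σ Q_i l_i(x) can be illustrated in one dimension, where the l_i are Lagrange polynomials on a set of nodes. This is only a 1D sketch with hypothetical node positions; the slides use two-dimensional nodal bases on triangles.

```python
def lagrange_basis(nodes, i, x):
    """Evaluate the i-th Lagrange polynomial l_i(x) on the given nodes."""
    value = 1.0
    for j, xj in enumerate(nodes):
        if j != i:
            value *= (x - xj) / (nodes[i] - xj)
    return value


def nodal_evaluate(nodes, nodal_values, x):
    """Evaluate Q(x) = sum_i Q_i * l_i(x)."""
    return sum(q * lagrange_basis(nodes, i, x)
               for i, q in enumerate(nodal_values))


# Three nodes carrying samples of x**2; a degree-2 expansion reproduces it.
nodes = [-1.0, 0.0, 1.0]
values = [1.0, 0.0, 1.0]
q_half = nodal_evaluate(nodes, values, 0.5)
```

Each l_i equals one at its own node and zero at the others, so the coefficients Q_i are the solution values at the nodes.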
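9. Completed by choosing well-balanced Lax-Friedrichs fluxes: Xing, Zhang, and Shu, 2010.
Discretization
Standard discontinuous Galerkin variational form &
Adams-Bashforth time integration
$$\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} = S, \quad \mathbf{x} \in \Omega$$
1. Find $Q \in \left(V(D^k)\right)^3$ such that
$$0 = \int_{D^k} \phi \left( \frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} - S \right) \quad \text{for all } \phi \in V(D^k)$$
2. Integrate by parts and introduce the numerical fluxes $F^*$, $G^*$ on the faces $\partial D^{k,f}$:
$$\int_{D^k} \phi \frac{\partial Q}{\partial t} = \int_{D^k} \frac{\partial \phi}{\partial x} F + \int_{D^k} \frac{\partial \phi}{\partial y} G + \int_{D^k} \phi S - \int_{\partial D^{k,f}} \phi \left( n_x F^* + n_y G^* \right)$$
3. Semi-discrete form:
$$\frac{dQ_k}{dt} = r_k = N(Q_k) + L(Q_k^-, Q_k^+)$$
4. Time stepping:
$$Q_k^{n+1} = Q_k^n + \Delta t \sum_{s=0}^{2} \alpha_s r_k^{n-s}$$
[Figure: triangle $D^k$ with face normals $n_{k,1}$, $n_{k,2}$, $n_{k,3}$.]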
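The time-stepping formula in step 4 sums three past right-hand sides, which is the form of a third-order Adams-Bashforth update. The sketch below uses the standard AB3 coefficients (23/12, −16/12, 5/12) on a scalar test equation; the lower-order start-up steps and the function name `ab3_integrate` are assumptions made for illustration, since the slides do not say how the history is initialized.

```python
import math

# Adams-Bashforth coefficients alpha_s, indexed by available history length.
AB_COEFFS = {
    1: (1.0,),                          # forward Euler (start-up)
    2: (3.0 / 2.0, -1.0 / 2.0),         # AB2 (start-up)
    3: (23.0 / 12.0, -16.0 / 12.0, 5.0 / 12.0),  # AB3
}


def ab3_integrate(rhs, q0, dt, nsteps):
    """Advance dq/dt = rhs(q) by nsteps steps of size dt."""
    q = q0
    history = []                        # history[s] holds r^{n-s}
    for _ in range(nsteps):
        history.insert(0, rhs(q))
        history = history[:3]           # keep at most three levels
        alpha = AB_COEFFS[len(history)]
        q = q + dt * sum(a * r for a, r in zip(alpha, history))
    return q


# Test problem dq/dt = -q with q(0) = 1, integrated to t = 1.
approx = ab3_integrate(lambda q: -q, 1.0, 0.01, 100)
error = abs(approx - math.exp(-1.0))
```

Only one new right-hand-side evaluation is needed per step, which is part of what makes multi-step schemes attractive when that evaluation is expensive.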
10. Practical issues
Numerical problems in moving towards a practical simulation
Time step restriction:
• Time step is dictated by global CFL condition: $\Delta t \propto \min_k \frac{H_k}{c_k}$
Stability:
• Fluid height is not guaranteed to be positive.
• Unphysical and unstable.
Computational cost:
• High-order representation results in high arithmetic complexity: $O\left(\frac{T N^6}{H^3}\right)$
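As a rough illustration of the global CFL restriction, the sketch below computes a single time step from per-element sizes H_k and wave speeds c_k. The slides do not define c_k; here it is assumed to be the gravity wave speed sqrt(g h) plus the flow speed, and the `cfl` constant is a hypothetical safety factor.

```python
import numpy as np


def global_time_step(H, depth, speed, g=9.81, cfl=0.5):
    """Return cfl * min_k(H_k / c_k) with c_k = sqrt(g * depth_k) + speed_k."""
    H = np.asarray(H, dtype=float)
    depth = np.asarray(depth, dtype=float)
    speed = np.asarray(speed, dtype=float)
    c = np.sqrt(g * depth) + speed      # assumed characteristic wave speed
    return cfl * np.min(H / c)


# A fine coastal element (1 km, 10 m deep) and a coarse deep-ocean
# element (2 km, 4000 m deep), both at rest.
dt = global_time_step([1000.0, 2000.0], [10.0, 4000.0], [0.0, 0.0])
```

In this example the coarse deep-water element sets the step, because the wave speed grows with depth; every element must then advance with that one step.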
11. Our approach: DG-SWE-multi-rate-PP-GPU.
This incomplete literature review is DG centric.
Related work
DG:
• Triangular mesh methods for the neutron transport equation (Reed & Hill, 1973).
• The RKDG methods (I-V) (Cockburn & Shu, 1988-1997).
• Books: Hesthaven & Warburton, 2008. Riviere, 2008. Di Pietro & Ern, 2012.
DG-SWE:
• DG for 2D flow and transport in shallow water (Aizinger & Dawson, 2002).
• High-order h-adaptive DG for ocean modeling (Bernard, Remacle, et al., 2007).
DG-SWE-PP:
• A wetting and drying treatment of RKDG solution to the SWE (Bunya, Kubatko, Westerink, & Dawson, 2009).
• Positivity-preserving high-order well-balanced DG methods for SWE (Xing, Zhang, & Shu, 2010).
DG-multi-rate:
• GPU AB multi-rate DG FEM simulation of high-frequency EM fields (Gödel et al., 2010).
• Multi-rate for explicit DG with applications to geophysical flows (Seny et al., 2013).
• A local time-stepping RKDG for hurricane storm surge modeling (Dawson, 2014).
Parallel DG:
• Nodal DG on GPUs (Klöckner, Hesthaven, Bridge, & Warburton 2009).
• DG for wave propagation through coupled elastic–acoustic media (Wilcox et al., 2010).
12. High-order triangles extension of: Bunya, Kubatko, Westerink, and Dawson, 2009.
Xing, Zhang, and Shu, 2010.
Stability: Positivity preservation
Ensure that fluid height is positive via post-processing after each time step
• If mean < cutoff: set the solution to the cutoff.
• If mean > cutoff: limit the solution to P1 and limit the slope while preserving the mean.
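The two cases above can be sketched for a single triangle. The sketch assumes the fluid height has already been reduced to P1 and is stored as three vertex values, for which the cell mean equals the average of the vertex values. The linear scaling toward the mean is in the spirit of Xing, Zhang, and Shu, 2010; the function name, the `cutoff` value, and the specific slope limiter are illustrative assumptions, not the thesis implementation.

```python
import numpy as np


def limit_height(h_vertices, cutoff=1e-3):
    """Return P1 vertex heights that are nowhere below the cutoff."""
    h = np.asarray(h_vertices, dtype=float)
    mean = h.mean()
    if mean <= cutoff:
        # Nearly dry element: set the whole solution to the cutoff.
        return np.full_like(h, cutoff)
    h_min = h.min()
    if h_min >= cutoff:
        return h.copy()                 # already admissible
    # Shrink the slope about the mean so the minimum lands on the cutoff.
    theta = (mean - cutoff) / (mean - h_min)
    return mean + theta * (h - mean)


limited = limit_height([0.5, 0.2, -0.1], cutoff=0.01)
```

Scaling the deviations from the mean leaves the mean unchanged, so the second branch does not alter the fluid volume in the element.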
13. Test case is adapted from: A. Ern, S. Piperno, and K. Djadel, 2008.
Positivity test: 1D rarefaction
Effect of positivity preserving on the solution accuracy
[Figure: fluid height h versus x, showing the profile at t = 0 and at t > 0.]
• Uniform mesh with element size $H$.
• Global L2 errors behave like $O(H^{1.5})$.
• Local L2 errors far from the wave front behave like $O(H^{2.2})$, $O(H^{3.0})$, $O(H^{2.9})$ for N = 1, 2, 3.
14. Adaptive mesh refinement is indicated as described by Blaise & Giraldo.
Positivity test: 1D rarefaction
Point-wise estimated order of convergence
The EOC is computed at each point from the errors on a sequence of meshes.
Accuracy decreases near the wave front.
[Figure: point-wise error rate (EOC) plotted over x and t.]
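An estimated order of convergence is commonly obtained from errors on two successive meshes as log(e_1/e_2) / log(H_1/H_2). The sketch below applies that formula to a sequence of meshes; the function name and the sample numbers are hypothetical and are not data from the thesis.

```python
import math


def estimated_order(errors, sizes):
    """Return the EOC between each pair of successive meshes."""
    orders = []
    for k in range(len(errors) - 1):
        ratio_error = errors[k] / errors[k + 1]
        ratio_size = sizes[k] / sizes[k + 1]
        orders.append(math.log(ratio_error) / math.log(ratio_size))
    return orders


# Halving the element size reduces the error eightfold: third order.
orders = estimated_order([8.0e-3, 1.0e-3], [0.2, 0.1])
```

Applied point by point, the same formula gives a map of where the scheme keeps or loses its formal accuracy.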
15. Gear and Wells, 1984.
Gödel, Schomann, Warburton, and Clemens, 2010.
Timestep restriction: Multi-rate integration
Multi-rate multi-step Adams-Bashforth 3rd order
• Use varying time-step size: $\Delta t_k \propto \frac{H_k}{c_k}$
• Extrapolate the coarse element traces to intermediate time step.
• Single-rate vs two-rate speedup: $\approx \frac{8 \times 2}{4 \times 2 + 4} \approx 1.33$
[Figure: a patch of eight elements, four advanced with step dt and four with step ½ dt, over the time levels $t^n$, $t^{n+1/2}$, $t^{n+1}$.]
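The speedup estimate counts element updates per coarse step: with single-rate stepping all eight elements take two half steps, while with two rates only the four fine elements do. A small sketch generalizes this count to several levels, assuming each level halves the step of the previous one and ignoring the cost of trace extrapolation; the function name is hypothetical.

```python
def multirate_speedup(elements_per_level):
    """Ratio of single-rate to multi-rate element updates per coarse step.

    elements_per_level[l] is the number of elements advanced with step
    dt / 2**l, so level 0 is the coarsest.
    """
    finest = len(elements_per_level) - 1
    total = sum(elements_per_level)
    single_rate = total * 2 ** finest   # everyone uses the finest step
    multi_rate = sum(n * 2 ** level
                     for level, n in enumerate(elements_per_level))
    return single_rate / multi_rate


# Four coarse and four fine elements, as in the estimate above.
speedup = multirate_speedup([4, 4])
```

The gain grows when most elements sit on coarse levels and only a few small or deep-water elements need the finest step.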
16. Computational approach: Work partitioning
Init
• Volume kernel (elemental) and surface kernel (element coupling): $r_k^n = N(Q_k^n) + L(Q_k^{n,-}, Q_k^{n,+})$
• Update kernel (elemental): $Q_k^{n+1} = Q_k^n + \Delta t_k \sum_{s=0}^{2} \alpha_s r_k^{n-s}$
• PP kernel (elemental): $Q_k^{n+1} = \Pi_{PP} Q_k^{n+1}$
End
17. Details: Hesthaven and Warburton, 2008.
Cubature database: Cools, 1999.
Flop intense volume & surface kernels
Dense matrix-vector products for each triangle
$$r_k = N(Q_k) + L(Q_k^-, Q_k^+)$$
[Figure: triangle $D^k$ with two "Interpolate" steps, one labeled Cubature and one labeled Quadrature.]
Computation on each triangle is independent of the others.
Computation on each node is independent of the others in a triangle.
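The "dense matrix-vector products for each triangle" can be pictured as one small interpolation matrix applied to every element's nodal values. The sketch below does this for all elements at once with NumPy; the matrix entries and array shapes are made up for illustration and do not come from a real cubature rule.

```python
import numpy as np


def interpolate_all(interp, nodal):
    """Apply the same interpolation matrix to every element.

    interp: (n_points, n_nodes) matrix shared by all triangles.
    nodal:  (n_elements, n_nodes) nodal values, one row per triangle.
    Returns (n_elements, n_points) interpolated values.
    """
    return nodal @ interp.T


# Hypothetical 2-point rule on a 3-node element, applied to two elements.
interp = np.array([[0.5, 0.5, 0.0],
                   [0.0, 0.5, 0.5]])
nodal = np.array([[1.0, 3.0, 5.0],
                  [2.0, 2.0, 2.0]])
values = interpolate_all(interp, nodal)
```

Each output row depends only on its own element and each output entry only on one matrix row, which is the independence the slide points to.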
18. Klöckner, Warburton, Bridge, and Hesthaven, 2009.
We tune for the optimal number of triangles processed by a work group for each kernel.
Mapping onto the GPU
Fine-grain parallelism of DG operations
Triangle patches are processed by a core.
Each node is processed by a thread.
[Figure: GPU with global memory and cores 0, 1, and 2, each with its own shared memory.]
19. Figure courtesy: David Medina
Medina and Warburton, 2014. libocca.org
Portability through OCCA
Extensive, unified portable multi-threading approach
[Figure: OCCA stack. Labels: Application; Kernel Language; Parser; OCCA API IR; Backends + Hardware. Backends shown: Pthreads, OpenCL, Intel COI, NVIDIA CUDA. Hardware shown: x86, Xeon Phi, AMD GPU, NVIDIA GPU.]
22. Gandham, Medina, and Warburton, 2015.
Simulator performance
pasiDG on different architectures
[Figure: two panels of throughput in Mega Nodes/s versus polynomial order (1 to 6). CPU panel (axis 0 to 240): OCCA:OpenCL and OCCA:OpenMP on Intel i7. GPU panel (axis 0 to 1800): OCCA:OpenCL and OCCA:CUDA on NVIDIA K40, and OCCA:OpenCL on AMD Tahiti.]
1 GPU ≅ 10x (6 CPU cores)
The kernels need further CPU:OpenMP optimization.
23. High-Order vs Low-Order
Time (and memory) for accuracy: translating vortex test case
Refine “p” & “H” for memory & compute efficiency.
High-order may not be expensive.
[Figure: two log-log panels of L2 error in fluid height (10^-7 to 10^-1) for N = 1 to 5, one against compute time (s) and one against memory required.]
24. 2km-35km resolution. Mesh is generated using GMSH.
Thanks to Frank Giraldo and his team for assistance.
Case I: 2004 Indian Ocean Tsunami
Configuration
[Figure panels: coastal-aligned mesh (130K); bathymetry data (NOAA); initial conditions (Okada model); multi-rate (4 levels); domain of interest.]
25. SWE absorbing layers: Modave, Deleersnijder, and Delhez, 2010.
Case I: Simulation
2004 Indian Ocean tsunami simulation using degree 4 triangles
• Quartic polynomials in each triangle.
• Absorbing layers near open boundaries.
• Bottom friction is critical for stability.
• OCCA:CUDA on NVIDIA K40 GPU.
• 10 hrs of real (simulated) time.
• ~15 mins of compute time.
[Video labels: India, Malaysia, Madagascar.]
Only fluid heights in the range [-0.4 m, 0.4 m] are shown in the video.
26. DGCOM: DG Coastal Ocean Model, developed by Frank Giraldo’s research group.
Disclaimer: DGCOM timings are from personal conversations with Frank Giraldo.
Case I: Run-time performance
Compute time for Indian Ocean Tsunami benchmark on a single GPU
Simulator | Polynomial degree N | Compute time | Real time / compute time | Normalized #dofs
DGCOM | 1 | ~8 hr | 1.25 | 1
pasiDG | 1 | 1 min | 650 | 1
pasiDG | 2 | 3 min | 208 | 2
pasiDG | 3 | 6 min | 95 | 3.3
pasiDG | 4 | 13 min | 47 | 5
pasiDG:
• OCCA:CUDA threading model.
• NVIDIA K40c GPU.
• Single precision arithmetic.
• Simulation of 10 hrs real time.
DGCOM:
• Fortran serial implementation.
• Double precision arithmetic.
• Simulation of 10 hrs real time.
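The "Normalized #dofs" column of the table is consistent with the number of nodes on a degree-N triangle, (N+1)(N+2)/2, divided by the count for N = 1. The slides do not state how the column was computed, so the sketch below is an assumption that happens to reproduce the listed values; the function names are hypothetical.

```python
def nodes_per_triangle(n):
    """Number of nodal points of a degree-n polynomial on a triangle."""
    return (n + 1) * (n + 2) // 2

def normalized_dofs(n, reference_degree=1):
    """Degrees of freedom per triangle relative to the reference degree."""
    return nodes_per_triangle(n) / nodes_per_triangle(reference_degree)

# Rounded to one decimal, as in the table.
normalized = {n: round(normalized_dofs(n), 1) for n in (1, 2, 3, 4)}
```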
27. Gauge data source: CSIR, National Institute of Oceanography.
Predictions are similar to that of DGCOM. Gopala Krishnan, Averas, and Giraldo.
Case I: Gauge data comparison
Validation with tidal gauge recordings
[Figure: wave height (cm) versus minutes after the earthquake (0 to 600) at Chennai Station (80.30E, 13.10N) and Mormugao Station (73.80E, 15.42N). Curves: gauge record, N=1, N=2, N=3, N=4.]
• Arrival time prediction is reasonable.
• Wave heights need improvement.
• DG schemes are self-consistent.
28. 0.4km-5200km resolution. Mesh is generated using GMSH.
Thanks to Bruno Seny, Université catholique de Louvain, for the mesh. Seny et al., 2013.
Case II: 2011 Japan Tsunami
Configuration
[Figure panels: stereographic mesh (~1.8M); bathymetry (NOAA); initial conditions (Okada); multi-rate (12 levels); domain of interest.]
29. *Only 161 elements (< 0.01%) take fine time-step.
Case II: Single-rate vs multi-rate
Multi-rate scheme for efficient time-stepping
[Figure: number of triangles (axis 0 to 1,200,000) at each multi-rate level, 1 to 12.]
[Figure: speedup versus number of multi-rate levels. Values for 1 to 12 levels: 1.0, 1.9, 3.7, 6.4, 9.0, 9.4, 9.5, 9.5, 9.5, 9.4, 9.4, 9.2.]
Kernels are inefficient when launched with fewer elements.
9.5x speedup with 9 levels.
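A rough cost model shows why multi-rate time stepping can pay off when very few elements need the fine time step. The sketch below assumes that each level uses a time step twice as large as the previous one and that cost is proportional to the number of element updates; both are assumptions, not statements from the slides. Apart from the 161 fine-step elements, the element counts are hypothetical.

```python
def multirate_speedup(elements_per_level, ratio=2):
    """Ideal speedup of multi-rate over single-rate stepping.

    elements_per_level[0] holds the elements on the finest time step;
    level l is assumed to use a step ratio**l times larger.
    """
    total = sum(elements_per_level)
    # number of finest steps needed to cover one coarsest step
    finest_steps = ratio ** (len(elements_per_level) - 1)
    single_rate_work = total * finest_steps
    multi_rate_work = sum(n * (finest_steps // ratio ** level)
                          for level, n in enumerate(elements_per_level))
    return single_rate_work / multi_rate_work

# 161 elements on the finest step; the other counts are made up.
example_speedup = multirate_speedup([161, 1000, 10000, 100000])
```

The model ignores kernel launch overhead, which the slide identifies as the reason the measured speedup flattens and then drops as more levels are added.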
30. SWE in Stereographic coordinates: Lanser, 2002. Dueben, 2012.
Case II: Simulation
2011 Japan tsunami simulation using degree 2 triangles
• SWE in stereographic plane.
• Modified CFL condition.
• Quadratic polynomials in each triangle.
• Multi-rate time integration with 9 levels.
• Bottom friction is critical for stability.
• OCCA:CUDA on NVIDIA K40 GPU.
• 10 hrs of real time simulated in ~1.5 hrs.
Only fluid heights in the range [-0.5 m, 0.5 m] are shown in the video.
[Video labels: Japan, USA, Alaska, Australia.]
31. SLIM: Second-generation Louvain-la-Neuve Ice-ocean Model.
Thanks to Bruno Seny for SLIM performance results.
Case II: Run-time performance
Compute time for Japan Tsunami benchmark on a single GPU
pasiDG:
• OCCA:CUDA threading model.
• 1 x NVIDIA K40c GPU.
• Single precision arithmetic.
• Simulation of 10 hrs real time.
• Multi-rate time stepping.
Simulator | Polynomial degree N | Compute time | Real time / compute time | Normalized #dofs
pasiDG | 1 | 36 min | 16 | 1
SLIM | 2 | ~75 min | 8 | 2
pasiDG | 2 | 100 min | 6 | 2
pasiDG | 3 | 220 min | 2.7 | 3.3
pasiDG | 4 | 460 min | 1.3 | 5
SLIM:
• MPI parallel CPU code.
• 256 x Intel Xeon(R) E5649 cores.
• Double precision arithmetic.
• Simulation of 10 hrs real time.
• Multi-rate time stepping.
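The "Real time / compute time" column can be checked from the compute times in the table, taking 10 hrs (600 min) of simulated time. This is a small arithmetic sketch; the compute times are rounded on the slide, so the recomputed ratios agree only approximately with the listed ones.

```python
SIMULATED_MINUTES = 10 * 60  # 10 hrs of real time, as stated on the slide

def real_time_ratio(compute_minutes):
    """Simulated time divided by wall-clock compute time."""
    return SIMULATED_MINUTES / compute_minutes

# Compute times in minutes, copied from the table.
compute_minutes = {
    "pasiDG N=1": 36,
    "SLIM N=2": 75,
    "pasiDG N=2": 100,
    "pasiDG N=3": 220,
    "pasiDG N=4": 460,
}
ratios = {name: real_time_ratio(t) for name, t in compute_minutes.items()}
```

A ratio above 1 means the simulation runs faster than real time.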
33. Benoit Cushman-Roisin, 2011.
Not considered: Horizontal diffusion, density variation
Governing equations
Incompressible, hydrostatic, Boussinesq
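Incompressibility:
$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} + \frac{\partial w}{\partial z} = 0$$
x-momentum:
$$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + w\frac{\partial u}{\partial z} = -g\frac{\partial \eta}{\partial x} + \nu\frac{\partial^2 u}{\partial z^2}$$
y-momentum:
$$\frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} + w\frac{\partial v}{\partial z} = -g\frac{\partial \eta}{\partial y} + \nu\frac{\partial^2 v}{\partial z^2}$$
Free surface height:
$$\frac{\partial \eta}{\partial t} + \frac{\partial}{\partial x}\left(\int_{B}^{\eta} u\,dz\right) + \frac{\partial}{\partial y}\left(\int_{B}^{\eta} v\,dz\right) = 0$$
Vertical diffusion: the $\nu\,\partial^2/\partial z^2$ terms in the momentum equations.
[Figure: vertical cross-section with the z axis, the free surface η, the bathymetry B (regions B < 0 and B > 0), and the depth h.]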
34. Similar approaches: Iskandarani, Haidvogel, Levin, 2003.
Mellor, Hakkinen, Ezer, and Patchen, 2002. Gerdes, 1993.
Sigma coordinate system
Domain is changing during the simulation
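[Figure: the physical domain in (x, z), bounded by z = B(x,y) and z = η(x,y) with depth h(x,y), is mapped to a fixed sigma domain with σ between 0 and 1.]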
• Fixed frame of reference/coordinate system.
• Transform the PDE into sigma coordinate system.
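The coordinate change can be sketched as a linear map between the moving physical column and the fixed interval [0, 1]. The slides show only the bounds σ = 0, σ = 1, z = B and z = η; the linear form and the orientation (σ = 0 at the bottom, σ = 1 at the free surface) are assumptions made for this illustration, and the function names are hypothetical.

```python
def to_sigma(z, bottom, eta):
    """Map physical height z in [bottom, eta] to sigma in [0, 1] (assumed linear)."""
    return (z - bottom) / (eta - bottom)

def to_z(sigma, bottom, eta):
    """Inverse map from sigma in [0, 1] back to physical height."""
    return bottom + sigma * (eta - bottom)

# A water column with bathymetry at -100 m and free surface at +0.5 m.
mid_depth = to_z(0.5, bottom=-100.0, eta=0.5)
```

Because η changes in time while the sigma interval does not, the mesh in sigma coordinates stays fixed during the simulation.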
35. Triangular prisms: Maggi, 2011
Spatial discretization
Prismatic elements in sigma coordinates
The solution is a tensor product of 1D and triangle polynomials and is discontinuous between prisms.
$$Q(x,y,\sigma) = \sum_{i,j} Q_{ij}\, l_i(x,y)\, l_j(\sigma)$$
$$\eta(x,y) = \sum_{i} \eta_i\, l_i(x,y)$$
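The tensor-product expansion can be evaluated directly from its definition. The sketch below assumes a degree-1 triangle, for which the horizontal basis functions l_i(x,y) are the barycentric coordinates, and a 1D Lagrange basis on user-supplied sigma nodes; the function names and node choices are hypothetical.

```python
def lagrange_1d(nodes, j, s):
    """Value at s of the j-th 1D Lagrange polynomial on the given nodes."""
    value = 1.0
    for m, node in enumerate(nodes):
        if m != j:
            value *= (s - node) / (nodes[j] - node)
    return value

def evaluate_prism(coeffs, bary, sigma_nodes, sigma):
    """Q = sum_ij Q_ij l_i(x,y) l_j(sigma), with l_i(x,y) = barycentric coordinates."""
    return sum(coeffs[i][j] * bary[i] * lagrange_1d(sigma_nodes, j, sigma)
               for i in range(len(bary))
               for j in range(len(sigma_nodes)))

sigma_nodes = [0.0, 0.5, 1.0]
# Coefficients chosen so that Q(x, y, sigma) = sigma everywhere in the prism.
coeffs = [list(sigma_nodes) for _ in range(3)]
value = evaluate_prism(coeffs, bary=(0.2, 0.3, 0.5),
                       sigma_nodes=sigma_nodes, sigma=0.25)
```

With these coefficients the interpolant reproduces the linear function exactly, so `value` should equal the evaluation height 0.25 up to rounding.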
36. Spatial discretization
DG discretization of momentum equations
P1 prisms: Blaise, Comblen, Legat, Remacle, Deleersnijder, and Lambrechts, 2010.
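[Figure: triangle D^k and prism E^k, with the boundary ∂E_h^k and the σ direction labeled.]
1.
$$\int_{E^k} \phi\psi \left( \frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + wG\frac{\partial u}{\partial \sigma} + g\frac{\partial \eta}{\partial x} - \nu G_z^2 \frac{\partial^2 u}{\partial \sigma^2} \right) = 0$$
2.
$$\int_{E^k} \phi\psi \frac{\partial u}{\partial t} + \int_{E^k} \phi\psi \left( u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + wG\frac{\partial u}{\partial \sigma} + g\frac{\partial \eta}{\partial x} \right) = -\int_{E^k} \nu G_z^2\, \phi\, \frac{\partial \psi}{\partial \sigma} \frac{\partial u}{\partial \sigma} - \int_{\partial E_h^k} \phi\psi \left( \tau [u] + \lambda [\eta] \right)$$
3.
$$\frac{dQ_k}{dt} = A_k + D_k, \qquad Q_k = (u_k, v_k)^T$$
for all $\phi \in P^N(D^k)$, $\psi \in P^{N_z}([0,1])$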
37. One matrix-free Conjugate Gradient solve per prism element.
Other IMEX approaches: Kang, Oh, Nam, and Giraldo. Comblen et al., 2010.
Time stepping (IMEX)
Explicit advection and implicit diffusion
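Free surface (explicit):
$$\frac{d\eta_k}{dt} = r_k, \qquad \eta_k^{n+1} = \eta_k^n + \Delta t \sum_{s=0}^{2} \alpha_s r_k^{n-s}$$
Momentum (A_k explicit, D_k implicit):
$$\frac{dQ_k}{dt} = A_k + D_k, \qquad Q_k^{n+1} = Q_k^n + \Delta t \sum_{s=0}^{2} \alpha_s A_k^{n-s} + \Delta t\, D_k^{n+1}$$
Incompressibility:
$$I_k(Q_k, w_k, \eta_k) = 0, \qquad w_k^{n+1} = (I_k)^{-1}(Q_k^{n+1}, \eta_k^{n+1})$$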
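The momentum update combines a three-step explicit sum with an implicit diffusion term. A minimal scalar sketch follows, assuming third-order Adams-Bashforth weights for α_s (the slides do not list the values) and a linear diffusion term D = -κ q, so that the implicit solve reduces to a division. In the slides the implicit part is instead a matrix-free Conjugate Gradient solve per prism. All names are hypothetical.

```python
# Assumed weights: third-order Adams-Bashforth, newest value first.
AB3 = (23.0 / 12.0, -16.0 / 12.0, 5.0 / 12.0)

def imex_step(q, explicit_history, kappa, dt):
    """One step of q' = A(q) - kappa*q, with A explicit and -kappa*q implicit."""
    explicit_part = sum(a * r for a, r in zip(AB3, explicit_history))
    # Solve q_new = q + dt*explicit_part - dt*kappa*q_new for q_new.
    return (q + dt * explicit_part) / (1.0 + dt * kappa)

def integrate(q0, explicit_rhs, kappa, dt, steps):
    q = q0
    # Simple start-up (an assumption): fill the history with the initial value.
    history = [explicit_rhs(q0)] * 3
    for _ in range(steps):
        q = imex_step(q, history, kappa, dt)
        history = [explicit_rhs(q)] + history[:2]  # keep newest first
    return q

# Constant forcing 1 and damping 2 give the steady state q = 0.5.
steady = integrate(q0=0.0, explicit_rhs=lambda q: 1.0, kappa=2.0, dt=0.1, steps=200)
```

Treating the diffusion implicitly removes its stability limit on the time step, so the step size is set by the explicit advection terms only.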
38. Mapping on to GPU-I
Dense matrix-vector products for each triangle plane in a prism
for horizontal gradients
Similar approach: vNek spectral element code, Remacle, Gandham, and Warburton.
[Figure: GPU layout with one global memory and cores 0, 1, 2, each with its own shared memory.]
• A group of prisms is processed by a core.
• Each node is processed by a 2D thread.
• Nodes in a plane are processed by contiguous threads.
39. Mapping on to GPU-II
Dense matrix-vector products for each vertical line in a prism
for vertical gradients
Similar approach: vNek spectral element code, Remacle, Gandham, and Warburton.
[Figure: GPU layout with one global memory and cores 0, 1, 2, each with its own shared memory.]
• A group of prisms is processed by a core.
• Each node is processed by a 2D thread.
• Nodes in a vertical line are processed by contiguous threads.
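The two mappings differ only in which index of a prism node varies fastest across threads. The sketch below is one possible way to express this, with node (i, j) meaning node i within a triangle plane and position j along the vertical line; the function names and the flat thread numbering are assumptions, not the pasiDG kernel code.

```python
def plane_major_thread(i, j, nodes_per_plane, nodes_per_line):
    """Horizontal-gradient layout: nodes of one plane get contiguous threads."""
    return j * nodes_per_plane + i

def line_major_thread(i, j, nodes_per_plane, nodes_per_line):
    """Vertical-gradient layout: nodes of one vertical line get contiguous threads."""
    return i * nodes_per_line + j

nodes_per_plane, nodes_per_line = 3, 2  # e.g. N = 1 triangle, Nz = 1

# Threads that handle plane j = 1 under the plane-major layout.
plane_threads = [plane_major_thread(i, 1, nodes_per_plane, nodes_per_line)
                 for i in range(nodes_per_plane)]
# Threads that handle vertical line i = 2 under the line-major layout.
line_threads = [line_major_thread(2, j, nodes_per_plane, nodes_per_line)
                for j in range(nodes_per_line)]
```

In each kernel the nodes that enter the same dense matrix-vector product sit on neighboring threads.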
40. Case I: Performance comparison
Computational overhead by moving towards 3D simulation
Polynomial degree Nz | Compute time | 3D/2D | Normalized #dofs wrt 2D
1 | 18 min | 20 | 2.3
2 | 21 min | 23 | 3.3
3 | 25 min | 28 | 4.3
4 | 28 min | 31 | 5.3
• Horizontal polynomial N = 1.
• Single rate time stepping.
• OCCA:CUDA model on NVIDIA K40c.
• Single precision arithmetic.
• Simulation of 10 hrs real time.
With kernel tuning and multi-rate time stepping, the projected 3D/2D cost is ~10x for Nz=4.
41. Can we expect better estimates with 3D?
Case I: Gauge data comparison
Validation with tidal gauge recordings
Chennai Station
(80.30E, 13.10N)
Mormugao Station
(73.80E, 15.42N)
Simulation results are similar to 2D results.
3D did not resolve wave height discrepancies.
[Figure: wave height (cm) versus minutes after the earthquake (0 to 600) at the two stations. Curves: gauge record, 2D, Nz=1, Nz=2, Nz=3, Nz=4.]
42. Performance discussion
Step-by-step performance gain for the benchmark
Optimization | Type | Speedup | Cumulative speedup
SWE & DG tailored implementation | Data structures etc | >1x |
GPU acceleration | Choice of hardware | ~10x | ~10x
Multi-rate time stepping | Choice of algorithm | ~10x | ~100x
Single precision | Choice of precision | ~2x | ~200x
OCCA:CUDA vs OpenCL | Portable software | >1x | ~200x
Modern hardware and hardware- and physics-aware algorithms all contribute to performance.
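The cumulative column is the running product of the individual speedups. A short check, using only the rows for which the slide gives an approximate factor (the ">1x" rows are left out because no number is stated):

```python
from itertools import accumulate
from operator import mul

# Approximate factors copied from the table.
steps = [
    ("GPU acceleration", 10),
    ("Multi-rate time stepping", 10),
    ("Single precision", 2),
]
cumulative = list(accumulate((factor for _, factor in steps), mul))
```

The gains multiply rather than add, which is why three moderate factors combine to roughly 200x.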
43. ALMOND: Algebraic Multi-grid on Numerous Devices.
Thesis summary
• Faster than real time simulation with high-order DG on a workstation.
• First high-order DG simulation of tsunami.
• High-order extension of positivity preserving limiter.
• Multi-rate time stepping scheme.
• Focusing on node performance.
• Comparable performance to a 256 core CPU cluster on a single GPU.
• Extension from two-dimensions to three-dimensions.
• Projected overhead is ~10x.
• Not shown: Comparison of conservative and non-conservative DG in shallows.
• Not mentioned: ALMOND, fully accelerated & truly portable algebraic multi-grid.
44. Need to hand over this project to someone interested.. :)
Future work
Possible extensions to thesis work
• Scalability:
• MPI for simulation on a cluster of GPUs/CPUs.
• Domain decomposition techniques for multi-rate time stepping.
• Numerical algorithms:
• Adaptive mesh refinement.
• Model improvements:
• Fine-grain bathymetry data.
• Stability and error analysis for three-dimensional code.
• Non-hydrostatic model.
45. Acknowledgements
The list is not complete…
Thesis Committee: Dr. Tim Warburton, Dr. Beatrice Riviere, Dr. William Symes, Dr. Stephen Bradshaw
Academic Visits: Dr. Francis Giraldo, Dr. Lucas Wilcox, Dr. Paul Fischer, Dr. Mark Ainsworth
Collaborators: David Medina, Dr. Jesse Chan, Dr. Axel Modave, Dr. Bruno Seny, Dr. Jean-François Remacle
Internships: HyPerComp Inc, Stoneridge Technology, Hess Corporation
Thanks to Cynthia Wood, Jizhou Li, Zheng Wang…