rgDefense

High Performance High-Order Numerical Methods:
Applications in Ocean Modeling
Ph.D. Thesis Defense
Rajesh Gandham
Advised by: Prof. Tim Warburton

http://nctr.pmel.noaa.gov/twebinfo/images/NOAAMethod.png
Tsunami: simulation time
Q: Is “accurate” faster than real time simulation possible?
Q: Can the forward wave problem be solved fast enough for stochastic analysis?
http://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake_and_tsunami http://www.sms-tsunami-warning.com/ http://www.thedailysheeple.com/nine-years-ago-today-the-
indian-ocean-tsunami_122013
2

Thesis goals
3
• Accurate PDE models and numerical methods.
• Many core hardware architectures.
• Efﬁcient algorithm techniques.
dt dt
dtdt
1
2 dt
1
2 dt
1
2 dt
1
2 dt

Kundu, 2007.
Tsunami models
A variety of models for tsunami propagation
4
Ray tracing
• “leading edge” of tsunami wave
• Compute travel times
• No amplitude information
Two-dimensional PDE
• Depth-averaged ﬂuid ﬂow
• Amplitude information
Three-dimensional PDE
• General conservation law
• Full volume information
✅
✅
Compute speed
low
high
Accuracy
high
low

http://regmedia.co.uk/2011/03/17/honshu-tsunami-propagation.jpg
http://architosh.com/wp-content/uploads/2013/06/gpu_macpro2013.jpg
http://audilab.bmed.mcgill.ca/AudiLab/teach/fem/anything_35.gif
Extended TeamTalk overview
5
Part 1: Two-dimensional shallow water model
• PDE model and discretization
• Computational approach
Part 2: pasiDG simulator
• Performance results
• Case studies
Part 3: Three-dimensional oceanic model
• PDE model and discretization
• Preliminary performance results
Part 4: Summary & Future work
ANYWHERE

Two-dimensional
Shallow Water Modeling
Théorie du mouvement non permanent des eaux, avec application aux crues des rivières et à l'introduction des
marées dans leurs lits, A. J. C. Barré de Saint-Venant, 1871.

Governing equations
The Shallow water equations for depth averaged ﬂow
7
∂η
∂t
+
∂(hu)
∂x
+
∂(hv)
∂y
= 0
∂(hu)
∂t
+
∂
∂x
hu2
+
1
2
gh2⎛
⎝⎜
⎞
⎠⎟ +
∂
∂y
huv( )= + fhv −τbx − gh
∂B
∂x
∂(hv)
∂t
+
∂
∂x
huv( )+
∂
∂y
hv2
+
1
2
gh2⎛
⎝⎜
⎞
⎠⎟ = − fhu −τby − gh
∂B
∂y
Free surface
height
x-momentum
y-momentum
Coriolis force
h = η − B
Bh
B < 0
B > 0
η
Bottom friction
Bathymetry

Reed and Hill, 1973.
Overview: Hesthaven and Warburton, 2008.
Spatial discretization
Nodal discontinous Galerkin discretization
8
Represent the solution on each triangle with a high-order polynomial.
Solution is discontinuous between triangles.
Why high-order? Accuracy.
∂Q
∂t
+
∂F
∂x
+
∂G
∂y
= S x ∈Ω
Q(x) = Qi li (x)
i
∑−6 −4 −2 0 2 4 6 8 10
−6
−4
−2
0
2
4
6
Ω

Completed by choosing well-balanced Lax-Friedrich ﬂuxes: Xing, Zhang, and Shu, 2010.
Discretization
Standard discontinuous Galerkin variational form &
Adams-Bashforth time integration
9
∂Q
∂t
+
∂F
∂x
+
∂G
∂y
= S x ∈Ω
2. φ
∂Q
∂tDk
∫ =
∂φ
∂x
F
Dk
∫ +
∂φ
∂y
G
Dk
∫ + φS − φ nxF*
+ nyG*
( )
∂Dk, f
∫
Dk
∫
1. Find Q ∈ V Dk
( )( )
3
such that 0 = φ
∂Q
∂t
+
∂F
∂x
+
∂G
∂y
− S
⎛
⎝⎜
⎞
⎠⎟
Dk
∫ for all φ ∈V Dk
( )
nk,1
nk,2
nk,3
Dk
3.
dQk
dt
= rk = N (Qk )+ L(Qk
−
,Qk
+
)
4. Qk
n+1
= Qk
n
+ Δt αsrk
n−s
s=0
2
∑

Practical issues
Numerical problems in moving towards a practical simulation
10
Time step restriction:
Stability:
Computational cost:
• Time step is dictated by global CFL condition.
• Fluid height is not guaranteed to be positive.
• Unphysical and unstable.
• High-order representation results in high arithmetic complexity.
Δt ∝ min
Hk
ck
O
TN6
H 3
⎛
⎝⎜
⎞
⎠⎟

Our approach: DG-SWE-multi-rate-PP-GPU.
This incomplete literature review is DG centric.
Related work
11
DG:
• Triangular mesh methods for the Neutron transport equation (Reed & Hill, 1973).
• The RKDG methods (I-V) (Cockburn & Shu, 1988-1997).
• Books: Hesthaven & Warburton, 2008. Riviere, 2008. Pietro & Ern, 2012.
DG-SWE:
• DG for 2D flow and transport in shallow water (Aizinger & Dawson, 2002).
• High-order h-adaptive DG for ocean modeling (Bernard, Remacle, et. al., 2007).
DG-SWE-PP:
• A wetting and drying treatment of RKDG solution to the SWE (Bunya, Kubatko, Westerink,& Dawson, 2009).
• Positivity-preserving high-order well-balanced DG methods for SWE (Xing, Zhang, & Shu, 2010).
DG-multi-rate:
• GPU AB multi-rate DG FEM simulation of high-frequency EM fields (Gödel et.al., 2010).
• Multi-rate for explicit DG with applications to geophysical flows (Seny et.al., 2013).
• A local time-stepping RKDG for hurricane storm surge modeling (Dawson, 2014).
Parallel DG:
• Nodal DG on GPUs (Klöckner, Hesthaven, Bridge, & Warburton 2009).
• DG for wave propagation through coupled elastic–acoustic media (Wilcox et.al.,2010).

High-order triangles extension of: Bunya, Kubatko, Westerink, and Dawson, 2009.
Xing, Zhang, and Shu, 2010.
Stability: Positivity preservation
Ensure that ﬂuid height is positive via post-processing after each time-step
12
• If mean < cutoff: set the solution to cutoff:
• If mean > cutoff: limit the solution to P1 and limit the slope while preserving the mean:
h
h

Test case is adapted from: A. Ern, S. Piperno, and K. Djadel, 2008.
Positivity test: 1D rarefaction
Effect of positivity preserving on the solution accuracy
13
h
x
Fluid height proﬁle
at t = 0
Fluid height proﬁle
at t > 0
• Uniform mesh with elements size .
• Global L2 errors behave like .
• Local L2 errors far from wave front behave like , , for N=1,2,3.
H
O(H1.5
)
O(H 2.2
) O(H 3.0
) O(H 2.9
)

Adaptive mesh reﬁnement is indicated as described by Blaise & Giraldo.
Positivity test: 1D rarefaction
Point-wise estimated order of convergence
14
EOC are computed at each point by computing errors on a sequence of meshes.
Decrease in accuracy near the wave front.
EOC
x
t
Point-wise error rate

Gear and Wells, 1984.
Gödel, Schomann, Warburton, and Clemens, 2010.
Timestep restriction: Multi-rate integration
Multi-rate multi-step Adams-Bashforth 3rd order
15
• Use varying time-step size:
• Extrapolate the coarse element traces to intermediate time step.
dt dt
dtdt
1
2 dt
1
2 dt
1
2 dt
1
2 dt
dt dt
dtdt
1
2 dt
1
2 dt
1
2 dt
1
2 dt
• Single-rate vs two-rate speedup: ≈
8 × 2
4 × 2 + 4
≈1.33
dt dt
dtdt
1
2 dt
1
2 dt
1
2 dt
1
2 dt
tn
tn+1
tn+1/2
Δtk ∝
Hk
ck

Computational approach: Work partitioning
16
Volume kernel
Surface kernel
Update kernel
PP kernel
Init
End
rk
n
= N (Qk
n
)
+L(Qk
n,−
,Qk
n,+
)
Qk
n+1
= Qk
n
+ Δtk αsrk
n−s
s=0
2
∑
Qk
n+1
= ∏PP Qk
n+1
Elemental
Element coupling
Elemental
Elemental
✅
✅

Cubature Quadrature
Dk
Details: Hesthaven and Warburton, 2008.
Cubature database: Cools, 1999.
Flop intense volume & surface kernels
17
Dense matrix-vector products for each triangle
Interpolate Interpolate
Computation on each triangle is independent of the others.
Computation on each node is independent of the others in a triangle.
rk = N (Qk )+ L(Qk
−
,Qk
+
)

Klöckner, Warburton, Bridge, and Hesthaven, 2009.
We tune for the optimal #triangles processed by a work group for each kernel.
Mapping on to GPU
18
Fine-grain parallelism of DG operations
Triangle patches are processed by a core.
Global Memory
Core 0
SharedShared Shared
Core 1 Core 2
Each node is processed by a thread.

Figure courtesy: David Medina
Medina and Warburton, 2014. libocca.org
Portability through OCCA
19
Extensive, uniﬁed portable multi-threading approach
Kernel Language
x86
Xeon
Phi
AMD
GPU
NVIDIA
GPU
OpenCL
Intel
COI
NVIDIA
CUDA
Application Backends + Hardware
Pthreads
OCCA API IR
OpenCL
NVIDIA
CUDA
Parser
✅
✅
✅
✅
✅
✅
✅
✅

K40: http://www.nvidia.com/content/tesla/pdf/nvidia-tesla-kepler-family-datasheet.pdf
Tahiti:http://www.legitreviews.com/amd-radeon-hd-7990-6gb-malta-video-card-review_2177
http://ark.intel.com/products/63697/Intel-Core-i7-3930K-Processor-12M-Cache-up-to-3_80-GHz
Test of portability
21
GPUs and multi-core CPUs
NVIDIA-K40 AMD-Tahiti i7-3930K
#FPU 2880 2x2048 6 cores x8
Peak SP (Tﬂop/s) 4.29 2x4.1 ~0.3
Memory (GB) 12 2x3 -
Bandwidth (GB/s) 288 2x288 51
Cost ($) 5000 1000 400
CUDA or
OpenCL
OpenCL
OpenMP or
OpenCL
http://5.grgs.ro/images/products/1/722028/798278/normal/tesla-k40c-12gb-ddr5-384-bit-
c017619440d7fa91e6b1d38b5b41584f.jpg
NVIDIA K40
http://images.anandtech.com/doci/6915/7990Angle.jpg
AMD Tahiti
http://3dprint.com/wp-content/uploads/2015/01/h2.jpg
Intel i7

Gandham, Medina, and Warburton, 2015.
Simulator performance
22
pasiDG on different architectures
0
60
120
180
240
1 2 3 4 5 6
OCCA:OpenCL, Intel i7
OCCA:OpenMP, Intel i7
0
450
900
1350
1800
1 2 3 4 5 6
OCCA:OpenCL, NVIDIA K40
OCCA:CUDA, NVIDIA K40
OCCA:OpenCL, AMD Tahiti
Mega
Nodes/s
1GPU ≅ 10x (6 CPU cores)
The kernels need further CPU:OpenMP optimization.
Polynomial Order Polynomial Order

High-Order vs Low-Order
23
Time (and memory) for accuracy: translating vortex test case
Refine “p” & “H” for memory & compute efficiency.
High-order may not be expensive.
104
105
106
10710
-7
10-6
10-5
10-4
10-3
10
-2
10
-1
N=1
N=2
N=3
N=4
N=5
10-1
100
101
102
10310
-7
10-6
10-5
10-4
10-3
10
-2
10
-1
N=1
N=2
N=3
N=4
N=5
L2 error
in fluid
height
Compute time (s) Memory required

2km-35km resolution. Mesh is generated using GMSH.
Thanks to Frank Giraldo and his team for assistance.
Case I: 2004 Indian Ocean Tsunami
Conﬁguration
24
Coastal aligned mesh (130K) Bathymetry data (NOAA )
Initial conditions (Okada model) Multirate (4 levels)
Domain of interest

SWE absorbing layers: Modave, Deleersnijder, and Delhez, 2010.
Case I: Simulation
2004 Indian Ocean tsunami simulation using degree 4 triangles
25
• Quartic polynomials in each triangle.
• Absorbing layers near open boundaries.
• Bottom friction is critical for stability.
• OCCA:CUDA on NVIDIA K40 GPU.
• 10 hrs of real-time.
• ~15 mins of simulation time.
India
Malaysia
Madagascar
Fluid heights only in between [-0.4m 0.4m] is shown in the video.

DGCOM: DG Coastal Ocean Model, developed by Frank Giraldo’s research group.
Disclaimer: DGCOM timings are from personal conversations with Frank Giraldo.
Case I: Run-time performance
Compute time for Indian Ocean Tsunami benchmark on a single GPU
26
Simulator
Polynomial
degree N
Compute 
time
Real time/
compute time
Normalized 
#dofs
DGCOM 1 ~8hr 1.25 1
pasiDG 1 1 min 650 1
pasiDG 3 6 min 95 3.3
• OCCA:CUDA threading model.
• NVIDIA K40c GPU.
• Single precision arithmetic.
• Simulation of 10 hrs real time.
• Fortran serial implementation.
• Double precision arithmetic.
pasiDG: DGCOM:

Gauge data source: CSIR, National Institute of Oceanography.
Predictions are similar to that of DGCOM. Gopala Krishnan, Averas, and Giraldo.
Case I: Gauge data comparison
Validation with tidal gauge recordings
27
Chennai Station
(80.30E, 13.10N)
Mormugao Station
(73.80E, 15.42N)
minutes after Earthquake
0 100 200 300 400 500 600
waveheightincm
-200
-150
-100
-50
0
50
100
150
200
250
gauge record
N=1
N=2
N=3
N=4
0 100 200 300 400 500 600
waveheightincm
-400
-300
-200
-100
0
100
200
300
400
gauge record
N=1
N=2
N=3
N=4
• Arrival time prediction is reasonable.
• Wave heights need improvement.
• DG schemes are self-consistent.

0.4km-5200km resolution. Mesh is generated using GMSH.
Thanks to Bruno Seny, Université catholique de Louvain, for the mesh. Seny et. al., 2013.
Case II: 2011 Japan Tsunami
Conﬁguration
28
Stereographic mesh (~1.8M) Bathymetry (NOAA)
Initial Conditions (Okada) Multi-rate (12 levels)
Domain of interest

*Only 161 elements (< 0.01%) take fine time-step.
Case II: Single-rate vs multi-rate
Multi-rate scheme for efficient time-stepping
29
0
300,000
600,000
900,000
1,200,000
1 2 3 4 5 6 7 8 9 10 11 12
0
3
6
9
12
1 2 3 4 5 6 7 8 9 10 11 12
9.29.49.49.59.59.59.4
9.0
6.4
3.7
1.9
1.0
Kernels are inefficient when launched with fewer elements.
9.5x speedup with 9 levels.
Level # multi-rate levels
Speedup
#triangles

SWE in Stereographic coordinates: Lanser, 2002. Dueben, 2012.
Case II: Simulation
2011 Japan tsunami simulation using degree 2 triangles
30
• SWE in stereographic plane.
• Modiﬁed CFL condition.
• Quadratic polynomials in each triangle.
• Multi-rate time integration with 9 levels.
• Bottom friction is critical for stability.
• OCCA:CUDA on NVIDIA K40 GPU.
• 10 hrs of real time ~1.5 hrs.
Fluid heights only in between [-0.5m 0.5m] is shown in the video.
Japan
USA
Alaska
Australia

SLIM: Second-generation Louvain-la-Neuve Ice-ocean Model.
Thanks to Bruno Seny for SLIM performance results.
Case II: Run-time performance
Compute time for Japan Tsunami benchmark on a single GPU
31
• OCCA:CUDA threading model.
• 1 x NVIDIA K40c GPU.
• Multi-rate time stepping.
Simulator
Polynomial
degree N
Compute 
time
Real time/
compute time
Normalized 
#dofs
SLIM 2 ~75 min 8 2
pasiDG 3 220 min 2.7 3.3
pasiDG 4 460 min 1.3 5
• MPI parallel CPU code.
• 256 x Intel Xeon(R) E5649 cores.
• Double precision arithmetic.
• Multi-rate time stepping.
pasiDG: SLIM:

Benoit Cushman-Roisin, 2011.
Not considered: Horizontal diffusion, density variation
Governing equations
33
Incompressible, hydrostatic, Boussenesq
∂u
∂x
+
∂v
∂y
+
∂w
∂z
= 0
∂u
∂t
+ u
∂u
∂x
+ v
∂u
∂y
+ w
∂u
∂z
= −g
∂η
∂x
+υ
∂2
u
∂z2
∂v
∂t
+ u
∂v
∂x
+ v
∂v
∂y
+ w
∂v
∂z
= −g
∂η
∂y
+υ
∂2
v
∂z2
Bh
B < 0
B > 0
ηz
Incompressibility
x-momentum
y-momentum
Free surface height
Vertical diffusion
∂η
∂t
+
∂
∂x
udz
B
η
∫
⎛
⎝⎜
⎞
⎠⎟ +
∂
∂y
vdz
B
η
∫
⎛
⎝⎜
⎞
⎠⎟ = 0Free surface height

Similar approaches: Iskandarani, Haidvogel, Levin, 2003.
Mellor, Hakkinen, Ezer, and Patchen, 2002. Gerdes,1993.
Sigma coordinate system
34
Domain is changing during the simulation
!x
σ
σ = 0
σ = 1
x
z
z = B(x,y)
z = η(x,y)
h(x,y)
• Fixed frame of reference/coordinate system.
• Transform the PDE into sigma coordinate system.

Triangular prisms: Maggi, 2011
35
Prismatic elements in sigma coordinates
Solution is a tensor-product of 1D and triangle polynomials &
is discontinuous between prisms.
Q(x,y,σ ) = Qij li (x,y)
i, j
∑ lj (σ )
η(x,y) = ηi li (x,y)
i
∑

36
DG discretization of momentum equations
P1 prisms: Blaise, Comblen, Legat, Remacle, Deleersnijder, and Lambrechts, 2010.
Dk
Ek
1. φψ
∂u
∂t
+ u
∂u
∂x
+ v
∂u
∂y
+ wG
∂u
∂σ
+ g
∂η
∂x
−υGz
2 ∂2
u
∂σ 2
⎛
⎝⎜
⎞
⎠⎟
Ek
∫ = 0
2. φψ
∂u
∂tEk
∫ + φψ u
∂u
∂x
+ v
∂u
∂y
+ wG
∂u
∂σ
+ g
∂η
∂x
⎛
⎝⎜
⎞
⎠⎟
Ek
∫
= − υGz
2
φ
∂ψ
∂σ
∂u
∂σEk
∫ − φψ τ u[ ]+ λ η[ ]( )
∂Eh
k
∫
∂Eh
k
σ
3.
dQk
dt
= Ak + Dk Qk = (uk ,vk )T
for all φ ∈PN
Dk
( ),ψ ∈PNz
[0,1]( )

One matrix-free Conjugate Gradient solve per prism element.
Other IMEX approaches: Kang, Oh, Nam, and Giraldo. Comblen et. al., 2010.
Time stepping (IMEX)
37
Explicit advection and implicit diffusion
ηk
n+1
= ηk
n
+ Δt αsrk
n−s
s=0
2
∑
Ik (Qk ,wk ,ηk ) = 0
Qk
n+1
= Qk
n
+ Δt α s
Ak
n−s
s=0
2
∑ + ΔtDk
n+1
Free surface
Momentum
Incompressibility
dηk
dt
= rk
dQk
dt
= Ak + Dk
Explicit
Explicit Implicit
wk
n+1
= (Ik )−1
(Qk
n+1
,ηk
n+1
)

Mapping on to GPU-I
38
Dense matrix-vector products for each triangle plane in a prism
for horizontal gradients
Similar approach: vNek spectral element code, Remacle, Gandham, and Warburton.
Global Memory
Core 0
SharedShared Shared
Core 1 Core 2
• Group of prisms are processed by a core.
• Each node is processed by a 2D thread.
• Nodes in a plane are processed by contiguous threads.

Mapping on to GPU-II
39
Dense matrix-vector products for each vertical line in a prism
for vertical gradients
Similar approach: vNek spectral element code, Remacle, Gandham, and Warburton.
Global Memory
Core 0
SharedShared Shared
Core 1 Core 2
• Group of prisms are processed by a core.
• Each node is processed by a 2D thread.
• Nodes in a vertical line are processed by contiguous threads.

Case I: Performance comparison
40
Computational overhead by moving towards 3D simulation
Polynomial
degree Nz
Compute 
time
3D/2D
Normalized 
#dofs wrt 2D
1 18 min 20 2.3
2 21 min 23 3.3
3 25 min 28 4.3
4 28 min 31 5.3
• Horizontal polynomial N = 1.
• Single rate time stepping.
• OCCA:CUDA model on NVIDIA K40c.
With kernel tuning and multi-rate time stepping, the projected 3D/2D cost is ~10x for Nz=4.

Can we expect better estimates with 3D?
Case I: Gauge data comparison
Validation with tidal gauge recordings
41
Chennai Station
(80.30E, 13.10N)
Mormugao Station
(73.80E, 15.42N)
Simulation results are similar to 2D results.
3D did not resolve wave height discrepancies.
0 100 200 300 400 500 600
waveheightincm
-400
-300
-200
-100
0
100
200
300
400
gauge record
2D
Nz=1
Nz=2
Nz=3
Nz=4
0 100 200 300 400 500 600
waveheightincm
-150
-100
-50
0
50
100
150
gauge record
2D
Nz=1
Nz=2
Nz=3
Nz=4

Performance discussion
42
Step-by-step performance gain for the benchmark
Optimization Type Speedup
Cumulative
speedup
SWE & DG tailored
implementation
Data structures etc >1x
GPU acceleration Choice of hardware ~10x ~10x
Multi-rate time stepping Choice of algorithm ~10x ~100x
Single precision Choice of precision ~2x ~200x
OCCA:CUDA vs OpenCL Portable software >1x ~200x
Modern hardware, hardware and physics aware algorithms contribute to performance.

ALMOND: Algebraic Multi-grid on Numerous Devices.
Thesis summary
43
• Faster than real time simulation with high-order DG on a workstation.
• First high-order DG simulation of tsunami.
• High-order extension of positivity preserving limiter.
• Multi-rate time stepping scheme.
• Focusing on node performance.
• Comparable performance to a 256 core CPU cluster on a single GPU.
• Extension from two-dimensions to three-dimensions.
• Projected over head is ~10x.
• Not shown: Comparison of conservative and non-conservative DG in shallows.
• Not mentioned ALMOND: Fully accelerated & truly portable algebraic multi-grid.

Need to hand over this project to someone interested.. :)
Future work
44
Possible extensions to thesis work
• Scalability:
• MPI for simulation on a cluster of GPUs/CPUs.
• Domain decomposition techniques for multi-rate time stepping.
• Numerical algorithms:
• Adaptive mesh reﬁnement.
• Model improvements:
• Fine-grain bathymetry data.
• Stability and error analysis for three-dimensional code.
• Non-hydrostatic model.

Acknowledgements
45
The list is not complete…..
Dr. Tim Warburton
Dr. Beatrice Riviere
Dr. William Symes
Dr. Stephen Bradshaw
Thesis Committee
Dr. Francis Giraldo
Dr. Lucas Wilcox
Dr. Paul Fischer
Dr. Mark Ainsworth
Academic Visits
David Medina
Dr. Jesse Chan
Dr. Axel Modave
Dr. Bruno Seny
Dr. Jean-François Remacle
Collaborators
HyPerComp Inc
Stoneridge Technology
Hess Corporation
Internships
Thanks to Cynthia Wood, Jizhou Li, Zheng Wang…

rgDefense

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to rgDefense

Similar to rgDefense (20)

rgDefense