High Performance High-Order Numerical Methods:
Applications in Ocean Modeling
Ph.D. Thesis Defense
Rajesh Gandham
Advised by: Prof. Tim Warburton
http://nctr.pmel.noaa.gov/twebinfo/images/NOAAMethod.png
Tsunami: simulation time
Q: Is “accurate” faster than real time simulation possible?
Q: Can the forward wave problem be solved fast enough for stochastic analysis?
http://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake_and_tsunami
http://www.sms-tsunami-warning.com/
http://www.thedailysheeple.com/nine-years-ago-today-the-indian-ocean-tsunami_122013
Thesis goals
• Accurate PDE models and numerical methods.
• Many core hardware architectures.
• Efficient algorithm techniques.
Kundu, 2007.
Tsunami models
A variety of models for tsunami propagation
Ray tracing
• “leading edge” of tsunami wave
• Compute travel times
• No amplitude information
Two-dimensional PDE
• Depth-averaged fluid flow
• Amplitude information
Three-dimensional PDE
• General conservation law
• Full volume information
[Figure: the three model classes placed along two axes, compute speed (low to high) and accuracy (high to low).]
http://regmedia.co.uk/2011/03/17/honshu-tsunami-propagation.jpg
http://architosh.com/wp-content/uploads/2013/06/gpu_macpro2013.jpg
http://audilab.bmed.mcgill.ca/AudiLab/teach/fem/anything_35.gif
Talk overview
Part 1: Two-dimensional shallow water model
• PDE model and discretization
• Computational approach
Part 2: pasiDG simulator
• Performance results
• Case studies
Part 3: Three-dimensional oceanic model
• PDE model and discretization
• Preliminary performance results
Part 4: Summary & Future work
Two-dimensional
Shallow Water Modeling
Théorie du mouvement non permanent des eaux, avec application aux crues des rivières et à l'introduction des
marées dans leurs lits, A. J. C. Barré de Saint-Venant, 1871.
Governing equations
The Shallow water equations for depth averaged flow
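Free surface height:
\[
\frac{\partial \eta}{\partial t} + \frac{\partial (hu)}{\partial x} + \frac{\partial (hv)}{\partial y} = 0
\]
x-momentum:
\[
\frac{\partial (hu)}{\partial t} + \frac{\partial}{\partial x}\left( hu^2 + \frac{1}{2} g h^2 \right) + \frac{\partial}{\partial y}\left( huv \right) = + f h v - \tau_{bx} - g h \frac{\partial B}{\partial x}
\]
y-momentum:
\[
\frac{\partial (hv)}{\partial t} + \frac{\partial}{\partial x}\left( huv \right) + \frac{\partial}{\partial y}\left( hv^2 + \frac{1}{2} g h^2 \right) = - f h u - \tau_{by} - g h \frac{\partial B}{\partial y}
\]
Here \(h = \eta - B\) is the fluid height, \(\eta\) the free surface height, and \(B\) the bathymetry. The terms \(+fhv\) and \(-fhu\) are the Coriolis force, \(\tau_{bx}\) and \(\tau_{by}\) the bottom friction, and \(gh\,\partial B/\partial x\), \(gh\,\partial B/\partial y\) the bathymetry terms.
[Figure: cross-section showing \(\eta\), \(B\), and \(h\), with regions where \(B < 0\) and \(B > 0\).]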
Reed and Hill, 1973.
Overview: Hesthaven and Warburton, 2008.
Spatial discretization
Nodal discontinuous Galerkin discretization
Represent the solution on each triangle with a high-order polynomial.
Solution is discontinuous between triangles.
Why high-order? Accuracy.
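\[
\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} = S, \qquad x \in \Omega
\]
\[
Q(x) = \sum_i Q_i \, l_i(x)
\]
[Figure: the domain \(\Omega\).]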
Completed by choosing well-balanced Lax-Friedrichs fluxes: Xing, Zhang, and Shu, 2010.
Discretization
Standard discontinuous Galerkin variational form &
Adams-Bashforth time integration
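\[
\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} = S, \qquad x \in \Omega
\]
1. Find \(Q \in \left( V(D^k) \right)^3\) such that
\[
0 = \int_{D^k} \varphi \left( \frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} - S \right) \quad \text{for all } \varphi \in V(D^k)
\]
2.
\[
\int_{D^k} \varphi \frac{\partial Q}{\partial t} = \int_{D^k} \frac{\partial \varphi}{\partial x} F + \int_{D^k} \frac{\partial \varphi}{\partial y} G + \int_{D^k} \varphi S - \int_{\partial D^{k,f}} \varphi \left( n_x F^* + n_y G^* \right)
\]
3.
\[
\frac{dQ_k}{dt} = r_k = N(Q_k) + L(Q_k^-, Q_k^+)
\]
4.
\[
Q_k^{n+1} = Q_k^n + \Delta t \sum_{s=0}^{2} \alpha_s r_k^{n-s}
\]
[Figure: element \(D^k\) with face normals \(n_{k,1}\), \(n_{k,2}\), \(n_{k,3}\).]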
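The time update in step 4 can be sketched for a scalar ODE. This is a minimal illustration, not the thesis code: it assumes the standard third-order Adams-Bashforth coefficients (23/12, -16/12, 5/12) for the \(\alpha_s\), and it bootstraps the first two steps with Heun's method, which is a choice made here and not something the slides specify. The names `ab3_integrate` and `decay` are hypothetical.

```python
import math

# Standard third-order Adams-Bashforth coefficients (alpha_0, alpha_1, alpha_2).
AB3 = (23.0 / 12.0, -16.0 / 12.0, 5.0 / 12.0)


def ab3_integrate(rhs, q0, dt, n_steps):
    """Integrate dq/dt = rhs(q) with AB3.

    The first two steps use Heun's method (an assumption of this sketch)
    because AB3 needs three stored right-hand sides.
    """
    q = q0
    history = []  # most recent first: r^n, r^{n-1}, r^{n-2}
    for _ in range(n_steps):
        r = rhs(q)
        history = [r] + history[:2]
        if len(history) < 3:
            # Start-up step: second-order Heun predictor-corrector.
            q_pred = q + dt * r
            q = q + 0.5 * dt * (r + rhs(q_pred))
        else:
            # q^{n+1} = q^n + dt * sum_s alpha_s * r^{n-s}
            q = q + dt * sum(a * r_s for a, r_s in zip(AB3, history))
    return q


def decay(q):
    """Right-hand side of the test problem dq/dt = -q."""
    return -q


approx = ab3_integrate(decay, 1.0, 0.01, 100)
exact = math.exp(-1.0)
```

In a DG solver `rhs` would be the combined volume and surface operator \(N + L\), and `q` the vector of nodal values on each element.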
Practical issues
Numerical problems in moving towards a practical simulation
Time step restriction:
• Time step is dictated by global CFL condition: \(\Delta t \propto \min_k \frac{H_k}{c_k}\)
Stability:
• Fluid height is not guaranteed to be positive.
• Unphysical and unstable.
Computational cost:
• High-order representation results in high arithmetic complexity: \(O\!\left( \frac{T N^6}{H^3} \right)\)
Our approach: DG-SWE-multi-rate-PP-GPU.
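The global CFL restriction can be illustrated with a small sketch. It assumes the usual shallow water wave speed \(c_k = |u_k| + \sqrt{g h_k}\) and a safety factor `cfl`; neither is stated on the slide, and the function name and sample elements are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def global_time_step(elements, cfl=0.5):
    """Return cfl * min_k(H_k / c_k).

    elements: iterable of (H_k, h_k, u_k) = (element size, fluid height,
    flow speed). The wave speed c_k = |u_k| + sqrt(g * h_k) is an
    assumption of this sketch.
    """
    return cfl * min(H / (abs(u) + math.sqrt(G * h)) for H, h, u in elements)


# Hypothetical mesh: one large deep-ocean element, one small coastal element.
elements = [(2000.0, 4000.0, 0.0), (200.0, 10.0, 1.0)]
dt = global_time_step(elements)
```

With these sample values the deep-ocean element sets the step, because the wave speed grows with depth faster than the element size does. A single global step therefore wastes work on every other element.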
This incomplete literature review is DG centric.
Related work
DG:
• Triangular mesh methods for the Neutron transport equation (Reed & Hill, 1973).
• The RKDG methods (I-V) (Cockburn & Shu, 1988-1997).
• Books: Hesthaven & Warburton, 2008. Riviere, 2008. Di Pietro & Ern, 2012.
DG-SWE:
• DG for 2D flow and transport in shallow water (Aizinger & Dawson, 2002).
• High-order h-adaptive DG for ocean modeling (Bernard, Remacle, et al., 2007).
DG-SWE-PP:
• A wetting and drying treatment of RKDG solution to the SWE (Bunya, Kubatko, Westerink, & Dawson, 2009).
• Positivity-preserving high-order well-balanced DG methods for SWE (Xing, Zhang, & Shu, 2010).
DG-multi-rate:
• GPU AB multi-rate DG FEM simulation of high-frequency EM fields (Gödel et al., 2010).
• Multi-rate for explicit DG with applications to geophysical flows (Seny et al., 2013).
• A local time-stepping RKDG for hurricane storm surge modeling (Dawson, 2014).
Parallel DG:
• Nodal DG on GPUs (Klöckner, Hesthaven, Bridge, & Warburton 2009).
• DG for wave propagation through coupled elastic–acoustic media (Wilcox et al., 2010).
High-order triangles extension of: Bunya, Kubatko, Westerink, and Dawson, 2009.
Xing, Zhang, and Shu, 2010.
Stability: Positivity preservation
Ensure that fluid height is positive via post-processing after each time-step
• If mean < cutoff: set the solution to cutoff:
• If mean > cutoff: limit the solution to P1 and limit the slope while preserving the mean:
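The two cases above can be sketched for a linear (P1) height on one triangle. This is an illustration under stated assumptions, not the thesis limiter: it uses a linear scaling of the slope about the cell mean in the style of Xing, Zhang, and Shu, and it relies on the fact that for a P1 function the cell mean equals the average of the three vertex values. The function name and the cutoff value are hypothetical.

```python
def limit_positive(vertex_heights, cutoff=1e-3):
    """Positivity post-processing for a P1 fluid height on one triangle.

    vertex_heights: heights at the three vertices of the triangle.
    Returns limited vertex heights whose minimum is at least `cutoff`.
    """
    n = len(vertex_heights)
    mean = sum(vertex_heights) / n
    if mean <= cutoff:
        # Nearly dry element: set the whole solution to the cutoff.
        return [cutoff] * n
    lowest = min(vertex_heights)
    if lowest >= cutoff:
        # Already positive everywhere: nothing to do.
        return list(vertex_heights)
    # Scale the slope about the mean so that the minimum equals the cutoff.
    theta = (mean - cutoff) / (mean - lowest)
    return [mean + theta * (h - mean) for h in vertex_heights]


wet = limit_positive([-0.5, 1.0, 2.0])
dry = limit_positive([-0.1, 0.0, 0.1])
```

The scaling branch leaves the cell mean unchanged, so mass is conserved there. The dry branch changes the mean, which matches the "set the solution to cutoff" rule on the slide.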
Test case is adapted from: A. Ern, S. Piperno, and K. Djadel, 2008.
Positivity test: 1D rarefaction
Effect of positivity preserving on the solution accuracy
[Figure: fluid height profile \(h\) versus \(x\) at \(t = 0\) and at \(t > 0\).]
• Uniform mesh with element size \(H\).
• Global L2 errors behave like \(O(H^{1.5})\).
• Local L2 errors far from the wave front behave like \(O(H^{2.2})\), \(O(H^{3.0})\), \(O(H^{2.9})\) for N=1,2,3.
Adaptive mesh refinement is indicated as described by Blaise & Giraldo.
Positivity test: 1D rarefaction
Point-wise estimated order of convergence
EOC is computed at each point from the errors on a sequence of meshes.
Accuracy decreases near the wave front.
[Figure: point-wise error rate (EOC) over x and t.]
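An estimated order of convergence is commonly computed from errors on two successive meshes as \(\log(e_1/e_2) / \log(H_1/H_2)\). The sketch below assumes that definition; the function name and the synthetic error data are hypothetical.

```python
import math


def estimated_orders(mesh_sizes, errors):
    """EOC between each pair of successive meshes.

    mesh_sizes and errors are listed from the coarsest to the finest mesh.
    """
    return [
        math.log(errors[i] / errors[i + 1])
        / math.log(mesh_sizes[i] / mesh_sizes[i + 1])
        for i in range(len(errors) - 1)
    ]


# Synthetic errors that behave like H^1.5.
sizes = [0.1, 0.05, 0.025]
errors = [3.0 * H ** 1.5 for H in sizes]
orders = estimated_orders(sizes, errors)
```

For a point-wise estimate the same formula is applied at each sample point, using the error measured at that point on every mesh.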
Gear and Wells, 1984.
Gödel, Schomann, Warburton, and Clemens, 2010.
Timestep restriction: Multi-rate integration
Multi-rate multi-step Adams-Bashforth 3rd order
• Use varying time-step size: \(\Delta t_k \propto \frac{H_k}{c_k}\)
• Extrapolate the coarse element traces to intermediate time step.
• Single-rate vs two-rate speedup: \(\approx \frac{8 \times 2}{4 \times 2 + 4} \approx 1.33\)
[Figure: elements advancing from \(t^n\) to \(t^{n+1}\) with step \(dt\), and fine elements taking two steps of \(\frac{1}{2} dt\) through \(t^{n+1/2}\).]
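The speedup estimate above counts element updates. A minimal sketch of that count follows, assuming time steps differ by factors of two between levels (as the ½ dt in the figure suggests) and ignoring the cost of trace extrapolation and kernel launches. The function names are hypothetical.

```python
import math


def assign_levels(local_dts, max_levels):
    """Assign each element a multi-rate level; level 0 is the finest.

    An element on level l advances with a step of 2**l * min(local_dts).
    """
    dt_min = min(local_dts)
    levels = []
    for dt in local_dts:
        level = int(math.floor(math.log2(dt / dt_min)))
        levels.append(min(level, max_levels - 1))
    return levels


def multirate_speedup(levels):
    """Ratio of single-rate to multi-rate element updates per coarsest step."""
    coarsest = max(levels)
    single_rate = len(levels) * 2 ** coarsest
    multi_rate = sum(2 ** (coarsest - level) for level in levels)
    return single_rate / multi_rate


# Eight elements: four need the fine step, four can take twice that step.
levels = assign_levels([1.0] * 4 + [2.0] * 4, max_levels=2)
speedup = multirate_speedup(levels)
```

This reproduces the slide's count: 8 × 2 updates for single-rate against 4 × 2 + 4 for two-rate.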
Computational approach: Work partitioning
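Init → Volume kernel → Surface kernel → Update kernel → PP kernel → End

Kernel           Scope
Volume kernel    Elemental
Surface kernel   Element coupling
Update kernel    Elemental
PP kernel        Elemental

Volume and surface kernels:
\[
r_k^n = N(Q_k^n) + L(Q_k^{n,-}, Q_k^{n,+})
\]
Update kernel:
\[
Q_k^{n+1} = Q_k^n + \Delta t_k \sum_{s=0}^{2} \alpha_s r_k^{n-s}
\]
PP kernel:
\[
Q_k^{n+1} = \Pi_{PP} \, Q_k^{n+1}
\]
[Figure: cubature and quadrature nodes on an element \(D^k\).]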
Details: Hesthaven and Warburton, 2008.
Cubature database: Cools, 1999.
Flop intense volume & surface kernels
Dense matrix-vector products for each triangle
[Figure: interpolation of nodal values on a triangle.]
Computation on each triangle is independent of the others.
Computation on each node is independent of the others in a triangle.
\[
r_k = N(Q_k) + L(Q_k^-, Q_k^+)
\]
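The two independence statements above describe a batched dense matrix-vector product. The sketch below is plain Python for illustration only, with the loops annotated by the parallel unit they would map to on a GPU; the operator and data are hypothetical, and a real kernel would be written in a kernel language such as OCCA, CUDA, or OpenCL.

```python
def apply_operator(operator, nodal_values):
    """Apply one dense operator to the nodal values of every triangle.

    operator: Np x Np matrix as a list of rows.
    nodal_values: K rows of Np values, one row per triangle.
    """
    n_nodes = len(operator)
    result = []
    for element_values in nodal_values:  # independent: triangles go to work groups
        out = [0.0] * n_nodes
        for i in range(n_nodes):  # independent: one thread per node
            out[i] = sum(operator[i][j] * element_values[j] for j in range(n_nodes))
        result.append(out)
    return result


# Hypothetical two-node operator applied to two elements.
D = [[-0.5, 0.5], [-0.5, 0.5]]
Q = [[0.0, 2.0], [1.0, 1.0]]
R = apply_operator(D, Q)
```

Because neither loop carries a dependency, the number of triangles handled per work group is a free tuning parameter.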
Klöckner, Warburton, Bridge, and Hesthaven, 2009.
We tune for the optimal #triangles processed by a work group for each kernel.
Mapping on to GPU
Fine-grain parallelism of DG operations
• Triangle patches are processed by a core.
• Each node is processed by a thread.
[Figure: global memory feeding Core 0, Core 1, and Core 2, each with its own shared memory.]
Figure courtesy: David Medina
Medina and Warburton, 2014. libocca.org
Portability through OCCA
Extensive, unified portable multi-threading approach
[Figure: OCCA overview. Application: OCCA API and kernel language, processed by a parser into an IR. Backends: Pthreads, OpenCL, NVIDIA CUDA, Intel COI. Hardware: x86, Xeon Phi, AMD GPU, NVIDIA GPU.]
pasiDG simulator
K40: http://www.nvidia.com/content/tesla/pdf/nvidia-tesla-kepler-family-datasheet.pdf
Tahiti:http://www.legitreviews.com/amd-radeon-hd-7990-6gb-malta-video-card-review_2177
http://ark.intel.com/products/63697/Intel-Core-i7-3930K-Processor-12M-Cache-up-to-3_80-GHz
Test of portability
GPUs and multi-core CPUs

                    NVIDIA-K40        AMD-Tahiti   i7-3930K
#FPU                2880              2x2048       6 cores x8
Peak SP (Tflop/s)   4.29              2x4.1        ~0.3
Memory (GB)         12                2x3          -
Bandwidth (GB/s)    288               2x288        51
Cost ($)            5000              1000         400
Programming model   CUDA or OpenCL    OpenCL       OpenMP or OpenCL
http://5.grgs.ro/images/products/1/722028/798278/normal/tesla-k40c-12gb-ddr5-384-bit-c017619440d7fa91e6b1d38b5b41584f.jpg
NVIDIA K40
http://images.anandtech.com/doci/6915/7990Angle.jpg
AMD Tahiti
http://3dprint.com/wp-content/uploads/2015/01/h2.jpg
Intel i7
Gandham, Medina, and Warburton, 2015.
Simulator performance
pasiDG on different architectures
[Figure: throughput (mega nodes/s) versus polynomial order (1 to 6). Panel 1 (0 to 240): OCCA:OpenCL and OCCA:OpenMP on Intel i7. Panel 2 (0 to 1800): OCCA:OpenCL and OCCA:CUDA on NVIDIA K40, OCCA:OpenCL on AMD Tahiti.]
• 1 GPU ≅ 10x (6 CPU cores)
• The kernels need further CPU:OpenMP optimization.
High-Order vs Low-Order
Time (and memory) for accuracy: translating vortex test case
• Refine “p” & “H” for memory & compute efficiency.
• High-order may not be expensive.
[Figure: L2 error in fluid height (10^-7 to 10^-1) versus compute time (s) and versus memory required, for N=1 to N=5.]
2km-35km resolution. Mesh is generated using GMSH.
Thanks to Frank Giraldo and his team for assistance.
Case I: 2004 Indian Ocean Tsunami
Configuration
Coastal aligned mesh (130K)
Bathymetry data (NOAA)
Initial conditions (Okada model)
Multirate (4 levels)
Domain of interest
SWE absorbing layers: Modave, Deleersnijder, and Delhez, 2010.
Case I: Simulation
2004 Indian Ocean tsunami simulation using degree 4 triangles
• Quartic polynomials in each triangle.
• Absorbing layers near open boundaries.
• Bottom friction is critical for stability.
• OCCA:CUDA on NVIDIA K40 GPU.
• 10 hrs of real time simulated.
• ~15 mins of compute time.
India
Malaysia
Madagascar
Only fluid heights in [-0.4 m, 0.4 m] are shown in the video.
DGCOM: DG Coastal Ocean Model, developed by Frank Giraldo’s research group.
Disclaimer: DGCOM timings are from personal conversations with Frank Giraldo.
Case I: Run-time performance
Compute time for Indian Ocean Tsunami benchmark on a single GPU
Simulator   Polynomial degree N   Compute time   Real time / compute time   Normalized #dofs
DGCOM       1                     ~8 hr          1.25                       1
pasiDG      1                     1 min          650                        1
pasiDG      2                     3 min          208                        2
pasiDG      3                     6 min          95                         3.3
pasiDG      4                     13 min         47                         5

pasiDG:
• OCCA:CUDA threading model.
• NVIDIA K40c GPU.
• Single precision arithmetic.
• Simulation of 10 hrs real time.
DGCOM:
• Fortran serial implementation.
• Double precision arithmetic.
• Simulation of 10 hrs real time.
Gauge data source: CSIR, National Institute of Oceanography.
Predictions are similar to that of DGCOM. Gopala Krishnan, Averas, and Giraldo.
Case I: Gauge data comparison
Validation with tidal gauge recordings
Chennai Station (80.30E, 13.10N)
Mormugao Station (73.80E, 15.42N)
[Figure: wave height (cm) versus minutes after the earthquake (0 to 600) at each station; curves: gauge record, N=1, N=2, N=3, N=4.]
• Arrival time prediction is reasonable.
• Wave heights need improvement.
• DG schemes are self-consistent.
0.4km-5200km resolution. Mesh is generated using GMSH.
Thanks to Bruno Seny, Université catholique de Louvain, for the mesh. Seny et. al., 2013.
Case II: 2011 Japan Tsunami
Configuration
Stereographic mesh (~1.8M)
Bathymetry (NOAA)
Initial conditions (Okada)
Multi-rate (12 levels)
Domain of interest
*Only 161 elements (< 0.01%) take fine time-step.
Case II: Single-rate vs multi-rate
Multi-rate scheme for efficient time-stepping
[Figure: number of triangles (0 to 1,200,000) in each multi-rate level (1 to 12).]

Speedup versus number of multi-rate levels:
# multi-rate levels   1     2     3     4     5     6     7     8     9     10    11    12
Speedup               1.0   1.9   3.7   6.4   9.0   9.4   9.5   9.5   9.5   9.4   9.4   9.2

• Kernels are inefficient when launched with fewer elements.
• 9.5x speedup with 9 levels.
SWE in Stereographic coordinates: Lanser, 2002. Dueben, 2012.
Case II: Simulation
2011 Japan tsunami simulation using degree 2 triangles
• SWE in stereographic plane.
• Modified CFL condition.
• Quadratic polynomials in each triangle.
• Multi-rate time integration with 9 levels.
• Bottom friction is critical for stability.
• OCCA:CUDA on NVIDIA K40 GPU.
• 10 hrs of real time simulated in ~1.5 hrs.
Only fluid heights in [-0.5 m, 0.5 m] are shown in the video.
Japan
USA
Alaska
Australia
SLIM: Second-generation Louvain-la-Neuve Ice-ocean Model.
Thanks to Bruno Seny for SLIM performance results.
Case II: Run-time performance
Compute time for Japan Tsunami benchmark on a single GPU
Simulator   Polynomial degree N   Compute time   Real time / compute time   Normalized #dofs
pasiDG      1                     36 min         16                         1
SLIM        2                     ~75 min        8                          2
pasiDG      2                     100 min        6                          2
pasiDG      3                     220 min        2.7                        3.3
pasiDG      4                     460 min        1.3                        5

pasiDG:
• OCCA:CUDA threading model.
• 1 x NVIDIA K40c GPU.
• Single precision arithmetic.
• Simulation of 10 hrs real time.
• Multi-rate time stepping.
SLIM:
• MPI parallel CPU code.
• 256 x Intel Xeon(R) E5649 cores.
• Double precision arithmetic.
• Simulation of 10 hrs real time.
• Multi-rate time stepping.
Three-dimensional model
Benoit Cushman-Roisin, 2011.
Not considered: Horizontal diffusion, density variation
Governing equations
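Incompressible, hydrostatic, Boussinesq
Incompressibility:
\[
\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} + \frac{\partial w}{\partial z} = 0
\]
x-momentum:
\[
\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} + v \frac{\partial u}{\partial y} + w \frac{\partial u}{\partial z} = -g \frac{\partial \eta}{\partial x} + \nu \frac{\partial^2 u}{\partial z^2}
\]
y-momentum:
\[
\frac{\partial v}{\partial t} + u \frac{\partial v}{\partial x} + v \frac{\partial v}{\partial y} + w \frac{\partial v}{\partial z} = -g \frac{\partial \eta}{\partial y} + \nu \frac{\partial^2 v}{\partial z^2}
\]
Free surface height:
\[
\frac{\partial \eta}{\partial t} + \frac{\partial}{\partial x} \left( \int_B^{\eta} u \, dz \right) + \frac{\partial}{\partial y} \left( \int_B^{\eta} v \, dz \right) = 0
\]
The last term in each momentum equation is the vertical diffusion.
[Figure: cross-section showing \(\eta\), \(B\), \(h\), and the \(z\) axis, with regions where \(B < 0\) and \(B > 0\).]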
Similar approaches: Iskandarani, Haidvogel, Levin, 2003.
Mellor, Hakkinen, Ezer, and Patchen, 2002. Gerdes,1993.
Sigma coordinate system
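Domain is changing during the simulation
[Figure: the physical domain in \((x, z)\) between \(z = B(x,y)\) and \(z = \eta(x,y)\), of height \(h(x,y)\), mapped to a fixed domain in \((x, \sigma)\) between \(\sigma = 0\) and \(\sigma = 1\).]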
• Fixed frame of reference/coordinate system.
• Transform the PDE into sigma coordinate system.
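A common choice for such a transformation is the linear map \(\sigma = (z - B) / (\eta - B)\), which sends the bottom to \(\sigma = 0\) and the free surface to \(\sigma = 1\). The slides do not state the map explicitly, so the sketch below is an assumption, and the function names are hypothetical.

```python
def to_sigma(z, bathymetry, surface):
    """Map a physical height z in [B, eta] to sigma in [0, 1] (assumed linear map)."""
    depth = surface - bathymetry  # h = eta - B
    if depth <= 0.0:
        raise ValueError("fluid height must be positive")
    return (z - bathymetry) / depth


def from_sigma(sigma, bathymetry, surface):
    """Inverse map: recover the physical height z from sigma."""
    return bathymetry + sigma * (surface - bathymetry)


# Hypothetical water column: bottom at -4000 m, free surface at 0.5 m.
bottom = to_sigma(-4000.0, -4000.0, 0.5)
top = to_sigma(0.5, -4000.0, 0.5)
roundtrip = from_sigma(to_sigma(-1000.0, -4000.0, 0.5), -4000.0, 0.5)
```

The mesh in \((x, y, \sigma)\) stays fixed while \(\eta\) moves, and the motion of the physical domain enters the PDE only through metric factors of the map.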
Triangular prisms: Maggi, 2011
Spatial discretization
Prismatic elements in sigma coordinates
Solution is a tensor-product of 1D and triangle polynomials and is discontinuous between prisms.
\[
Q(x,y,\sigma) = \sum_{i,j} Q_{ij} \, l_i(x,y) \, l_j(\sigma)
\]
\[
\eta(x,y) = \sum_i \eta_i \, l_i(x,y)
\]
Spatial discretization
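DG discretization of momentum equations
P1 prisms: Blaise, Comblen, Legat, Remacle, Deleersnijder, and Lambrechts, 2010.
[Figure: triangle \(D^k\) extruded along \(\sigma\) into prism \(E^k\), with horizontal faces \(\partial E_h^k\).]
1.
\[
\int_{E^k} \varphi \psi \left( \frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} + v \frac{\partial u}{\partial y} + wG \frac{\partial u}{\partial \sigma} + g \frac{\partial \eta}{\partial x} - \nu G_z^2 \frac{\partial^2 u}{\partial \sigma^2} \right) = 0
\]
2.
\[
\int_{E^k} \varphi \psi \frac{\partial u}{\partial t} + \int_{E^k} \varphi \psi \left( u \frac{\partial u}{\partial x} + v \frac{\partial u}{\partial y} + wG \frac{\partial u}{\partial \sigma} + g \frac{\partial \eta}{\partial x} \right) = - \int_{E^k} \nu G_z^2 \, \varphi \frac{\partial \psi}{\partial \sigma} \frac{\partial u}{\partial \sigma} - \int_{\partial E_h^k} \varphi \psi \left( \tau [u] + \lambda [\eta] \right)
\]
for all \(\varphi \in P^N(D^k)\), \(\psi \in P^{N_z}([0,1])\).
3.
\[
\frac{dQ_k}{dt} = A_k + D_k, \qquad Q_k = (u_k, v_k)^T
\]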
One matrix-free Conjugate Gradient solve per prism element.
Other IMEX approaches: Kang, Oh, Nam, and Giraldo. Comblen et. al., 2010.
Time stepping (IMEX)
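Explicit advection and implicit diffusion
Free surface (explicit):
\[
\frac{d\eta_k}{dt} = r_k, \qquad \eta_k^{n+1} = \eta_k^n + \Delta t \sum_{s=0}^{2} \alpha_s r_k^{n-s}
\]
Momentum (explicit advection, implicit diffusion):
\[
\frac{dQ_k}{dt} = A_k + D_k, \qquad Q_k^{n+1} = Q_k^n + \Delta t \sum_{s=0}^{2} \alpha_s A_k^{n-s} + \Delta t \, D_k^{n+1}
\]
Incompressibility:
\[
I_k(Q_k, w_k, \eta_k) = 0, \qquad w_k^{n+1} = (I_k)^{-1} \left( Q_k^{n+1}, \eta_k^{n+1} \right)
\]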
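The momentum update can be sketched for a scalar model problem \(dq/dt = A(q) + d\,q\), where the stiff linear term \(d\,q\) stands in for vertical diffusion. This is an illustration, not the thesis solver: it assumes the standard AB3 coefficients, replaces the per-prism conjugate gradient solve with a scalar division, and fills the start-up history with the initial explicit term. All names and parameter values are hypothetical.

```python
# Standard third-order Adams-Bashforth coefficients (assumed for alpha_s).
AB3 = (23.0 / 12.0, -16.0 / 12.0, 5.0 / 12.0)


def imex_step(q, advection_history, dt, diffusion_rate):
    """One IMEX step: AB3 on the explicit term, backward Euler on d * q.

    Solves q_new = q + dt * sum_s alpha_s A^{n-s} + dt * d * q_new for q_new.
    """
    explicit = q + dt * sum(a * adv for a, adv in zip(AB3, advection_history))
    return explicit / (1.0 - dt * diffusion_rate)


def integrate(q0, advection, diffusion_rate, dt, n_steps):
    """Advance n_steps, keeping the last three explicit terms (newest first)."""
    q = q0
    history = [advection(q0)] * 3  # first-order start-up, an assumption here
    for _ in range(n_steps):
        q = imex_step(q, history, dt, diffusion_rate)
        history = [advection(q)] + history[:2]
    return q


# Stiff diffusion rate d = -50 with dt = 0.1: an explicit Euler step would
# multiply q by (1 + dt * d) = -4 and blow up, while the implicit treatment decays.
q_final = integrate(1.0, lambda q: -0.5 * q, -50.0, 0.1, 20)
```

Treating only the diffusion implicitly keeps the time step tied to the advective CFL limit and not to the much stricter diffusive one.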
Mapping on to GPU-I
Dense matrix-vector products for each triangle plane in a prism for horizontal gradients
Similar approach: vNek spectral element code, Remacle, Gandham, and Warburton.
[Figure: global memory feeding Core 0, Core 1, and Core 2, each with its own shared memory.]
• A group of prisms is processed by a core.
• Each node is processed by a 2D thread.
• Nodes in a plane are processed by contiguous threads.
Mapping on to GPU-II
Dense matrix-vector products for each vertical line in a prism for vertical gradients
Similar approach: vNek spectral element code, Remacle, Gandham, and Warburton.
[Figure: global memory feeding Core 0, Core 1, and Core 2, each with its own shared memory.]
• A group of prisms is processed by a core.
• Each node is processed by a 2D thread.
• Nodes in a vertical line are processed by contiguous threads.
Case I: Performance comparison
Computational overhead by moving towards 3D simulation

Polynomial degree Nz   Compute time   3D/2D   Normalized #dofs wrt 2D
1                      18 min         20      2.3
2                      21 min         23      3.3
3                      25 min         28      4.3
4                      28 min         31      5.3
• Horizontal polynomial N = 1.
• Single rate time stepping.
• OCCA:CUDA model on NVIDIA K40c.
• Single precision arithmetic.
• Simulation of 10 hrs real time.
With kernel tuning and multi-rate time stepping, the projected 3D/2D cost is ~10x for Nz=4.
Can we expect better estimates with 3D?
Case I: Gauge data comparison
Validation with tidal gauge recordings
Chennai Station (80.30E, 13.10N)
Mormugao Station (73.80E, 15.42N)
• Simulation results are similar to 2D results.
• 3D did not resolve wave height discrepancies.
[Figure: wave height (cm) versus minutes after the earthquake (0 to 600) at each station; curves: gauge record, 2D, Nz=1, Nz=2, Nz=3, Nz=4.]
Performance discussion
Step-by-step performance gain for the benchmark

Optimization                       Type                  Speedup   Cumulative speedup
SWE & DG tailored implementation   Data structures etc   >1x
GPU acceleration                   Choice of hardware    ~10x      ~10x
Multi-rate time stepping           Choice of algorithm   ~10x      ~100x
Single precision                   Choice of precision   ~2x       ~200x
OCCA:CUDA vs OpenCL                Portable software     >1x       ~200x
Modern hardware, hardware and physics aware algorithms contribute to performance.
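The cumulative column is the running product of the individual speedups. A short check of that arithmetic, using only the rows with a quantified factor (the two ">1x" rows are left out because the slides give no number for them):

```python
from itertools import accumulate
from operator import mul

# Approximate per-stage speedups quoted in the table.
stage_speedups = [
    ("GPU acceleration", 10.0),
    ("Multi-rate time stepping", 10.0),
    ("Single precision", 2.0),
]

# Running product of the stage factors.
cumulative = list(accumulate((factor for _, factor in stage_speedups), mul))
```

The factors multiply because each optimization acts on the run time left by the previous one.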
ALMOND: Algebraic Multi-grid on Numerous Devices.
Thesis summary
• Faster than real time simulation with high-order DG on a workstation.
• First high-order DG simulation of tsunami.
• High-order extension of positivity preserving limiter.
• Multi-rate time stepping scheme.
• Focusing on node performance.
• Comparable performance to a 256 core CPU cluster on a single GPU.
• Extension from two-dimensions to three-dimensions.
• Projected overhead is ~10x.
• Not shown: Comparison of conservative and non-conservative DG in shallows.
• Not mentioned: ALMOND, a fully accelerated & truly portable algebraic multi-grid.
Need to hand over this project to someone interested.. :)
Future work
Possible extensions to thesis work
• Scalability:
• MPI for simulation on a cluster of GPUs/CPUs.
• Domain decomposition techniques for multi-rate time stepping.
• Numerical algorithms:
• Adaptive mesh refinement.
• Model improvements:
• Fine-grain bathymetry data.
• Stability and error analysis for three-dimensional code.
• Non-hydrostatic model.
Acknowledgements
The list is not complete…..
Thesis Committee: Dr. Tim Warburton, Dr. Beatrice Riviere, Dr. William Symes, Dr. Stephen Bradshaw
Academic Visits: Dr. Francis Giraldo, Dr. Lucas Wilcox, Dr. Paul Fischer, Dr. Mark Ainsworth
Collaborators: David Medina, Dr. Jesse Chan, Dr. Axel Modave, Dr. Bruno Seny, Dr. Jean-François Remacle
Internships: HyPerComp Inc, Stoneridge Technology, Hess Corporation
Thanks to Cynthia Wood, Jizhou Li, Zheng Wang…

Numerical and analytical studies of single and multiphase starting jets and p...
 
A non-stiff numerical method for 3D interfacial flow of inviscid fluids.
A non-stiff numerical method for 3D interfacial flow of inviscid fluids.A non-stiff numerical method for 3D interfacial flow of inviscid fluids.
A non-stiff numerical method for 3D interfacial flow of inviscid fluids.
 
SPAA11
SPAA11SPAA11
SPAA11
 
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG dataSpectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
 
Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applications
 
[AAAI2018] Multispectral Transfer Network: Unsupervised Depth Estimation for ...
[AAAI2018] Multispectral Transfer Network: Unsupervised Depth Estimation for ...[AAAI2018] Multispectral Transfer Network: Unsupervised Depth Estimation for ...
[AAAI2018] Multispectral Transfer Network: Unsupervised Depth Estimation for ...
 
[AAAI2018] Multispectral Transfer Network: Unsupervised Depth Estimation for ...
[AAAI2018] Multispectral Transfer Network: Unsupervised Depth Estimation for ...[AAAI2018] Multispectral Transfer Network: Unsupervised Depth Estimation for ...
[AAAI2018] Multispectral Transfer Network: Unsupervised Depth Estimation for ...
 
Course-Notes__Advanced-DSP.pdf
Course-Notes__Advanced-DSP.pdfCourse-Notes__Advanced-DSP.pdf
Course-Notes__Advanced-DSP.pdf
 

rgDefense

  • 1. High Performance High-Order Numerical Methods: Applications in Ocean Modeling Ph.D. Thesis Defense Rajesh Gandham Advised by: Prof. Tim Warburton
  • 2. Tsunami: simulation time
    Q: Is “accurate” faster-than-real-time simulation possible?
    Q: Can the forward wave problem be solved fast enough for stochastic analysis?
    Image sources: http://nctr.pmel.noaa.gov/twebinfo/images/NOAAMethod.png, http://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake_and_tsunami, http://www.sms-tsunami-warning.com/, http://www.thedailysheeple.com/nine-years-ago-today-the-indian-ocean-tsunami_122013
  • 3. Thesis goals
    • Accurate PDE models and numerical methods.
    • Many-core hardware architectures.
    • Efficient algorithm techniques.
    [Figure: elements advancing with time steps dt and ½ dt.]
  • 4. Tsunami models
    A variety of models for tsunami propagation
    Ray tracing
    • “Leading edge” of tsunami wave
    • Compute travel times
    • No amplitude information
    Two-dimensional PDE
    • Depth-averaged fluid flow
    • Amplitude information
    Three-dimensional PDE
    • General conservation law
    • Full volume information
    [Figure: the models placed along axes of compute speed (low, high) and accuracy (high, low); two ✅ marks.]
    Reference: Kundu, 2007.
  • 5. Extended TeamTalk overview
    Part 1: Two-dimensional shallow water model
    • PDE model and discretization
    • Computational approach
    Part 2: pasiDG simulator
    • Performance results
    • Case studies
    Part 3: Three-dimensional oceanic model
    • PDE model and discretization
    • Preliminary performance results
    Part 4: Summary & Future work
    Image sources: http://regmedia.co.uk/2011/03/17/honshu-tsunami-propagation.jpg, http://architosh.com/wp-content/uploads/2013/06/gpu_macpro2013.jpg, http://audilab.bmed.mcgill.ca/AudiLab/teach/fem/anything_35.gif
  • 6. Two-dimensional Shallow Water Modeling Théorie du mouvement non permanent des eaux, avec application aux crues des rivières et à l'introduction des marées dans leurs lits, A. J. C. Barré de Saint-Venant, 1871.
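  • 7. Governing equations
    The shallow water equations for depth-averaged flow:
    $$\frac{\partial \eta}{\partial t} + \frac{\partial (hu)}{\partial x} + \frac{\partial (hv)}{\partial y} = 0 \quad \text{(free surface height)}$$
    $$\frac{\partial (hu)}{\partial t} + \frac{\partial}{\partial x}\left(hu^2 + \tfrac{1}{2} g h^2\right) + \frac{\partial}{\partial y}\left(huv\right) = +fhv - \tau_{bx} - gh\frac{\partial B}{\partial x} \quad \text{(x-momentum)}$$
    $$\frac{\partial (hv)}{\partial t} + \frac{\partial}{\partial x}\left(huv\right) + \frac{\partial}{\partial y}\left(hv^2 + \tfrac{1}{2} g h^2\right) = -fhu - \tau_{by} - gh\frac{\partial B}{\partial y} \quad \text{(y-momentum)}$$
    with $h = \eta - B$. The terms $fhv$ and $fhu$ are the Coriolis force, $\tau_{bx}$ and $\tau_{by}$ the bottom friction, and $B$ the bathymetry.
    [Figure: cross-section showing the free surface $\eta$, fluid height $h$, and bathymetry $B$, with regions $B < 0$ and $B > 0$.]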
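
    The flux and source terms above can be sketched in a few lines of Python. This is only an illustration, not the pasiDG implementation: the function names, the value of g, and the choice to store the conserved state as (h, hu, hv) are assumptions (with a time-independent bathymetry B, ∂η/∂t equals ∂h/∂t).

```python
import numpy as np

GRAVITY = 9.81  # m/s^2; illustrative value, not taken from the slides


def swe_fluxes(h, hu, hv, g=GRAVITY):
    """Return the x-flux F and y-flux G of the shallow water equations."""
    u, v = hu / h, hv / h
    F = np.array([hu, hu * u + 0.5 * g * h**2, hu * v])
    G = np.array([hv, hu * v, hv * v + 0.5 * g * h**2])
    return F, G


def swe_source(h, hu, hv, f, tau_bx, tau_by, dBdx, dBdy, g=GRAVITY):
    """Return the source S: Coriolis, bottom friction, and bathymetry terms."""
    return np.array([
        0.0,
        f * hv - tau_bx - g * h * dBdx,
        -f * hu - tau_by - g * h * dBdy,
    ])
```

    For still water over a flat bottom the fluxes reduce to the hydrostatic pressure term ½ g h² in the two momentum components.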
  • 8. Spatial discretization
    Nodal discontinuous Galerkin discretization
    • Represent the solution on each triangle with a high-order polynomial.
    • The solution is discontinuous between triangles.
    • Why high-order? Accuracy.
    $$\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} = S, \quad x \in \Omega, \qquad Q(x) = \sum_i Q_i\, l_i(x)$$
    [Figure: triangular mesh of the domain $\Omega$.]
    References: Reed and Hill, 1973. Overview: Hesthaven and Warburton, 2008.
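  • 9. Discretization
    Standard discontinuous Galerkin variational form & Adams-Bashforth time integration
    $$\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} = S, \quad x \in \Omega$$
    1. Find $Q \in \left(V(D^k)\right)^3$ such that, for all $\phi \in V(D^k)$,
    $$0 = \int_{D^k} \phi \left(\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} - S\right)$$
    2. Integrate by parts and introduce numerical fluxes $F^*, G^*$:
    $$\int_{D^k} \phi \frac{\partial Q}{\partial t} = \int_{D^k} \frac{\partial \phi}{\partial x} F + \int_{D^k} \frac{\partial \phi}{\partial y} G + \int_{D^k} \phi S - \int_{\partial D^{k,f}} \phi \left(n_x F^* + n_y G^*\right)$$
    3. Semi-discrete form:
    $$\frac{dQ_k}{dt} = r_k = N(Q_k) + L(Q_k^-, Q_k^+)$$
    4. Third-order Adams-Bashforth update:
    $$Q_k^{n+1} = Q_k^n + \Delta t \sum_{s=0}^{2} \alpha_s\, r_k^{n-s}$$
    Completed by choosing well-balanced Lax-Friedrichs fluxes: Xing, Zhang, and Shu, 2010.
    [Figure: element $D^k$ with face normals $n_{k,1}$, $n_{k,2}$, $n_{k,3}$.]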
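
    The time update in step 4 can be sketched as follows. The slide does not list the coefficients α_s; the values below are the standard third-order Adams-Bashforth coefficients, and the Euler/AB2 start-up for the first two steps is an assumption made here for illustration (the function name `ab3_integrate` is hypothetical).

```python
import numpy as np

# Standard AB3 coefficients alpha_0, alpha_1, alpha_2 (newest residual first).
AB3 = (23.0 / 12.0, -16.0 / 12.0, 5.0 / 12.0)


def ab3_integrate(rhs, q0, dt, nsteps):
    """Advance dq/dt = rhs(q) with AB3, starting up with Euler and AB2."""
    q = np.asarray(q0, dtype=float)
    history = []  # residuals r^n, r^{n-1}, r^{n-2}, newest first
    for _ in range(nsteps):
        history.insert(0, rhs(q))
        history = history[:3]
        if len(history) == 1:
            coeffs = (1.0,)        # forward Euler for the first step
        elif len(history) == 2:
            coeffs = (1.5, -0.5)   # AB2 for the second step
        else:
            coeffs = AB3
        q = q + dt * sum(a * r for a, r in zip(coeffs, history))
    return q
```

    In a DG code `rhs` would evaluate N(Q) + L(Q⁻, Q⁺) for every element; here any callable works, for example `ab3_integrate(lambda q: -q, 1.0, 0.01, 100)` for a decaying scalar.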
  • 10. Practical issues
    Numerical problems in moving towards a practical simulation
    Time step restriction:
    • Time step is dictated by the global CFL condition: $\Delta t \propto \min_k \frac{H_k}{c_k}$.
    Stability:
    • Fluid height is not guaranteed to be positive.
    • Unphysical and unstable.
    Computational cost:
    • High-order representation results in high arithmetic complexity: $O\!\left(\frac{T N^6}{H^3}\right)$.
  • 11. Related work
    Our approach: DG-SWE-multi-rate-PP-GPU. This incomplete literature review is DG-centric.
    DG:
    • Triangular mesh methods for the neutron transport equation (Reed & Hill, 1973).
    • The RKDG methods (I-V) (Cockburn & Shu, 1988-1997).
    • Books: Hesthaven & Warburton, 2008. Riviere, 2008. Pietro & Ern, 2012.
    DG-SWE:
    • DG for 2D flow and transport in shallow water (Aizinger & Dawson, 2002).
    • High-order h-adaptive DG for ocean modeling (Bernard, Remacle, et al., 2007).
    DG-SWE-PP:
    • A wetting and drying treatment of RKDG solution to the SWE (Bunya, Kubatko, Westerink, & Dawson, 2009).
    • Positivity-preserving high-order well-balanced DG methods for SWE (Xing, Zhang, & Shu, 2010).
    DG-multi-rate:
    • GPU AB multi-rate DG FEM simulation of high-frequency EM fields (Gödel et al., 2010).
    • Multi-rate for explicit DG with applications to geophysical flows (Seny et al., 2013).
    • A local time-stepping RKDG for hurricane storm surge modeling (Dawson, 2014).
    Parallel DG:
    • Nodal DG on GPUs (Klöckner, Hesthaven, Bridge, & Warburton, 2009).
    • DG for wave propagation through coupled elastic-acoustic media (Wilcox et al., 2010).
  • 12. Stability: Positivity preservation
    Ensure that the fluid height is positive via post-processing after each time step.
    • If mean < cutoff: set the solution to the cutoff.
    • If mean > cutoff: limit the solution to P1 and limit the slope while preserving the mean.
    [Figures: fluid height h in an element before and after limiting, for each case.]
    High-order triangle extension of: Bunya, Kubatko, Westerink, and Dawson, 2009. Xing, Zhang, and Shu, 2010.
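
    A minimal sketch of the two limiter branches, for the nodal heights of a single element. Note the swap: instead of the P1 projection and slope limiting described on the slide, this sketch uses a simpler linear scaling about the element mean (in the spirit of the Xing, Zhang, and Shu limiter). The function name, the use of quadrature weights to define the mean, and the default cutoff are all assumptions.

```python
import numpy as np


def limit_height(h_nodes, weights, cutoff=1e-3):
    """Return nodal heights that stay at or above `cutoff` on one element."""
    h = np.asarray(h_nodes, dtype=float)
    w = np.asarray(weights, dtype=float)
    mean = np.dot(w, h) / w.sum()
    if mean <= cutoff:
        # Nearly dry element: set the whole solution to the cutoff.
        return np.full_like(h, cutoff)
    h_min = h.min()
    if h_min >= cutoff:
        return h.copy()  # already positive, nothing to do
    # Shrink deviations from the mean until the minimum equals the cutoff.
    theta = (mean - cutoff) / (mean - h_min)
    return mean + theta * (h - mean)
```

    The scaling branch leaves the element mean unchanged, which is the property the slide asks the limiter to preserve.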
  • 13. Positivity test: 1D rarefaction
    Effect of positivity preservation on the solution accuracy
    • Uniform mesh with element size $H$.
    • Global L2 errors behave like $O(H^{1.5})$.
    • Local L2 errors far from the wave front behave like $O(H^{2.2})$, $O(H^{3.0})$, $O(H^{2.9})$ for N = 1, 2, 3.
    [Figures: fluid height profile h(x) at t = 0 and at t > 0.]
    Test case is adapted from: A. Ern, S. Piperno, and K. Djadel, 2008.
  • 14. Positivity test: 1D rarefaction
    Point-wise estimated order of convergence (EOC)
    • EOC is computed at each point from the errors on a sequence of meshes.
    • Accuracy decreases near the wave front.
    [Figure: point-wise error rate (EOC) over x and t.]
    Adaptive mesh refinement is indicated as described by Blaise & Giraldo.
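
    One common way to compute such a point-wise EOC is the log-ratio of errors on consecutive meshes, sketched below. The slides do not state the exact formula used, so this is an assumption, as are the function name and array layout.

```python
import numpy as np


def pointwise_eoc(errors, mesh_sizes):
    """Estimate the convergence order at each point from consecutive meshes.

    errors:     array of shape (n_meshes, n_points), error at each sample point
    mesh_sizes: array of shape (n_meshes,), element size H of each mesh
    Returns an array of shape (n_meshes - 1, n_points).
    """
    e = np.asarray(errors, dtype=float)
    H = np.asarray(mesh_sizes, dtype=float)
    log_h_ratio = np.log(H[:-1] / H[1:])[:, None]
    return np.log(e[:-1] / e[1:]) / log_h_ratio
```

    A point whose error scales like H³ yields an EOC of 3, while a point near the wave front with error scaling like H^1.5 yields 1.5.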
  • 15. Timestep restriction: Multi-rate integration
    Multi-rate multi-step Adams-Bashforth, 3rd order
    • Use a varying time-step size: $\Delta t_k \propto \frac{H_k}{c_k}$.
    • Extrapolate the coarse element traces to the intermediate time step.
    • Single-rate vs two-rate speedup: $\approx \frac{8 \times 2}{4 \times 2 + 4} \approx 1.33$.
    [Figure: mesh with elements advancing at dt and ½ dt between $t_n$, $t_{n+1/2}$, and $t_{n+1}$.]
    References: Gear and Wells, 1984. Gödel, Schomann, Warburton, and Clemens, 2010.
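
    The speedup estimate on this slide counts element updates per coarse step. A small sketch of that count, together with a hypothetical power-of-two level assignment from local time steps (both function names and the level rule are assumptions; the cost of extrapolating coarse traces is ignored):

```python
import math


def multirate_speedup(elements_per_level):
    """Ratio of single-rate to multi-rate element updates per coarse step.

    elements_per_level[l] is the number of elements stepping with dt / 2**l,
    where l = 0 is the coarsest level.
    """
    finest = len(elements_per_level) - 1
    single_rate = sum(elements_per_level) * 2**finest
    multi_rate = sum(n * 2**l for l, n in enumerate(elements_per_level))
    return single_rate / multi_rate


def assign_levels(local_dt):
    """Assign each element the smallest level l with dt_coarse / 2**l <= local dt."""
    dt_coarse = max(local_dt)
    return [math.ceil(math.log2(dt_coarse / dt)) for dt in local_dt]
```

    With four coarse and four fine elements, `multirate_speedup([4, 4])` reproduces the 16/12 ≈ 1.33 figure from the slide.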
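  • 16. Computational approach: Work partitioning
    Between Init and End, each time step runs four kernels:
    • Volume kernel (elemental): $N(Q_k^n)$
    • Surface kernel (element coupling): $L(Q_k^{n,-}, Q_k^{n,+})$, giving $r_k^n = N(Q_k^n) + L(Q_k^{n,-}, Q_k^{n,+})$
    • Update kernel (elemental): $Q_k^{n+1} = Q_k^n + \Delta t_k \sum_{s=0}^{2} \alpha_s\, r_k^{n-s}$
    • PP kernel (elemental): $Q_k^{n+1} = \Pi_{PP}\, Q_k^{n+1}$
    [Figure: kernel flow chart; two items are marked ✅.]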
  • 17. Flop-intense volume & surface kernels
    Dense matrix-vector products for each triangle: $r_k = N(Q_k) + L(Q_k^-, Q_k^+)$
    • Computation on each triangle is independent of the others.
    • Computation on each node is independent of the others in a triangle.
    [Figure: interpolation to cubature and quadrature nodes on $D^k$.]
    Details: Hesthaven and Warburton, 2008. Cubature database: Cools, 1999.
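
    The independence noted above means the per-triangle work is a batch of small dense products with one shared operator. A NumPy sketch of that pattern follows; it stands in for the GPU kernels only conceptually, omits geometric factors, and uses a hypothetical function name and array layout.

```python
import numpy as np


def apply_operator(D, Q):
    """Apply one reference-element matrix to every element at once.

    D: (Np, Np) dense operator, e.g. an interpolation or derivative matrix
    Q: (K, Np, n_fields) nodal values on K triangles
    Returns an array with the same shape as Q.
    """
    return np.einsum("ij,kjf->kif", D, Q)
```

    Each output entry depends on a single element (index k) and a single output node (index i), which is what allows one thread per node and one group of triangles per core.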
  • 18. Mapping onto GPU
    Fine-grain parallelism of DG operations
    • Triangle patches are processed by a core.
    • Each node is processed by a thread.
    • We tune for the optimal number of triangles processed by a work group for each kernel.
    [Figure: global memory connected to cores 0, 1, and 2, each with its own shared memory.]
    Reference: Klöckner, Warburton, Bridge, and Hesthaven, 2009.
  • 19. Portability through OCCA
    Extensive, unified, portable multi-threading approach
    [Figure: OCCA stack. Layers shown: application; OCCA API; kernel language, parser, and IR; backends (Pthreads, OpenCL, Intel COI, NVIDIA CUDA); hardware (x86, Xeon Phi, AMD GPU, NVIDIA GPU). Figure courtesy: David Medina.]
    Reference: Medina and Warburton, 2014. libocca.org
  • 21. Test of portability
    GPUs and multi-core CPUs

    |                    | NVIDIA K40     | AMD Tahiti | Intel i7-3930K   |
    |--------------------|----------------|------------|------------------|
    | #FPU               | 2880           | 2x2048     | 6 cores x8       |
    | Peak SP (Tflop/s)  | 4.29           | 2x4.1      | ~0.3             |
    | Memory (GB)        | 12             | 2x3        | -                |
    | Bandwidth (GB/s)   | 288            | 2x288      | 51               |
    | Cost ($)           | 5000           | 1000       | 400              |
    | Programming model  | CUDA or OpenCL | OpenCL     | OpenMP or OpenCL |

    Sources: K40: http://www.nvidia.com/content/tesla/pdf/nvidia-tesla-kepler-family-datasheet.pdf Tahiti: http://www.legitreviews.com/amd-radeon-hd-7990-6gb-malta-video-card-review_2177 i7: http://ark.intel.com/products/63697/Intel-Core-i7-3930K-Processor-12M-Cache-up-to-3_80-GHz
  • 22. Simulator performance
    pasiDG on different architectures
    [Figure, two panels: throughput (Mega nodes/s) vs polynomial order 1 to 6. CPU panel: OCCA:OpenCL and OCCA:OpenMP on Intel i7. GPU panel: OCCA:OpenCL and OCCA:CUDA on NVIDIA K40, OCCA:OpenCL on AMD Tahiti.]
    • 1 GPU ≅ 10x (6 CPU cores).
    • The kernels need further CPU:OpenMP optimization.
    Reference: Gandham, Medina, and Warburton, 2015.
  • 23. High-Order vs Low-Order
    Time (and memory) for accuracy: translating vortex test case
    [Figure, two panels: L2 error in fluid height vs memory required and vs compute time (s), for N = 1 to 5.]
    • Refine “p” and “H” for memory and compute efficiency.
    • High-order may not be expensive.
  • 24. Case I: 2004 Indian Ocean Tsunami
    Configuration
    • Coastal aligned mesh (130K), 2 km to 35 km resolution, generated using GMSH.
    • Bathymetry data (NOAA).
    • Initial conditions (Okada model).
    • Multi-rate (4 levels).
    [Figure panels: mesh, bathymetry, initial conditions, multi-rate levels, domain of interest.]
    Thanks to Frank Giraldo and his team for assistance.
  • 25. Case I: Simulation
    2004 Indian Ocean tsunami simulation using degree 4 triangles
    • Quartic polynomials in each triangle.
    • Absorbing layers near open boundaries.
    • Bottom friction is critical for stability.
    • OCCA:CUDA on NVIDIA K40 GPU.
    • 10 hrs of real time in ~15 mins of simulation time.
    [Video: region spanning India, Malaysia, and Madagascar. Only fluid heights in the range [-0.4 m, 0.4 m] are shown.]
    SWE absorbing layers: Modave, Deleersnijder, and Delhez, 2010.
  • 26. Case I: Run-time performance
    Compute time for the Indian Ocean Tsunami benchmark on a single GPU

    | Simulator | Polynomial degree N | Compute time | Real time / compute time | Normalized #dofs |
    |-----------|---------------------|--------------|--------------------------|------------------|
    | DGCOM     | 1                   | ~8 hr        | 1.25                     | 1                |
    | pasiDG    | 1                   | 1 min        | 650                      | 1                |
    | pasiDG    | 2                   | 3 min        | 208                      | 2                |
    | pasiDG    | 3                   | 6 min        | 95                       | 3.3              |
    | pasiDG    | 4                   | 13 min       | 47                       | 5                |

    pasiDG:
    • OCCA:CUDA threading model.
    • NVIDIA K40c GPU.
    • Single precision arithmetic.
    • Simulation of 10 hrs real time.
    DGCOM:
    • Fortran serial implementation.
    • Double precision arithmetic.
    • Simulation of 10 hrs real time.
    DGCOM: DG Coastal Ocean Model, developed by Frank Giraldo’s research group. Disclaimer: DGCOM timings are from personal conversations with Frank Giraldo.
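
    The "Normalized #dofs" column is consistent with the number of nodes in a degree-N triangle, (N+1)(N+2)/2, divided by the N = 1 count. The sketch below reproduces that column and the real-time ratio; the function names are hypothetical, and ratios computed from the rounded minute counts differ slightly from the table, which presumably used unrounded timings.

```python
def nodes_per_triangle(N):
    """Number of nodes for a full degree-N polynomial space on a triangle."""
    return (N + 1) * (N + 2) // 2


def normalized_dofs(N, N_ref=1):
    """Degrees of freedom per element relative to degree N_ref."""
    return nodes_per_triangle(N) / nodes_per_triangle(N_ref)


def real_time_ratio(simulated_hours, compute_minutes):
    """How many times faster than real time the simulation runs."""
    return simulated_hours * 60.0 / compute_minutes
```

    For example, `normalized_dofs(4)` gives 5.0, and `real_time_ratio(10, 480)` gives the 1.25 reported for an ~8 hr run.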
  • 27. Case I: Gauge data comparison
    Validation with tidal gauge recordings
    [Figures: wave height (cm) vs minutes after earthquake (0 to 600) at Chennai Station (80.30E, 13.10N) and Mormugao Station (73.80E, 15.42N); gauge record compared with N = 1, 2, 3, 4.]
    • Arrival time prediction is reasonable.
    • Wave heights need improvement.
    • DG schemes are self-consistent.
    Gauge data source: CSIR, National Institute of Oceanography. Predictions are similar to those of DGCOM. Gopala Krishnan, Averas, and Giraldo.
  • 28. Case II: 2011 Japan Tsunami
    Configuration
    • Stereographic mesh (~1.8M), 0.4 km to 5200 km resolution, generated using GMSH.
    • Bathymetry (NOAA).
    • Initial conditions (Okada).
    • Multi-rate (12 levels).
    [Figure panels: mesh, bathymetry, initial conditions, multi-rate levels, domain of interest.]
    Thanks to Bruno Seny, Université catholique de Louvain, for the mesh. Seny et al., 2013.
  • 29. Case II: Single-rate vs multi-rate
    Multi-rate scheme for efficient time-stepping
    [Figure, left panel: #triangles (0 to 1,200,000) per level, levels 1 to 12.]
    [Figure, right panel: speedup vs number of multi-rate levels. Values for 1 to 12 levels: 1.0, 1.9, 3.7, 6.4, 9.0, 9.4, 9.5, 9.5, 9.5, 9.4, 9.4, 9.2.]
    • 9.5x speedup with 9 levels.
    • Kernels are inefficient when launched with fewer elements.
    *Only 161 elements (< 0.01%) take the fine time step.
  • 30. Case II: Simulation
    2011 Japan tsunami simulation using degree 2 triangles
    • SWE in stereographic plane.
    • Modified CFL condition.
    • Quadratic polynomials in each triangle.
    • Multi-rate time integration with 9 levels.
    • Bottom friction is critical for stability.
    • OCCA:CUDA on NVIDIA K40 GPU.
    • 10 hrs of real time in ~1.5 hrs.
    [Video: region spanning Japan, USA, Alaska, and Australia. Only fluid heights in the range [-0.5 m, 0.5 m] are shown.]
    SWE in stereographic coordinates: Lanser, 2002. Dueben, 2012.
  • 31. Case II: Run-time performance
    Compute time for the Japan Tsunami benchmark on a single GPU

    | Simulator | Polynomial degree N | Compute time | Real time / compute time | Normalized #dofs |
    |-----------|---------------------|--------------|--------------------------|------------------|
    | pasiDG    | 1                   | 36 min       | 16                       | 1                |
    | SLIM      | 2                   | ~75 min      | 8                        | 2                |
    | pasiDG    | 2                   | 100 min      | 6                        | 2                |
    | pasiDG    | 3                   | 220 min      | 2.7                      | 3.3              |
    | pasiDG    | 4                   | 460 min      | 1.3                      | 5                |

    pasiDG:
    • OCCA:CUDA threading model.
    • 1 x NVIDIA K40c GPU.
    • Single precision arithmetic.
    • Simulation of 10 hrs real time.
    • Multi-rate time stepping.
    SLIM:
    • MPI parallel CPU code.
    • 256 x Intel Xeon(R) E5649 cores.
    • Double precision arithmetic.
    • Simulation of 10 hrs real time.
    • Multi-rate time stepping.
    SLIM: Second-generation Louvain-la-Neuve Ice-ocean Model. Thanks to Bruno Seny for SLIM performance results.
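  • 33. Governing equations
    Incompressible, hydrostatic, Boussinesq
    $$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} + \frac{\partial w}{\partial z} = 0 \quad \text{(incompressibility)}$$
    $$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + w\frac{\partial u}{\partial z} = -g\frac{\partial \eta}{\partial x} + \upsilon\frac{\partial^2 u}{\partial z^2} \quad \text{(x-momentum)}$$
    $$\frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} + w\frac{\partial v}{\partial z} = -g\frac{\partial \eta}{\partial y} + \upsilon\frac{\partial^2 v}{\partial z^2} \quad \text{(y-momentum)}$$
    $$\frac{\partial \eta}{\partial t} + \frac{\partial}{\partial x}\left(\int_B^{\eta} u\, dz\right) + \frac{\partial}{\partial y}\left(\int_B^{\eta} v\, dz\right) = 0 \quad \text{(free surface height)}$$
    The $-g\,\partial\eta/\partial x$ and $-g\,\partial\eta/\partial y$ terms come from the free surface height; the $\upsilon$ terms are vertical diffusion.
    [Figure: cross-section showing $\eta$, $h$, $B$, and the vertical coordinate $z$, with regions $B < 0$ and $B > 0$.]
    Reference: Benoit Cushman-Roisin, 2011. Not considered: horizontal diffusion, density variation.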
  • 34. Sigma coordinate system
    The domain changes during the simulation.
    • Fixed frame of reference/coordinate system.
    • Transform the PDE into the sigma coordinate system.
    [Figure: physical domain between z = B(x,y) and z = η(x,y), with height h(x,y), mapped to the fixed domain between σ = 0 and σ = 1.]
    Similar approaches: Iskandarani, Haidvogel, Levin, 2003. Mellor, Hakkinen, Ezer, and Patchen, 2002. Gerdes, 1993.
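
    A minimal sketch of one such transform, assuming the linear map σ = (z − B)/(η − B), which places σ = 0 at the bottom and σ = 1 at the free surface. The slides show only the two bounds, so the orientation and the function names are assumptions.

```python
def z_to_sigma(z, eta, B):
    """Map physical height z in [B, eta] to sigma in [0, 1]."""
    return (z - B) / (eta - B)


def sigma_to_z(sigma, eta, B):
    """Map sigma in [0, 1] back to the physical height z in [B, eta]."""
    return B + sigma * (eta - B)
```

    Under this map a vertical derivative becomes ∂/∂z = (1/h) ∂/∂σ with h = η − B, so the computational domain stays fixed while η moves.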
  • 35. Spatial discretization
    Prismatic elements in sigma coordinates
    The solution is a tensor product of 1D and triangle polynomials and is discontinuous between prisms.
    $$Q(x,y,\sigma) = \sum_{i,j} Q_{ij}\, l_i(x,y)\, l_j(\sigma), \qquad \eta(x,y) = \sum_i \eta_i\, l_i(x,y)$$
    Triangular prisms: Maggi, 2011.
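  • 36. Spatial discretization
    DG discretization of momentum equations, for all $\phi \in P^N(D^k)$, $\psi \in P^{N_z}([0,1])$:
    1. Weak form on each prism:
    $$\int_{E^k} \phi\psi \left(\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + wG\frac{\partial u}{\partial \sigma} + g\frac{\partial \eta}{\partial x} - \upsilon G_z^2 \frac{\partial^2 u}{\partial \sigma^2}\right) = 0$$
    2. After integrating the diffusion term by parts and adding penalty terms:
    $$\int_{E^k} \phi\psi \frac{\partial u}{\partial t} + \int_{E^k} \phi\psi \left(u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + wG\frac{\partial u}{\partial \sigma} + g\frac{\partial \eta}{\partial x}\right) = -\int_{E^k} \upsilon G_z^2\, \phi \frac{\partial \psi}{\partial \sigma}\frac{\partial u}{\partial \sigma} - \int_{\partial E_h^k} \phi\psi \left(\tau [u] + \lambda [\eta]\right)$$
    3. Semi-discrete form:
    $$\frac{dQ_k}{dt} = A_k + D_k, \qquad Q_k = (u_k, v_k)^T$$
    [Figure: prism element $E^k$ over triangle $D^k$, with boundary $\partial E_h^k$ and vertical coordinate $\sigma$.]
    P1 prisms: Blaise, Comblen, Legat, Remacle, Deleersnijder, and Lambrechts, 2010.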
  • 37. Time stepping (IMEX)
    Explicit advection and implicit diffusion
    Free surface (explicit):
    $$\frac{d\eta_k}{dt} = r_k, \qquad \eta_k^{n+1} = \eta_k^n + \Delta t \sum_{s=0}^{2} \alpha_s\, r_k^{n-s}$$
    Momentum (explicit advection, implicit diffusion):
    $$\frac{dQ_k}{dt} = A_k + D_k, \qquad Q_k^{n+1} = Q_k^n + \Delta t \sum_{s=0}^{2} \alpha_s\, A_k^{n-s} + \Delta t\, D_k^{n+1}$$
    Incompressibility:
    $$I_k(Q_k, w_k, \eta_k) = 0, \qquad w_k^{n+1} = (I_k)^{-1}(Q_k^{n+1}, \eta_k^{n+1})$$
    One matrix-free Conjugate Gradient solve per prism element.
    Other IMEX approaches: Kang, Oh, Nam, and Giraldo. Comblen et al., 2010.
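
    The momentum update can be sketched for a linear diffusion operator, where the implicit term leads to the system (I − Δt D) Qⁿ⁺¹ = Qⁿ + Δt Σ α_s Aⁿ⁻ˢ. This sketch swaps the matrix-free Conjugate Gradient solve mentioned on the slide for a dense direct solve, and uses the standard AB3 coefficients; the function name and the dense matrix `D` are assumptions for illustration.

```python
import numpy as np

# Standard AB3 coefficients alpha_0, alpha_1, alpha_2 (newest term first).
AB3 = (23.0 / 12.0, -16.0 / 12.0, 5.0 / 12.0)


def imex_step(q, adv_history, D, dt):
    """One IMEX step: AB3 for advection, backward Euler for linear diffusion.

    q:           state vector Q^n on one element
    adv_history: advection terms [A^n, A^{n-1}, A^{n-2}]
    D:           dense matrix of the linear diffusion operator
    """
    rhs = q + dt * sum(a * A for a, A in zip(AB3, adv_history))
    system = np.eye(len(q)) - dt * D
    return np.linalg.solve(system, rhs)
```

    Treating only the vertical diffusion implicitly keeps the solve local to each prism element, which is why a small per-element solver suffices.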
  • 38. Mapping onto GPU-I
    Dense matrix-vector products for each triangle plane in a prism for horizontal gradients
    • A group of prisms is processed by a core.
    • Each node is processed by a 2D thread.
    • Nodes in a plane are processed by contiguous threads.
    [Figure: global memory connected to cores 0, 1, and 2, each with its own shared memory.]
    Similar approach: vNek spectral element code, Remacle, Gandham, and Warburton.
  • 39. Mapping onto GPU-II
    Dense matrix-vector products for each vertical line in a prism for vertical gradients
    • A group of prisms is processed by a core.
    • Each node is processed by a 2D thread.
    • Nodes in a vertical line are processed by contiguous threads.
    [Figure: global memory connected to cores 0, 1, and 2, each with its own shared memory.]
    Similar approach: vNek spectral element code, Remacle, Gandham, and Warburton.
  • 40. Case I: Performance comparison
    Computational overhead of moving towards 3D simulation

    | Polynomial degree Nz | Compute time | 3D/2D | Normalized #dofs wrt 2D |
    |----------------------|--------------|-------|-------------------------|
    | 1                    | 18 min       | 20    | 2.3                     |
    | 2                    | 21 min       | 23    | 3.3                     |
    | 3                    | 25 min       | 28    | 4.3                     |
    | 4                    | 28 min       | 31    | 5.3                     |

    • Horizontal polynomial degree N = 1.
    • Single-rate time stepping.
    • OCCA:CUDA model on NVIDIA K40c.
    • Single precision arithmetic.
    • Simulation of 10 hrs real time.
    With kernel tuning and multi-rate time stepping, the projected 3D/2D cost is ~10x for Nz = 4.
  • 41. Case I: Gauge data comparison
    Validation with tidal gauge recordings
    Can we expect better estimates with 3D?
    [Figures: wave height (cm) vs minutes after earthquake (0 to 600) at Chennai Station (80.30E, 13.10N) and Mormugao Station (73.80E, 15.42N); gauge record compared with 2D and Nz = 1, 2, 3, 4.]
    • Simulation results are similar to 2D results.
    • 3D did not resolve wave height discrepancies.
  • 42. Performance discussion
    Step-by-step performance gain for the benchmark

    | Optimization                     | Type                 | Speedup | Cumulative speedup |
    |----------------------------------|----------------------|---------|--------------------|
    | SWE & DG tailored implementation | Data structures etc. | >1x     |                    |
    | GPU acceleration                 | Choice of hardware   | ~10x    | ~10x               |
    | Multi-rate time stepping         | Choice of algorithm  | ~10x    | ~100x              |
    | Single precision                 | Choice of precision  | ~2x     | ~200x              |
    | OCCA:CUDA vs OpenCL              | Portable software    | >1x     | ~200x              |

    Modern hardware and hardware- and physics-aware algorithms contribute to performance.
  • 43. Thesis summary
    • Faster-than-real-time simulation with high-order DG on a workstation.
        • First high-order DG simulation of tsunami.
        • High-order extension of positivity-preserving limiter.
        • Multi-rate time stepping scheme.
    • Focusing on node performance.
        • Comparable performance to a 256-core CPU cluster on a single GPU.
    • Extension from two dimensions to three dimensions.
        • Projected overhead is ~10x.
    • Not shown: comparison of conservative and non-conservative DG in shallows.
    • Not mentioned: ALMOND, a fully accelerated and truly portable algebraic multi-grid.
    ALMOND: Algebraic Multi-grid on Numerous Devices.
  • 44. Future work
    Possible extensions to thesis work
    • Scalability:
        • MPI for simulation on a cluster of GPUs/CPUs.
        • Domain decomposition techniques for multi-rate time stepping.
    • Numerical algorithms:
        • Adaptive mesh refinement.
    • Model improvements:
        • Fine-grain bathymetry data.
        • Stability and error analysis for three-dimensional code.
        • Non-hydrostatic model.
    Need to hand over this project to someone interested. :)
  • 45. Acknowledgements
    The list is not complete.
    • Thesis Committee: Dr. Tim Warburton, Dr. Beatrice Riviere, Dr. William Symes, Dr. Stephen Bradshaw.
    • Academic Visits: Dr. Francis Giraldo, Dr. Lucas Wilcox, Dr. Paul Fischer, Dr. Mark Ainsworth.
    • Collaborators: David Medina, Dr. Jesse Chan, Dr. Axel Modave, Dr. Bruno Seny, Dr. Jean-François Remacle.
    • Internships: HyPerComp Inc, Stoneridge Technology, Hess Corporation.
    Thanks to Cynthia Wood, Jizhou Li, Zheng Wang…