Parallelization techniques and hardware for 2D modelling
Mark Britton, DHI
FMA Conference
Santa Clara, CA, 09/03/14
Acknowledgements 
• DHI Denmark (Johan Hartnack & Ole Sorensen)
• DHI New Zealand (Colin Roberts & Greg Whyte)
• Various HPC providers that have allowed DHI to freely install and test software on their facilities
Objectives 
• Simplify the language of hardware and of programming for specific hardware
• Share where we (DHI) are at, and where we are going
• Demonstrate what is possible in 2D modelling
[Slide graphic: keywords Cluster, CUDA, Shared Memory]
MIKE 21 – Different numerical solutions
• Single Grid (and nested)
• Curvilinear (river morphology)
• Flexible Mesh (triangles & quads)
Model set-up used for benchmarking (Mediterranean Sea)
• Flexible Mesh (Finite Volume) explicit code optimized for parallelization and distributed simulation
Parallelization – Shared memory approach
• The calculations are carried out on multiple processors on the same PC, all accessing the same memory (Open Multi-Processing, or OpenMP). A minimal sketch of this pattern follows below.
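To make the idea concrete, here is a minimal C sketch of the shared-memory approach: one explicit update step over a 2D array, with OpenMP splitting the loop across threads that all read and write the same memory. The grid size, variable names and the simple averaging stencil are illustrative assumptions, not DHI's actual scheme.

```c
/* Shared-memory (OpenMP) sketch: one explicit update over a 2D grid.
 * Compile with: cc -fopenmp openmp_sketch.c -o openmp_sketch */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NX 1024
#define NY 1024

int main(void)
{
    double *h     = malloc(NX * NY * sizeof *h);      /* water depth        */
    double *h_new = malloc(NX * NY * sizeof *h_new);  /* next time step     */
    for (int i = 0; i < NX * NY; i++) h[i] = 1.0;     /* flat initial state */

    /* All threads share the same arrays; OpenMP splits the row loop. */
    #pragma omp parallel for
    for (int j = 1; j < NY - 1; j++) {
        for (int i = 1; i < NX - 1; i++) {
            /* Placeholder stencil: average of the four neighbours. */
            h_new[j * NX + i] = 0.25 * (h[j * NX + i - 1] + h[j * NX + i + 1] +
                                        h[(j - 1) * NX + i] + h[(j + 1) * NX + i]);
        }
    }

    printf("updated one step using up to %d threads\n", omp_get_max_threads());
    free(h);
    free(h_new);
    return 0;
}
```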
Parallelization – Shared memory approach
[Chart: speed-up factor vs. number of processors, including and excluding side-feeding, for three meshes of 80,968, 323,029 and 1,292,116 elements]
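For reference, the speed-up factor plotted here and in the later charts is the usual ratio of serial to parallel run time; this is the standard definition, which the slides assume rather than state:

```latex
% Speed-up on p processors, and the corresponding parallel efficiency
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}
```

Here \(T_1\) is the run time on one processor and \(T_p\) the run time on \(p\) processors; perfect scaling corresponds to \(S(p) = p\).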
Parallelization – Distributed memory approach
• The calculations are carried out on multiple processors, each with its own memory space, and the required information is passed between the processors at regular intervals (Message Passing Interface, or MPI).
• Basic concept (a minimal sketch follows this list):
  - Domain decomposition into physical sub-domains
  - Each processor integrates the equations in its assigned sub-domain
  - Data exchange between sub-domains is based on the halo layer/elements concept
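Here is a minimal C sketch of the distributed-memory pattern: a simple 1D decomposition of a row-ordered grid, with a one-row halo on each side exchanged between neighbouring ranks. The decomposition, sizes and names are illustrative assumptions; DHI's actual decomposition works on unstructured flexible-mesh sub-domains.

```c
/* Distributed-memory (MPI) sketch: halo-row exchange between sub-domains.
 * Compile with: mpicc mpi_sketch.c -o mpi_sketch && mpirun -np 4 ./mpi_sketch */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NX 512  /* cells per row */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = 256 / size;  /* rows owned by this rank (assumes even split) */
    /* Local block plus one halo row above and one below. */
    double *h = calloc((size_t)(rows + 2) * NX, sizeof *h);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange halo rows with the neighbouring sub-domains each step. */
    MPI_Sendrecv(&h[1 * NX],          NX, MPI_DOUBLE, up,   0,
                 &h[(rows + 1) * NX], NX, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&h[rows * NX],       NX, MPI_DOUBLE, down, 1,
                 &h[0],               NX, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ...integrate the equations on rows 1..rows using the halo data... */

    if (rank == 0) printf("halo exchange done on %d ranks\n", size);
    free(h);
    MPI_Finalize();
    return 0;
}
```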
Parallelization – Distributed memory approach
• High performance computing (HPC) has been one of the fastest-growing IT markets within the last five years
• Operating systems on HPC systems (June 2013): Linux 95.2%, Unix 3.2%, Mixed 0.8%, MS Windows 0.6%, BSD based 0.2%
[Pie chart: HPC operating-system share (Linux, Unix, Mixed, Windows, BSD, Mac)]
Parallelization – Distributed memory approach
[Chart: speed-up factor vs. number of processors on a High Performance Computing cluster, for three meshes of 80,968, 323,029 and 1,292,116 elements]
Parallelization – Utilizing GPU technology
• GeForce GTX TITAN GPU card: a middle-of-the-range gaming card that retails for approximately USD $1,000
Parallelization – Utilizing GPU technology
• The key calculations (2D) are carried out on the graphics processor.
• MIKE 21 FM and MIKE FLOOD FM are both GPU-enabled (same code); more products to come.
• Uniquely, for a coupled simulation (1D/2D) in MIKE FLOOD, the 1D calculations (structures/channels) are undertaken on the CPU.
• It is not possible to scale the degree of parallelization on a GPU (all cores are active all the time); scale using the resolution of the mesh instead.
• DHI software is optimized for CUDA technology, used in many GPU cards from the NVIDIA range; a minimal kernel sketch follows this list.
• DHI software can be run in both single and double precision.
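The sketch below shows the GPU pattern in CUDA C: the explicit update runs as a kernel with one thread per mesh element, which is why parallelism scales with mesh resolution rather than with a tunable core count. The flat element array and the trivial update body are illustrative assumptions only, not DHI's kernel.

```c
/* GPU (CUDA C) sketch: one update step, one thread per mesh element.
 * Compile with: nvcc cuda_sketch.cu -o cuda_sketch */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void update(double *h_new, const double *h, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        h_new[i] = h[i];  /* placeholder for the real flux/update step */
}

int main(void)
{
    const int n = 1 << 20;  /* ~1 million elements */
    double *h, *h_new;
    cudaMalloc(&h,     n * sizeof(double));
    cudaMalloc(&h_new, n * sizeof(double));
    cudaMemset(h, 0, n * sizeof(double));

    /* The launch grid covers every element at once: all cores stay busy,
     * and the only way to "scale" is a finer (larger) mesh. */
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    update<<<blocks, threads>>>(h_new, h, n);
    cudaDeviceSynchronize();

    printf("one GPU update step over %d elements\n", n);
    cudaFree(h);
    cudaFree(h_new);
    return 0;
}
```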
Parallelization – Utilizing GPU technology
[Chart: GPU speed-up in double precision, comparing 1st-order and 2nd-order schemes]
Hybrid Parallelization – A new frontier
• Combines GPU technology with MPI technology (a cluster of GPUs); a minimal hybrid sketch follows the cluster description below
IT4Innovations' Anselm Cluster at Ostrava University (Czech Republic):
• 3344 compute cores
• each node has 2 x Intel E5-2665 2.4 GHz (16 cores per node)
• 23 GPU-accelerated nodes
• 15 TB RAM
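To make the hybrid idea concrete, here is a minimal sketch of the pattern, combining the two earlier examples: each MPI rank binds to one GPU, runs the update kernel on its own sub-domain, and exchanges halo rows through MPI between steps. This is an illustrative sketch under the same assumed names and sizes as before, not DHI's implementation.

```c
/* Hybrid MPI + GPU sketch: one MPI rank per GPU, halo exchange via MPI.
 * Compile with: nvcc hybrid_sketch.cu -lmpi -o hybrid_sketch (paths vary) */
#include <mpi.h>
#include <stdio.h>
#include <cuda_runtime.h>

#define NX 512   /* cells per row */
#define ROWS 128 /* rows per rank */

__global__ void step(double *h, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) h[i] += 0.0;  /* placeholder update */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > 0) cudaSetDevice(rank % ngpus);  /* bind this rank to one GPU */

    int n = (ROWS + 2) * NX;  /* local block plus two halo rows */
    double *d_h;
    cudaMalloc(&d_h, n * sizeof(double));
    cudaMemset(d_h, 0, n * sizeof(double));

    double halo_out[NX], halo_in[NX];
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* One time step: compute on the GPU, then exchange one halo row. */
    step<<<(n + 255) / 256, 256>>>(d_h, n);
    cudaMemcpy(halo_out, d_h + 1 * NX, NX * sizeof(double),
               cudaMemcpyDeviceToHost);  /* first interior row    */
    MPI_Sendrecv(halo_out, NX, MPI_DOUBLE, up,   0,
                 halo_in,  NX, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_h + (ROWS + 1) * NX, halo_in, NX * sizeof(double),
               cudaMemcpyHostToDevice);  /* fill the lower halo row */

    if (rank == 0) printf("hybrid step done on %d rank(s)\n", size);
    cudaFree(d_h);
    MPI_Finalize();
    return 0;
}
```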
Hybrid Parallelization – A new frontier
[Chart: speed-up factor vs. number of GPUs, Mediterranean Sea model in double precision, for three meshes of 323,029, 1,292,116 and 5,156,238 elements]
Hybrid Parallelization – A new frontier
• A sample flood model used for benchmarking (not all elements are wet, so parallelization is less efficient)
• Mesh: 995,019 elements
Hybrid Parallelization – A new frontier
• Benchmarking using a flood model (not all elements are wet)
[Chart: MPI speed-up vs. number of nodes (16 cores per node), for meshes of 80k (all wet), 0.3 million (all wet), 1.3 million (all wet) and 1 million (not all wet) elements]
Hybrid Parallelization – A new frontier
• Benchmarking using a flood model (not all elements are wet)
[Chart: GPU speed-up vs. number of GPU nodes (1, 2, 4, 8, 16), for meshes of 0.3 million (all wet), 1.3 million (all wet), 5.2 million (all wet) and 1 million (not all wet) elements]
Hybrid Parallelization – A new frontier
• Benchmarking using a flood model (not all elements are wet)
GPU vs MPI:
• 1 GPU is about 5x faster than 16 cores
• 4 GPUs are about 4x faster than 64 cores
• 16 GPUs are nearly 3x faster than 256 cores
• 4 GPUs are faster than 256 cores
[Chart: run-time comparison, MPI (16, 32, 64, 128, 256 cores) vs. GPU (1, 2, 4, 8, 16 GPUs)]
Hybrid Parallelization – A case study
• Christchurch, New Zealand
Catchment area approx. 420 km², including three river systems in the model domain:
  - Avon River
  - Styx River
  - Heathcote River
2D model domain:
  - 4.2 million elements
  - 10 m x 10 m resolution flexible mesh (rectangular elements)
  - Distributed rainfall-runoff with no losses (rain-on-grid); extreme rainfall event, 21-hour storm
Hybrid Parallelization – A case study
• Christchurch, New Zealand
Run time on a desktop PC (MPI) is 8.9 hours:
  - 16-core Dell workstation
  - 2 x Intel® Xeon® CPU E5-2687W v2 (8 core, 3.40 GHz)
  - 32 GB of RAM
  - Windows 7 operating system
Run time with 1 x GeForce GTX TITAN GPU card is 3.1 hours (roughly a 2.9x speed-up)
Run time with 2 x GeForce GTX TITAN GPU cards is 1.7 hours (roughly a 5.2x speed-up)
GPU Perspectives
• The mathematical formulation in the GPU version is identical to the CPU version
• Coupled models (1D/2D) are enabled in the GPU version, allowing structures to be modelled, not just 2D flow
• GPU performance is excellent but highly dependent on the card
• Optimal performance is achieved for models with more than 400,000 elements
• GPU cards are much cheaper than the equivalent CPU hardware in terms of performance (up to 50x cheaper)
Conclusions
• The use of advanced parallelization techniques is key to delivering timely, detailed, accurate and consistent hydrodynamic modelling results.
• Large, detailed 1D/2D hydrodynamic models can be used in real-time and near-real-time applications like flood forecasting and disaster risk management.
• DHI software is ready to take full advantage of the next wave of hardware solutions with the hybrid MPI/GPU approach.
The Modelling Conundrum…..
I am a numerical modeller, and my models take a very long time to run.
My company/department has just invested $$$ in getting me some really fast new computer hardware so I can be more efficient, more productive and/or more profitable.
Tomorrow I will be super excited because:
(a) all my current models run so much faster than today
or
(b) I can start building even bigger models with even finer resolution.
?/10 numerical modellers choose (b)
Thank you for your attention
Mark Britton
Global Corporate Relationship Manager
mfb@dhigroup.com
ISO Certified for Software Development & Support
