© 2018 IBM Corporation
Task-based GPU acceleration in Computational
Fluid Dynamics with OpenMP 4.5 and CUDA in
OpenPOWER platforms.
OpenPOWER and AI ADG Workshop – BSC, Barcelona, Spain
June 2018
Samuel Antao
IBM Research, Daresbury, UK
IBM Research @ Daresbury, UK
IBM Research @ Daresbury, UK – STFC Partnership Mission
• 2015: £313 million investment over the next 5 years
• Agreement for IBM Collaborative Research and Development (R&D) that established the IBM Research presence in the UK
• Product and Services Agreement with IBM UK and Ireland
• Access to the latest data-centric and cognitive computing technologies, including IBM's world-class Watson cognitive computing platform
• Joint commercialization of intellectual property assets produced in the partnership
Help UK industries and institutions bring cutting-edge computational science, engineering and applicable technologies, such as data-centric cognitive computing, to boost growth and development of the UK economy
IBM Research @ Daresbury, UK – People
Over 26 computational
scientists and engineers
IBM Research @ Daresbury, UK – Research areas
• Case studies:
  – Smart Crop Protection – Precision Agriculture (Data science + Life sciences)
  – Improving disease diagnostics and personalised treatments (Life sciences + Machine learning)
  – Cognitive treatment plant (Engineering + Machine learning)
  – Parameterisation of engineering models (Engineering + Machine learning)
CFD and Algebraic-Multigrid
• Solve a set of partial-differential equations over several time steps
  – Discretization: unstructured vs. structured
  – Equations: velocity, pressure, turbulence
• Iterative solvers
  – Jacobi
  – Gauss-Seidel
  – Conjugate Gradient
• Multigrid approaches
  – Solve the problem at different resolutions (coarse and fine grids/meshes)
  – Fewer iterations for fine grids
• Algebraic multigrid (AMG) – encode the mesh information in algebraic format
  – Sparse matrices (a minimal sparse matrix-vector product sketch follows below)
source: http://web.utk.edu/~wfeng1/research.html
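The sketch below (illustrative only – not Code Saturne code; the kernel name and arguments are hypothetical) shows the basic operation these iterative solvers repeat on the algebraic form of the mesh: a sparse matrix-vector product over a matrix stored in CSR (compressed sparse row) format.

/* Minimal illustrative CUDA kernel: y = A*x with A stored in CSR format.
   row_ptr[row]..row_ptr[row+1] delimit the non-zeros of a row; col_idx and
   vals hold their column indices and values. */
__global__ void spmv_csr(int n_rows,
                         const int *row_ptr, const int *col_idx,
                         const double *vals, const double *x, double *y)
{
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n_rows) {
    double sum = 0.0;
    for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj)
      sum += vals[jj] * x[col_idx[jj]];
    y[row] = sum;
  }
}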
CFD and Algebraic-Multigrid
source: http://web.utk.edu/~wfeng1/research.html
[Diagram: grid partitioned across MPI ranks; within a node, CPUs and GPUs are connected via NVLINK™; nodes communicate over InfiniBand™.]
• Grid partitioned by MPI ranks
• Ranks distributed across nodes
• More than one rank executing on one node
• Challenges:
  – Different grids have different compute needs
  – Access strides vary; unstructured data accesses
  – CPU-GPU data movements
  – Regular communication between ranks: halo elements, residuals, synchronizations
CFD and Algebraic-Multigrid – Code Saturne
• Open-source – developed and maintained by EDF
• 350K lines of code:
– 50% C – 37% Fortran – 13% Python
• Rich ecosystem to configure/parameterise simulations, generate meshes
• History of good scalability
  Cores   | Time in solver | Efficiency (vs. previous row)
  262,144 | 789.79 s       | -
  524,288 | 403.18 s       | 97%

  MPI tasks | Time in solver | Efficiency (vs. previous row)
  524,288   | 70.114 s       | -
  1,048,576 | 52.574 s       | 66%
  1,572,864 | 45.731 s       | 76%

  105B Cell Mesh (MIRA, BGQ)
  13B Cell Mesh (MIRA, BGQ)
CFD and Algebraic-Multigrid – Execution time distribution
• Many components (kernels) contribute to the total execution time
• There are data dependencies between consecutive kernels
• There are opportunities to keep data in the device between kernels
• Some kernels have lower compute intensity; it can still be worthwhile to compute them on the GPU if the data is already there
[Chart: single-thread profiling, Code Saturne 5.0+ – execution time split between the Gauss-Seidel solver (velocity), the pressure solve (AMG) – matrix-vector mult. MSR, matrix-vector mult. CSR, dot products, multigrid setup, compute coarse cells from fine cells, other AMG-related – and other.]
Directive-based programming models
• Porting existing code to accelerators is time consuming…
• The sooner we have code running on the GPU, the sooner you can start…
  – … learning where the overheads are
  – … identifying what data patterns are being used
  – … spotting kernels performing poorly
  – … making decisions on what strategies can be used to improve performance
• Directive-based programming models can get you started much quicker
  – No need to bother with device memory allocation and data pointers
  – Implementation defaults already exploit device features
  – Easily create data environments where data resides in the GPU
  – Improve your code portability
• Clang C/C++ and the IBM XL C/C++/Fortran compilers provide OpenMP 4.5 support
• The PGI C/C++/Fortran compiler provides OpenACC support
• Can be complemented with existing GPU-accelerated libraries
  – cuSPARSE
  – AMGx
Directive-based programming models
• OpenMP 4.5 data environments – Code Saturne 5.0+ snippet – Conjugate Gradient
static cs_sles_convergence_state_t _conjugate_gradient(/* ... */)
{
  #pragma omp target data if (n_rows > GPU_THRESHOLD) \
      /* Move the result vector to the device and copy it back at the end of the scope */ \
      map(tofrom: vx[:vx_size]) \
      /* Move the right-hand side vector to the device */ \
      map(to: rhs[:n_rows]) \
      /* Allocate all auxiliary vectors in the device */ \
      map(alloc: _aux_vectors[:tmp_size])
  {

    /* Solver code */

  }
}

Listing 2: OpenMP 4.5 data environment for a level of the AMG solver.
Data that does not change during the computation of a level can be copied to the device at the beginning of the level. The result vector can also be kept in the device for a significant part of the execution, and only has to be copied to the host during halo exchange (a minimal sketch of this update follows below). OpenMP 4.5 makes managing the data according to these observations almost trivial: a single directive suffices to set the scope – see Listing 2.
All arrays reside in the device in this scope!
The programming model manages the host/device pointer mapping for you!
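As a minimal sketch of the halo-exchange point above (illustrative only – not the actual Code Saturne code; halo_exchange stands in for the MPI communication helper), the result vector can be refreshed on the host and pushed back to the device with target update directives issued inside the data region:

/* Sketch only – hypothetical helper name. Issued inside the data region of Listing 2. */
#pragma omp target update from(vx[:vx_size])  /* bring the current values to the host  */
halo_exchange(vx);                            /* MPI halo communication on the host    */
#pragma omp target update to(vx[:vx_size])    /* push the updated values to the device */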
Directive-based programming models
• OpenMP 4.5 target regions – Code Saturne 5.0+ snippet – Dot products
  /* ... tail of the vector multiply-and-add target region ... */
    vx[ii] += (alpha * dk[ii]);
    rk[ii] += (alpha * zk[ii]);
  }

  /* ... */
}

/* ... */

static void _cs_dot_xx_xy_superblock(cs_lnum_t n,
                                     const cs_real_t *restrict x,
                                     const cs_real_t *restrict y,
                                     double *xx,
                                     double *xy)
{
  double dot_xx = 0.0, dot_xy = 0.0;

  #pragma omp target teams distribute parallel for reduction(+: dot_xx, dot_xy) \
      if (n > GPU_THRESHOLD) \
      map(to: x[:n], y[:n]) \
      map(tofrom: dot_xx, dot_xy)
  for (cs_lnum_t i = 0; i < n; ++i) {
    const double tx = x[i];
    const double ty = y[i];
    dot_xx += tx*tx;
    dot_xy += tx*ty;
  }

  /* ... */

  *xx = dot_xx;
  *xy = dot_xy;
}

Listing 3: Example of GPU port for two stream kernels: vector multiply-and-add and dot product.
[Diagram: the host calls into the OpenMP runtime library, which allocates the mapped data in the device; the OpenMP team of the target region is distributed over CUDA blocks; on exit the runtime library releases the data in the device and control returns to the host.]
Directive-based programming models
• OpenMP 4.5 – Code Saturne 5.0+ – AMG NVPROF timeline
[NVPROF timeline: one AMG cycle, with a detailed view of the coarse-grid levels. Visible issues: allocations of small variables, high kernel launch latency, back-to-back kernels.]
CUDA-based tuning
• Avoid expensive GPU memory allocation/deallocation:
  – Allocate a memory chunk once and reuse it
• Use pinned memory for data copied frequently to the GPU
  – Avoid pageable-to-pinned memory copies by the CUDA implementation
• Explore asynchronous execution of CUDA API calls
  – Start copying data to/from the device while the host is preparing the next set of data or the next kernel
• Use CUDA constant memory to copy arguments for multiple kernels at once
  – The latency of copying tens of KB to the GPU is similar to that of copying 1 B
  – Dual buffering enables copies to happen asynchronously
• Produce specialized kernels instead of relying on runtime checks
  – CUDA is a C++ extension, so kernels and device functions can be templated
  – Leverage compile-time optimizations for the relevant sequences of kernels
  – The NVCC toolchain does very aggressive inlining
  – Lower register pressure = more occupancy
A minimal sketch combining some of these ideas follows Listing 10 below.
template <KernelKinds Kind>
__device__ int any_kernel(KernelArgsBase &Arg, unsigned n_rows_per_block) {
  switch (Kind) {
  /* ... */
  // Dot product:
  //
  case DP_xx:
    dot_product<Kind>(
      /* version */          Arg.getArg<cs_lnum_t>(0),
      /* n_rows */           Arg.getArg<cs_lnum_t>(1),
      /* x */                Arg.getArg<cs_real_t *>(2),
      /* y */                nullptr,
      /* z */                nullptr,
      /* res */              Arg.getArg<cs_real_t *>(3),
      /* n_rows_per_block */ n_rows_per_block);
    break;
  /* ... */
  }
  __syncthreads();
  return 0;
}

template <KernelKinds... Kinds>
__global__ void any_kernels(void) {

  auto *KA = reinterpret_cast<KernelArgsSeries *>(&KernelArgsSeriesGPU[0]);
  const unsigned n_rows_per_block = KA->RowsPerBlock;
  unsigned idx = 0;

  int dummy[] = { any_kernel<Kinds>(KA->Args[idx++], n_rows_per_block)... };
  (void) dummy;
}

Listing 10: Device entry-point function for kernel execution.
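The self-contained sketch below illustrates a few of the tuning points above: one device buffer allocated up front and reused, dual pinned host buffers, and kernel arguments staged in constant memory on a stream so the copy overlaps with host work. It is hypothetical example code, not the Code Saturne implementation – the Args struct and the scale kernel are stand-ins for KernelArgsSeries and any_kernels.

#include <cuda_runtime.h>

/* Hypothetical example – not Code Saturne code. */
struct Args { int n_rows; double alpha; };    /* stand-in for a kernel-argument block */
__constant__ Args KernelArgs;                 /* read by every kernel of one sequence */

__global__ void scale(double *x) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < KernelArgs.n_rows)
    x[i] *= KernelArgs.alpha;
}

int main() {
  const int n = 1 << 20;
  double *d_work;
  Args *h_args[2];                            /* dual buffering of argument blocks */
  cudaEvent_t copied[2];
  cudaStream_t s;

  cudaMalloc(&d_work, n * sizeof(double));    /* one chunk, allocated once and reused */
  cudaMemset(d_work, 0, n * sizeof(double));
  cudaStreamCreate(&s);
  for (int b = 0; b < 2; ++b) {
    cudaHostAlloc(&h_args[b], sizeof(Args), cudaHostAllocDefault); /* pinned memory */
    cudaEventCreate(&copied[b]);
  }

  for (int it = 0; it < 10; ++it) {
    int b = it & 1;
    cudaEventSynchronize(copied[b]);          /* buffer b is no longer in flight        */
    h_args[b]->n_rows = n;                    /* host prepares the next argument block  */
    h_args[b]->alpha  = 0.5;
    cudaMemcpyToSymbolAsync(KernelArgs, h_args[b], sizeof(Args), 0,
                            cudaMemcpyHostToDevice, s); /* latency-bound copy on stream */
    cudaEventRecord(copied[b], s);
    scale<<<(n + 255) / 256, 256, 0, s>>>(d_work);
    /* ... the host is free to prepare the next set of data while the GPU works ... */
  }
  cudaStreamSynchronize(s);
  cudaFree(d_work);
  for (int b = 0; b < 2; ++b) { cudaFreeHost(h_args[b]); cudaEventDestroy(copied[b]); }
  cudaStreamDestroy(s);
  return 0;
}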
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – Results for a single rank
– IBM Minsky server - Lid driven cavity flow – 1.5M-cell grid
Execution time and GPU speedup as a function of the number of OpenMP threads:

  OpenMP threads             | 1      | 2     | 4     | 8
  Wall time CPU (s)          | 57.21  | 43.83 | 34.77 | 30.28
  Solver time CPU (s)        | 49.86  | 37.37 | 29.67 | 25.63
  Wall time CPU+GPU (s)      | 11.87  | 10.83 | 9.55  | 9.32
  Solver time CPU+GPU (s)    | 4.41   | 4.34  | 4.40  | 4.63
  GPU speedup – wall time    | 4.82x  | 4.05x | 3.64x | 3.25x
  GPU speedup – solvers time | 11.29x | 8.60x | 6.74x | 5.53x
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – NVPROF timeline for a single rank
– IBM Minsky server - Lid driven cavity flow – 1.5M-cell grid
[Timeline regions highlighted: Gauss-Seidel; AMG fine grid.]
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – NVPROF timeline for a single rank (cont.)
– IBM Minsky server - Lid driven cavity flow – 1.5M-cell grid
[Timeline region highlighted: AMG coarse grid.]
MPI and GPU acceleration
• Different processes (MPI ranks) will use different CUDA contexts.
• The CUDA implementation serializes CUDA contexts by default.
• NVIDIA Multi-Process Service (MPS) provides context-switching capabilities so that multiple processes can use the same GPU (a minimal per-rank GPU-selection sketch follows the diagram below).
[Diagram: rank 0 defines the visible GPU, starts the MPS server, executes the application and terminates the MPS server; ranks 1 and 2 define the visible GPU and execute the application, all sharing the GPU through the MPS server instance and the GPU driver.]
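A minimal sketch of the "define visible GPU" step (illustrative only – not the Code Saturne launch setup; the MPS control daemon itself is started outside the application, e.g. by the job script): each rank picks its device from its node-local rank, and ranks that select the same device share it through MPS.

#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical helper: map the MPI ranks of a node onto the node's GPUs. */
static void select_gpu(MPI_Comm comm)
{
  MPI_Comm node_comm;
  int node_rank, n_gpus;

  /* Ranks that share a node get consecutive node-local ranks. */
  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
  MPI_Comm_rank(node_comm, &node_rank);

  cudaGetDeviceCount(&n_gpus);
  cudaSetDevice(node_rank % n_gpus);  /* e.g. 5 ranks per GPU on a Minsky node */

  MPI_Comm_free(&node_comm);
}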
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+ – NVPROF timeline for multiple ranks (5 ranks per GPU)
– IBM Minsky server - Lid driven cavity flow – 111M-cell grid
[Timeline regions highlighted: Gauss-Seidel; hiding data-movement latencies.]
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+ – NVPROF timeline for multiple ranks (5 ranks per GPU – cont.)
– IBM Minsky server - Lid driven cavity flow – 111M-cell grid
[Timeline region highlighted: AMG coarse grid.]
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+ – Results for multiple ranks (5 ranks per GPU – cont.)
– IBM Minsky server - Lid driven cavity flow – 111M-cell grid – CPU+GPU efficiency 65% @32 nodes
Execution time and speedup over CPU-only as a function of the number of nodes:

  Nodes                    | 1     | 2     | 4     | 8     | 16    | 32
  CPU wall time (s)        | 717.6 | 369.9 | 187.4 | 100.2 | 54.9  | 28.9
  CPU solvers time (s)     | 693.9 | 358.0 | 181.5 | 97.2  | 53.4  | 28.1
  CPU+GPU wall time (s)    | 300.4 | 153.1 | 80.8  | 45.2  | 26.4  | 14.4
  CPU+GPU solvers time (s) | 274.7 | 139.6 | 74.0  | 41.1  | 24.3  | 13.4
  Speedup – wall time      | 2.39x | 2.42x | 2.32x | 2.22x | 2.08x | 2.00x
  Speedup – solvers time   | 2.53x | 2.57x | 2.45x | 2.37x | 2.20x | 2.10x
POWER8 to POWER9
• A code performing well on POWER8 + P100 GPUs should perform well on POWER9 + V100 GPUs
  – No major code refactoring needed.
  – More powerful CPUs, GPUs and interconnect.
• Some differences to consider:
  – Cores vs. pairs of cores
    • The POWER9 L3 cache and store queue are shared by each pair of cores
    • SMT4 per core or SMT8 per pair of cores
  – V100 (Volta) drops lock-step execution of the threads in a warp
    • One program counter per thread
    • If code assumes lock-step execution, explicit barriers have to be inserted
    • No guarantee threads will converge after divergence within a warp
    • One has to leverage cooperative groups and thread-activity masks (see the snippet below)
[Diagram: ORNL Summit socket (2 sockets per node); CPU and GPUs connected via NVLINK™.]
for (cs_lnum_t ii = StartRow; ii < EndRow; ii += bdimy) {
  // Depending on the number of rows, warps may diverge here
  unsigned AM = __activemask();  // capture the set of threads that are actually active
  …
  // Warp-level reduction restricted to the active threads
  for (cs_lnum_t kk = 1; kk < bdimx; kk *= 2)
    sii += __shfl_down_sync(AM, sii, kk, bdimx);
  …
}
POWER8 to POWER9
• CUDA – Code Saturne 5.0+ – Results for multiple ranks (3 ranks per GPU) – 6 GPUs per node
– IBM Power 9 and NVLINK 2.0 (Summit) - Lid driven cavity flow – 889M-cell grid – CPU+GPU efficiency 76% @512 nodes
Execution time and speedup over CPU-only as a function of the number of nodes:

  Nodes                 | 64    | 256   | 512
  CPU wall time (s)     | 74.73 | 21.04 | 11.16
  CPU+GPU wall time (s) | 32.3  | 7.25  | 4.76
  Speedup – wall time   | 2.31x | 2.90x | 2.34x
POWER9 vs POWER8: Better efficiency when scaling to 16x more nodes for 8x larger problem
CFD and AI
• Cognitive Enhanced Design
  – Designing/prototyping new pieces of equipment is expensive (time and finance)
    • Parameter sweeps need several expensive simulations
    • Want to make decisions faster
    • Make decisions on more complex problems
  – Use cognitive techniques (e.g. Bayesian neural networks) to generate a model, based on a parameterized space, that relates design parameters to performance; use this in Bayesian optimization to improve the design
    • Converge to optimal parameters more quickly
  – Example: airfoil optimization – lift/drag maximization
    • Adaptive Expected Improvement (EI) converges faster and with less variance.
Work package ML1 – Cognitive Enhanced Design. Problem: design/prototyping of new pieces of equipment can be expensive (time and finance); we want to do more work in silico, and also use an 'intelligent' design process. Solution: use cognitive techniques (e.g. Bayesian neural networks) to generate a model based on a parameterized space to relate design parameters to performance, and use this in Bayesian optimization to improve the design.
CFD and AI
• Enhanced 3D-Feature Detection
  – A typical bottleneck of the design process is the analysis step of the simulation-led workflow
    • Extract flow features like: separation, swirl, layering
  – Extend AI techniques to automatically extract features in 3D
    • Remove analysis bottlenecks
    • Semantic querying of simulation data
    • Contextual event classification
    • Computational steering for rare-event simulation
  – Example: racing-car vortex detection
    • AI-enabled feature detection
Work package ML3 – Enhanced 3D-Feature Detection. Problem: one typical bottleneck in the simulation-led workflow is the analysis of the output produced by the simulation itself, especially the identification of flow features (e.g. separation, swirl, layering). Solution: extend deep-feature detection to 3-dimensional problems to remove this bottleneck from the design workflow. Additional extensions are planned for semantic querying of simulation data, contextual event classification, and computational steering for rare-event simulation.
Questions?
samuel.antao@ibm.com

More Related Content

What's hot

OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM Ganesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialGanesan Narayanasamy
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Indrajit Poddar
 
JMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AIJMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AILablup Inc.
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Lablup Inc.
 
Large Model support and Distribute deep learning
Large Model support and Distribute deep learningLarge Model support and Distribute deep learning
Large Model support and Distribute deep learningGanesan Narayanasamy
 
A Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysA Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysTaylor Riggan
 
Transparent Hardware Acceleration for Deep Learning
Transparent Hardware Acceleration for Deep LearningTransparent Hardware Acceleration for Deep Learning
Transparent Hardware Acceleration for Deep LearningIndrajit Poddar
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsAnand Haridass
 

What's hot (20)

OpenPOWER/POWER9 AI webinar
OpenPOWER/POWER9 AI webinar OpenPOWER/POWER9 AI webinar
OpenPOWER/POWER9 AI webinar
 
OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM
 
WML OpenPOWER presentation
WML OpenPOWER presentationWML OpenPOWER presentation
WML OpenPOWER presentation
 
Ac922 cdac webinar
Ac922 cdac webinarAc922 cdac webinar
Ac922 cdac webinar
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
 
JMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AIJMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AI
 
PowerAI Deep dive
PowerAI Deep divePowerAI Deep dive
PowerAI Deep dive
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
Large Model support and Distribute deep learning
Large Model support and Distribute deep learningLarge Model support and Distribute deep learning
Large Model support and Distribute deep learning
 
A Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysA Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate Arrays
 
Transparent Hardware Acceleration for Deep Learning
Transparent Hardware Acceleration for Deep LearningTransparent Hardware Acceleration for Deep Learning
Transparent Hardware Acceleration for Deep Learning
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 

Similar to CFD on Power

Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleSpark Summit
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementGanesan Narayanasamy
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAFacultad de Informática UCM
 
Introduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AIIntroduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AITyrone Systems
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance ObservationsAdam Roberts
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scalesparktc
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkAdamRobertsIBM
 
AWS re:Invent 2016: High Performance Computing on AWS (CMP207)
AWS re:Invent 2016: High Performance Computing on AWS (CMP207)AWS re:Invent 2016: High Performance Computing on AWS (CMP207)
AWS re:Invent 2016: High Performance Computing on AWS (CMP207)Amazon Web Services
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersCastLabKAIST
 
Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerGeorge Markomanolis
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentationtestSri1
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsConnected Data World
 

Similar to CFD on Power (20)

Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
 
Introduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AIIntroduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AI
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance Observations
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
AWS re:Invent 2016: High Performance Computing on AWS (CMP207)
AWS re:Invent 2016: High Performance Computing on AWS (CMP207)AWS re:Invent 2016: High Performance Computing on AWS (CMP207)
AWS re:Invent 2016: High Performance Computing on AWS (CMP207)
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 
Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentation
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
 

More from Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency programGanesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and VerilogGanesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISAGanesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsGanesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsGanesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsGanesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Ganesan Narayanasamy
 
OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction Ganesan Narayanasamy
 

More from Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 
Robustness in deep learning
Robustness in deep learningRobustness in deep learning
Robustness in deep learning
 
Perspectives of Frond end Design
Perspectives of Frond end DesignPerspectives of Frond end Design
Perspectives of Frond end Design
 
A2O Core implementation on FPGA
A2O Core implementation on FPGAA2O Core implementation on FPGA
A2O Core implementation on FPGA
 
OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction
 

Recently uploaded

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Recently uploaded (20)

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

CFD on Power

  • 1. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 1 Task-based GPU acceleration in Computational Fluid Dynamics with OpenMP 4.5 and CUDA in OpenPOWER platforms. OpenPOWER and AI ADG Workshop – BSC, Barcelona, Spain June 2018 Samuel Antao IBM Research, Daresbury, UK
  • 2. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 2 IBM Research @ Daresbury, UK
  • 3. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 3 IBM Research @ Daresbury, UK – STFC Partnership Mission • 2015: £313 million investment over next 5 years • Agreement for IBM Collaborative Research and Development (R&D) that established IBM Research presence in the UK • Product and Services Agreement with IBM UK and Ireland • Access to the latest data-centric and cognitive computing technologies, including IBMs world-class Watson cognitive computing platform • Joint commercialization of intellectual property assets produced in the partnership Help the UK industries and institutions bringing cutting-edge computational science, engineering and applicable technologies, such as data-centric cognitive computing, to boost growth and development of the UK economy
  • 4. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 4 IBM Research @ Daresbury, UK – People 7 Over 26 computational scientists and engineers
  • 5. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 5 IBM Research @ Daresbury, UK – Research areas • Case studies: – Smart Crop Protection - Precision Agriculture • Data science + Life sciences – Improving disease diagnostics and personalised treatments • Life sciences + Machine learning – Cognitive treatment plant • Engineering + Machine learning – Parameterisation of engineering models • Engineering + Machine learning
  • 6. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 6 Task-based GPU acceleration in Computational Fluid Dynamics with OpenMP 4.5 and CUDA in OpenPOWER platforms. OpenPOWER and AI ADG Workshop – BSC, Barcelona, Spain June 2018 Samuel Antao IBM Research, Daresbury, UK
  • 7. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 7 CFD and Algebraic-Multigrid • Solve set of partial-differential equations over several time steps – Discretization: • Unstructured vs Structured – Equations: • Velocity • Pressure • Turbulence • Iterative solvers – Jacobi – Gauss-Seidel – Conjugate Gradient • Multigrid approaches – Solve the problem at different resolutions • Coarse and fine grids/meshes – Less Iterations for fine grids • Algebraic multigrid (AMG) – encode mesh information in algebraic format – Sparse matrices. source: http://web.utk.edu/~wfeng1/research.html
  • 8. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 8 CFD and Algebraic-Multigrid source: http://web.utk.edu/~wfeng1/research.html NVLINKTM NVLINKTM NVLINKTM NVLINKTM InfiniBandTM MPI rank • Grid partitioned by MPI ranks • Ranks distributed by nodes • More than one rank executing in one node • Challenges: – Different grids have different compute needs – Access strides vary, unstructured data accesses. – CPU-GPU data movements – Regular communication between ranks • Halo elements • Residuals • Synchronizations
  • 9. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 9 CFD and Algebraic-Multigrid – Code Saturne • Open-source – developed and maintained by EDF • 350K lines of code: – 50% C – 37% Fortran – 13% Python • Rich ecosystem to configure/parameterise simulations, generate meshes • History of good scalability Cores Time in Solver Efficiency 262,144 789.79 s - 524,288 403.18 s 97% MPI Tasks Time in Solver Efficiency 524,288 70.114 s - 1,048,576 52.574 s 66% 1,572,864 45.731 s 76% 105B Cell Mesh (MIRA, BGQ) 13B Cell Mesh (MIRA, BGQ)
  • 10. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 10 CFD and Algebraic-Multigrid – Execution time distribution • Many components (kernels) contribute to total execution time • There are data dependencies between consecutive kernels • There are opportunities to keep data in the device between kernels • Some kernels may have lower compute intensity, it could still be worthwhile computing them in the GPU if the data is already there Gauss-Seidel solver (Velocity) Other Matrix-vector mult. MSR Matrix-vector mult. CSR Dot products Multigrid setup Compute coarse cells from fine cells Other AMG-related Pressure (AMG) Single thread profiling - Code Saturne 5.0+
  • 11. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 11 Directive-based programming models • Porting existing code to accelerators is time consuming… • The sooner we have code running in the GPU the sooner you can start … – … learning where overheads are – … identifying what data patterns are being used – … spotting kernels performing poorly – … making decisions on what strategies can be used to improve performance • Directive-based programming models can get you started much quicker – Don’t need to bother about device memory allocation and data pointers – Implementation defaults already exploiting device features – Easily create data environments where data resides in the GPU – Improve your code portability • Clang C/C++ and IBM XL C/C++/Fortran compiler provide OpenMP 4.5 support • PGI C/C++/Fortran compiler provide OpenACC support • Can be complemented with existing GPU accelerated libraries – cuSparse – AMGx XL clang
  • 12. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 12 Directive-based programming models • OpenMP 4.5 data environments – Code Saturne 5.0+ snippet – Conjugate Gradient IBM Confidential GPU acceleration in Code Saturne 1 static cs_sles_convergence_state_t _conjugate_gradient (/* ... */) 2 { 3 # pragma omp target data if (n_rows > GPU_THRESHOLD ) 4 /* Move result vector to device and copied it back at the ned of the scope */ 5 map(tofrom:vx[: vx_size ]) 6 /* Move right -hand side vector to the device */ 7 map(to:rhs [: n_rows ]) 8 /* Allocate all auxiliary vectors in the device */ 9 map(alloc: _aux_vectors [: tmp_size ]) 10 { 11 12 /* Solver code */ 13 14 } 15 } Listing 2: OpenMP 4.5 data environment for a level of the AMG solver. during the computation of a level so it can be copied to the device at the beginning of the level. The result vector can also be kept in the device for a significant part of the execution, and only has to be copied to the host during halo exchange. OpenMP 4.5 makes managing the data according to the aforementioned observations almost trivial: a single directive su ces to set the scope - see Listing 2. Each time halos All arrays reside in the device in this scope! The programming model manages host/device pointers mapping for you!
  • 13. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 13 Directive-based programming models • OpenMP 4.5 target regions – Code Saturne 5.0+ snippet – Dot products 6 vx[ii] += (alpha * dk[ii]); 7 rk[ii] += (alpha * zk[ii]); 8 } 9 10 /* ... */ 11 } 12 13 /* ... */ 14 15 static void _cs_dot_xx_xy_superblock (cs_lnum_t n, 16 const cs_real_t *restrict x, 17 const cs_real_t *restrict y, 18 double *xx , 19 double *xy) 20 { 21 double dot_xx = 0.0, dot_xy = 0.0; 22 23 # pragma omp target teams distribute parallel for reduction (+: dot_xx , dot_xy) 24 if ( n > GPU_THRESHOLD ) 25 map(to:x[:n],y[:n]) 26 map(tofrom:dot_xx , dot_xy) 27 for (cs_lnum_t i = 0; i < n; ++i) { 28 const double tx = x[i]; 29 const double ty = y[i]; 30 dot_xx += tx*tx; 31 dot_xy += tx*ty; 32 } 33 34 /* ... */ 35 36 *xx = dot_xx; 37 *xy = dot_xy; 38 } Listing 3: Example of GPU port for two stream kernels: vector multiply-and-add and dot product . … Host … … CUDA blocks OpenMP team Allocate data in the the device. Host Release data in the the device. OpenMP runtime library OpenMP runtime library
  • 14. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 14 Directive-based programming models • OpenMP 4.5 – Code Saturne 5.0+ – AMG NVPROF timeline AMG cycle AMG coarse grid detail Allocations of small variables High kernel launch latency Back-to-back kernels
  • 15. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 15 CUDA-based tuning • Avoid expensive GPU memory allocation/deallocation: – Allocate a memory chunk once and reuse it • Use pinned memory for data copied frequently to the GPU – Avoid pageable-pinned memory copies by the CUDA implementation • Explore asynchronous execution of CUDA API calls – Start copying data to/from the device while the host is preparing the next set of data or the next kernel • Use CUDA constant memory to copy arguments for multiple kernels at once. – The latency of copying tens of KB to the GPU is similar to copy 1B – Dual-buffering enable copies to happen asynchronously • Produce specialized kernels instead of relying on runtime checks. – CUDA is a C++ extension and therefore kernels and device functions can be templated. – Leverage compile-time optimizations for the relevant sequences of kernels. – NVCC toolchain does very aggressive inlining. – Lower register pressure = more occupancy. IBM Confidential GPU acceleration in Cod 1 template < KernelKinds Kind > 2 __device__ int any_kernel ( KernelArgsBase &Arg , unsigned n_rows_per_block ) { 3 switch(Kind) { 4 /* ... */ 5 // Dot product: 6 // 7 case DP_xx: 8 dot_product <Kind >( 9 /* version */ Arg.getArg <cs_lnum_t >(0), 10 /* n_rows */ Arg.getArg <cs_lnum_t >(1), 11 /* x */ Arg.getArg <cs_real_t * >(2), 12 /* y */ nullptr , 13 /* z */ nullptr , 14 /* res */ Arg.getArg <cs_real_t * >(3), 15 /* n_rows_per_block */ n_rows_per_block ); 16 break; 17 /* ... */ 18 } 19 __syncthreads (); 20 return 0; 21 } 22 23 template < KernelKinds ... Kinds > 24 __global__ void any_kernels (void) { 25 26 auto *KA = reinterpret_cast < KernelArgsSeries *>(& KernelArgsSeriesGPU [0]); 27 const unsigned n_rows_per_block = KA -> RowsPerBlock ; 28 unsigned idx = 0; 29 30 int dummy [] = { any_kernel <Kinds >(KA ->Args[idx ++], n_rows_per_block )... }; 31 (void) dummy; 32 } Listing 10: Device entry-point function for kernel execution.
  • 16. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 16 CUDA-based tuning
    • CUDA – Code Saturne 5.0+
      – Results for a single rank – IBM Minsky server – Lid-driven cavity flow – 1.5M-cell grid

    Execution time (seconds) vs OpenMP threads:
      OpenMP threads              1      2      4      8
      Wall time CPU           57.21  43.83  34.77  30.28
      Solver time CPU         49.86  37.37  29.67  25.63
      Wall time CPU+GPU       11.87  10.83   9.55   9.32
      Solver time CPU+GPU      4.41   4.34   4.40   4.63

    GPU speedup over CPU-only (1x):
      OpenMP threads              1      2      4      8
      Wall time                4.82   4.05   3.64   3.25
      Solvers time            11.29   8.60   6.74   5.53
  • 17. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 17 CUDA-based tuning
    • CUDA – Code Saturne 5.0+
      – NVPROF timeline for a single rank – IBM Minsky server – Lid-driven cavity flow – 1.5M-cell grid
      (Timeline annotations: Gauss-Seidel, AMG fine grid.)
  • 18. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 18 CUDA-based tuning
    • CUDA – Code Saturne 5.0+
      – NVPROF timeline for a single rank (cont.) – IBM Minsky server – Lid-driven cavity flow – 1.5M-cell grid
      (Timeline annotation: AMG coarse grid.)
  • 19. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 19 MPI and GPU acceleration
    • Different processes (MPI ranks) will use different CUDA contexts.
    • The CUDA implementation serializes CUDA contexts by default.
    • NVIDIA Multi-Process Service (MPS) provides context-switching capabilities so that multiple processes can use the same GPU.
      (Diagram: the MPS server instance sits between the ranks and the GPU driver. Rank 0: define visible GPU, start MPS server, execute application, terminate MPS server. Ranks 1 and 2: define visible GPU, execute application.)
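    On the application side, each rank only has to pick a device before its first CUDA call; once the MPS control daemon (nvidia-cuda-mps-control) is running on the node, ranks that select the same device share it through the MPS server. The sketch below is illustrative rather than Code Saturne's code; the helper name and the node-local communicator argument are assumptions:

      #include <mpi.h>
      #include <cuda_runtime.h>

      /* Illustrative sketch: map each MPI rank on a node to one of the node's
       * GPUs. With MPS running, several ranks mapping to the same device
       * (e.g. 5 ranks per GPU on a Minsky node) share it through the MPS server. */
      void select_gpu_for_rank(MPI_Comm node_comm) {
        int local_rank = 0, n_gpus = 0;
        MPI_Comm_rank(node_comm, &local_rank);  /* rank index within the node  */
        cudaGetDeviceCount(&n_gpus);            /* GPUs visible to this process */
        cudaSetDevice(local_rank % n_gpus);     /* round-robin ranks over GPUs  */
      }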
  • 20. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 20 MPI and GPU acceleration
    • CUDA – Code Saturne 5.0+
      – NVPROF timeline for multiple ranks (5 ranks per GPU) – IBM Minsky server – Lid-driven cavity flow – 111M-cell grid
      (Timeline annotations: Gauss-Seidel; hiding data-movement latencies.)
  • 21. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 21 MPI and GPU acceleration
    • CUDA – Code Saturne 5.0+
      – NVPROF timeline for multiple ranks (5 ranks per GPU – cont.) – IBM Minsky server – Lid-driven cavity flow – 111M-cell grid
      (Timeline annotation: AMG coarse grid.)
  • 22. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 22 MPI and GPU acceleration
    • CUDA – Code Saturne 5.0+
      – Results for multiple ranks (5 ranks per GPU – cont.) – IBM Minsky server – Lid-driven cavity flow – 111M-cell grid
      – CPU+GPU efficiency 65% @ 32 nodes

    Execution time (seconds) vs nodes:
      Nodes                       1      2      4      8     16     32
      CPU wall time           717.6  369.9  187.4  100.2   54.9   28.9
      CPU solvers time        693.9  358.0  181.5   97.2   53.4   28.1
      CPU+GPU wall time       300.4  153.1   80.8   45.2   26.4   14.4
      CPU+GPU solvers time    274.7  139.6   74.0   41.1   24.3   13.4

    Speedup over CPU-only (1x):
      Nodes                       1      2      4      8     16     32
      Wall time                2.39   2.42   2.32   2.22   2.08   2.00
      Solvers time             2.53   2.57   2.45   2.37   2.20   2.10
  • 23. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 23 POWER8 to POWER9
    • A code performing well on POWER8 + P100 GPUs should perform well on POWER9 + V100 GPUs
      – No major code refactoring needed.
      – More powerful CPUs, GPUs and interconnect.
    • Some differences to consider:
      – Core vs pairs of cores
        • On POWER9 the L3 cache and the store queue are shared by each pair of cores
        • SMT4 per core or SMT8 per pair of cores
      – V100 (Volta) drops lock-step execution of the threads in a warp
        • One program counter per thread
        • If code assumes lock-step execution, explicit barriers have to be inserted
        • No guarantee threads will reconverge after divergence within a warp
        • One has to leverage cooperative groups and thread activity masks, e.g.:

          for (cs_lnum_t ii = StartRow; ii < EndRow; ii += bdimy) {
            // Depending on the number of rows, warps may diverge here
            unsigned AM = __activemask();
            ...
            for (cs_lnum_t kk = 1; kk < bdimx; kk *= 2)
              sii += __shfl_down_sync(AM, sii, kk, bdimx);
            ...
          }

    (Diagram: ORNL Summit socket, 2 sockets per node, with NVLINK connections between the POWER9 CPU and its V100 GPUs.)
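    As a self-contained illustration of the activity-mask pattern (a minimal sketch, not a Code Saturne kernel; the kernel name and reduction shape are assumptions), the partial sum below builds the participation mask with __ballot_sync before the warp diverges, so the shuffle only ever consumes values from lanes that are named in the mask:

      #include <cuda_runtime.h>

      /* Minimal sketch: sum n doubles, safe under Volta's independent thread
       * scheduling. The mask is agreed on by the whole warp before any lane
       * returns early, and shuffled values are only consumed from lanes that
       * are actually in the mask. */
      __global__ void partial_sum(const double *x, double *out, int n) {
        const int i    = blockIdx.x * blockDim.x + threadIdx.x;
        const int lane = threadIdx.x & 31;

        /* Every lane of the warp votes, so all surviving lanes see the same mask. */
        const unsigned mask = __ballot_sync(0xffffffffu, i < n);
        if (i >= n) return;                       /* tail lanes leave the warp here */

        double v = x[i];
        const int nactive = __popc(mask);         /* active lanes are 0..nactive-1  */
        for (int off = 1; off < nactive; off <<= 1) {
          const double other = __shfl_down_sync(mask, v, off);
          if (lane + off < nactive)               /* ignore lanes outside the mask  */
            v += other;
        }
        if (lane == 0)
          atomicAdd(out, v);                      /* double atomicAdd: sm_60+       */
      }

    The same code also runs on pre-Volta GPUs, whereas code that relies on implicit warp-synchronous behaviour (the old non-_sync shuffles) may silently break on V100.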
  • 24. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 24 POWER8 to POWER9
    • CUDA – Code Saturne 5.0+
      – Results for multiple ranks (3 ranks per GPU) – 6 GPUs per node
      – IBM POWER9 and NVLINK 2.0 (Summit) – Lid-driven cavity flow – 889M-cell grid
      – CPU+GPU efficiency 76% @ 512 nodes

    Execution time (seconds) and speedup over CPU-only (1x) vs nodes:
      Nodes                  64    256    512
      CPU wall time       74.73  21.04  11.16
      CPU+GPU wall time   32.3    7.25   4.76
      Wall-time speedup    2.31   2.90   2.34

    POWER9 vs POWER8: better efficiency when scaling to 16x more nodes for an 8x larger problem.
  • 25. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 25 CFD and AI
    • Cognitive Enhanced Design (work package ML1)
      – Designing / prototyping new pieces of equipment is expensive (time and finance)
        • Parameter sweeps need several expensive simulations
        • Want to make decisions faster, and on more complex problems
        • Want to do more of the work in silico, with an 'intelligent' design process
      – Use cognitive techniques (e.g. Bayesian neural networks) to build a model over a parameterized space that relates design parameters to performance, and use it in Bayesian optimization to improve the design
        • Converge to the optimal parameters more quickly
      – Example: airfoil optimization: lift/drag maximization
        • Adaptive Expected Improvement (EI) converges faster and with less variance
  • 26. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 26 CFD and AI
    • Enhanced 3D-Feature Detection (work package ML3)
      – A typical bottleneck of the simulation-led design workflow is the analysis of the simulation output, in particular the identification of flow features such as:
        • separation
        • swirl
        • layering
      – Extend deep-feature detection techniques to automatically extract such features in 3D
        • Remove the analysis bottleneck from the design workflow
        • Semantic querying of simulation data
        • Contextual event classification
        • Computational steering for rare-event simulation
      – Example: racing-car vortex detection with AI-enabled feature detection
  • 27. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 27 CFD and AI
    • Enhanced 3D-Feature Detection (cont.)
      – Example: racing-car vortex detection with AI-enabled feature detection
  • 28. © 2018 IBM Corporation Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA in OpenPOWER platforms. 28 Questions? samuel.antao@ibm.com