Fine grained asynchronism for pseudo-spectral codes - with application to turbulence
Kiran Ravikumar¹, David Appelhans², P. K. Yeung¹
kiran.r@gatech.edu
¹Georgia Institute of Technology, ²IBM Research
3rd OpenPOWER Academia Discussion Group Workshop
Dallas, TX; November 10, 2018
INTRODUCTION AND MOTIVATION
Petascale computing mainly via massive distributed parallelism
Extreme core counts, w/ or w/o some shared memory (MPI+OpenMP)
Scalability often ultimately limited by communication costs
Now in pre-Exascale era, heterogeneous architectures are dominant
Accelerators provide most of the computing power, at the cost of non-trivial data movements
But communication may still be an issue (perhaps even more so)
On the largest machines, adaptation to hardware (e.g. GPUs) is a must
How do we do it, on Summit? [for turbulence, our domain science]
Multidimensional FFT provides a case study of wide relevance
Fine-grained asynchronism: while managing memory and data movements
OUTLINE OF THIS TALK
Turbulence and the need for extreme scale computing
– pseudo-spectral codes and Fourier transforms
How to make use of GPUs effectively (specifically, on Summit)
– while taking full advantage of large CPU memory (fat nodes)
– optimizing data movements between CPU and GPU
– asynchronism at a fine-grained level, using CUDA Fortran
Performance recorded on Summit Phase 1 (to date, up to 1024 nodes)
Other thoughts: such as optimizing communication
WHY DO WE NEED HUGE TURBULENCE SIMULATIONS?
Turbulence is found everywhere in nature and engineering
Disorderly fluctuations over a wide range of scales in 3D space and time
To advance understanding, obtain numerical solutions of governing equations
Capture wide range of scales, resolve small scales better than in the past, cover parameter regimes previously relatively unexplored, etc.
DIRECT NUMERICAL SIMULATIONS (DNS)
Compute all scales, according to (here incompressible) Navier-Stokes equations
∇ · u = 0
∂u/∂t + u · ∇u = −∇(p/ρ) + ν∇²u + f
Study fundamental behavior of turbulent flows using simplified geometries
— 3D Fourier decomposition on periodic domain appropriate for the small scales
Pseudo-spectral in space for FT of nonlinear terms: instead of expensive convolution sums, perform multiplication in physical space; take FFTs back and forth
Explicit second order Runge-Kutta time stepping scheme
On Blue Waters at NCSA: production simulation at 8192³ grid resolution
(Yeung et al. PNAS 2015), with some short tests at 12288³ and 16384³
INCITE 2019 on Summit: our target is 18432³, i.e. O(6) trillion grid points
HOW TO COMPUTE 3D FFTS?
Transforms between physical (x, y, z) and wavenumber (kx, ky, kz) spaces, any variable φ
φ(x) = ∑_k φ̂(k) exp(ik · x) ;   φ̂(k) = ∑_x φ(x) exp(−ik · x)
1D FFTs, 1 direction at a time, but complete lines of data must be available
— use FFTW (on CPUs) or cuFFT (on GPUs)
Domain decomposition (1D or 2D) among MPI processes
— up to N vs N² MPI tasks allowed
Collective communication (of “alltoall” type) needed to complete full 3D FFT cycle
— protocols for faster communication are of broad interest
Use of strided (non-contiguous) arrays may require packing and unpacking, which are local (serial) in memory but still time consuming
DOMAIN DECOMPOSITION: 1D OR 2D
1D: Each MPI rank holds a slab of size N × N × N/P, i.e. a collection of N/P planes of data
— one global transpose (e.g. x−y to x−z)
2D: Pencils, of size N × N/P1 × N/P2 where P1 × P2 = P, define a 2D Cartesian process grid (whose shape matters)
— two transposes, within row and column communicators
Massive distributed memory parallelism favors Pencils
Fatter node architectures : return to 1D decomposition?
– Fewer MPI calls and associated pack and unpack operations
– Fewer nodes (and MPI ranks) involved in communication
– Large host memory allows for many slabs per node
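A rough accounting (ours, not on the slide) of what each rank exchanges for an N³ grid on P = P1 × P2 ranks:
1D (slabs) : one transpose; each rank exchanges about N³/P² words with each of the other P − 1 ranks
2D (pencils) : two transposes; about N³/(P·P1) words with each of P1 − 1 row peers, then about N³/(P·P2) words with each of P2 − 1 column peers
With few, fat ranks (small P) the single slab transpose keeps messages large and halves the number of pack/unpack and communication rounds.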
[Figure: slab decomposition (ranks P0–P3 each holding x-y slabs stacked in z) vs. pencil decomposition (ranks P0–P3 arranged on a 2D process grid)]
MULTI-THREADED SYNCHRONOUS CPU ALGORITHM
FFT : (kx, ky, kz) → (kx, y, kz)
Pack
MPI Alltoall
Unpack
FFT : (kx, y, kz) → (x, y, z)
Similar for (x, y, z) → (kx, ky, kz)
FFTW library to carry out 1-D transforms
OpenMP threads to speed up FFT, pack & unpack (a minimal sketch of one transpose step follows the figure below)
[Figure: x-y slabs → A2A → x-z slabs → A2A → x-y slabs]
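A minimal sketch (ours, not the production code) of one pack → alltoall → unpack transpose step for the slab decomposition; the routine and array names (transpose_xy_to_xz, u_xy, u_xz, send_buf, recv_buf) are illustrative, and the grid size n is assumed divisible by the rank count np. The !$omp directives correspond to the "OpenMP threads to speed up pack & unpack" point above.

! Sketch only: x-y slabs (all x, all y, n/np local z-planes) to x-z slabs
! (all x, n/np local y-lines, all z). Assumes n is divisible by np.
subroutine transpose_xy_to_xz(u_xy, u_xz, n, np, comm)
  use mpi
  implicit none
  integer, intent(in)  :: n, np, comm              ! grid size, #ranks, communicator
  real(4), intent(in)  :: u_xy(n, n, n/np)
  real(4), intent(out) :: u_xz(n, n/np, n)
  real(4), allocatable :: send_buf(:,:,:,:), recv_buf(:,:,:,:)
  integer :: p, x, y, z, ierr

  allocate(send_buf(n, n/np, n/np, np), recv_buf(n, n/np, n/np, np))

  ! Pack: block p collects the y-range owned by rank p in the x-z decomposition
  !$omp parallel do private(x,y,z)
  do p = 1, np
     do z = 1, n/np
        do y = 1, n/np
           do x = 1, n
              send_buf(x, y, z, p) = u_xy(x, (p-1)*(n/np) + y, z)
           end do
        end do
     end do
  end do

  call MPI_ALLTOALL(send_buf, n*(n/np)*(n/np), MPI_REAL, &
                    recv_buf, n*(n/np)*(n/np), MPI_REAL, comm, ierr)

  ! Unpack: block p carries the z-planes that rank p owned in the x-y decomposition
  !$omp parallel do private(x,y,z)
  do p = 1, np
     do z = 1, n/np
        do y = 1, n/np
           do x = 1, n
              u_xz(x, y, (p-1)*(n/np) + z) = recv_buf(x, y, z, p)
           end do
        end do
     end do
  end do

  deallocate(send_buf, recv_buf)
end subroutine transpose_xy_to_xz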
SYNCHRONOUS GPU ALGORITHM
H2D copy of complete slab
FFT : (kx, ky, kz) → (kx, y, kz)
Pack
Device to Host
MPI Alltoall
Host to Device
Unpack
FFT : (kx, y, kz) → (x, y, z)
Similar for (x, y, z) → (kx, ky, kz)
Steps involved in the transform from wavenumber to physical space are shown (see the cuFFT sketch below)
Entire slab of data is copied from host to device
Pack and unpack data on GPU
– Additional buffers occupy limited GPU memory
– Faster than on CPU
Similar operations are performed to transform from physical to wavenumber space, with the last step being a device-to-host copy of the entire slab
How to handle problem sizes that do not fit on the GPU?
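For the 1-D transforms on the device, the cuFFT Fortran module shipped with the PGI/NVIDIA compilers can be used. A minimal, hedged sketch (plane size and array names are illustrative, not the production setup):

! Sketch: batched 1-D C2C FFTs along x for one x-y plane resident on the GPU.
program cufft_plane_sketch
  use cudafor
  use cufft
  implicit none
  integer, parameter :: nx = 1024, ny = 1024        ! example plane dimensions
  complex(4), allocatable         :: h_plane(:,:)
  complex(4), device, allocatable :: d_plane(:,:)
  integer :: plan, ierr

  allocate(h_plane(nx, ny), d_plane(nx, ny))
  h_plane = (1.0, 0.0)
  d_plane = h_plane                                  ! host-to-device copy

  ierr = cufftPlan1d(plan, nx, CUFFT_C2C, ny)        ! ny transforms of length nx
  ierr = cufftExecC2C(plan, d_plane, d_plane, CUFFT_FORWARD)
  ierr = cudaDeviceSynchronize()

  h_plane = d_plane                                  ! device-to-host copy
  ierr = cufftDestroy(plan)
  deallocate(h_plane, d_plane)
end program cufft_plane_sketch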
HOW CAN SUMMIT HELP?
200 PF : Fastest supercomputer in the world
4608 nodes, each with 2 IBM POWER9 CPUs & 6 NVIDIA Volta GPUs
512 GB host memory per node & 16 GB GPU memory per GPU : large problem size
50 GB/s peak CPU-GPU NVLink B/W : faster copies
23 GB/s node injection B/W with non-blocking fat-tree topology : faster communication
Enables 18432³ DNS using 3072 nodes
Run massive problem sizes on fewer nodes than other machines currently available!
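A rough estimate (ours, not on the slide) of why the large host memory matters at the target size:
18432³ × 4 B ≈ 2.5 × 10¹³ B ≈ 25 TB per single-precision field, i.e. about 8.2 GB per node per field on 3072 nodes
Several such fields plus transpose buffers fit comfortably in 512 GB of host memory per node, but would strain the 6 × 16 GB = 96 GB of GPU memory, which is consistent with keeping the arrays host-resident and streaming a few planes at a time through the GPU (next slides).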
ASYNCHRONOUS GPU ALGORITHM
Operations in the same row are asynchronous:
   planes 1 → mz : H2D | Compute | D2H (Pack) | A2A
   MPI_WAITALL
   planes 1 → my : H2D (Unpack) | Compute | D2H
[Figure: x-y slab of mz planes, colored to match the operations in the chart]
Only 3 planes on GPU
Problem size not limited by GPU memory (can also split planes)
Overlap operations across 3 planes
CUDA streams for async progress
Compute (compute stream) & NVLink (transfer stream) overlapped with MPI
ASYNCHRONOUS EXECUTION USING CUDA FORTRAN
First plane is copied in before entering the following loop.
do zp=1,mz
   next = mod(zp+1,3); current = mod(zp,3); previous = mod(zp-1,3); comm = mod(zp-2,3)
   ! if any of the above are zero, set to 3; negative cases (first iterations) are not considered

   ! Host->Device copy of plane zp+1 (NEXT) using cudaMemcpyAsync or similar in transfer stream
   ierr = cudaEventRecord(HtoD(next), trans_stream)

   ierr = cudaStreamWaitEvent(comp_stream, HtoD(current), 0)
   ! Compute on plane zp (CURRENT) in compute stream
   ierr = cudaEventRecord(compute(current), comp_stream)

   ierr = cudaStreamWaitEvent(trans_stream, compute(previous), 0)
   ! Device->Host copy of plane zp-1 (PREVIOUS) using cudaMemcpyAsync or similar in transfer stream
   ierr = cudaEventRecord(DtoH(previous), trans_stream)

   ierr = cudaEventSynchronize(DtoH(comm))
   ! MPI Alltoall on plane zp-2, copied out in the previous iteration
end do
The last plane is copied out after the loop, and the MPI alltoalls on the last two planes are then performed.
NON-CONTIGUOUS DATA COPIES
Non-contiguous memory copy b/w host & device
Pack and transfer
Many instances of cudaMemCpy, each with a small contiguous message - high overhead
[Figure: host array u(nx, nz) of which only the first nxf = nx/4 values per z-line are needed; they are gathered into the coalesced device array d_u(nxf, nz), i.e. u(1:nxf, :) ←→ d_u(:, :)]
Is there a faster way?
Two candidates: zero copy and cudaMemCpy2DAsync

Zero copy (D. Appelhans, GTC 2018) - accessing host-resident pinned memory directly from the GPU, without having to copy the data to the device beforehand (i.e. there are zero device copies)

cudaMemCpy2DAsync - a single strided asynchronous copy:
   ! cudaMemCpy2DAsync(dst, dpitch, src, spitch, width, height, kind, stream)
   call cudaMemCpy2DAsync(d_u, nxf, u, nx, nxf, nz, kind, stream)
ZERO COPY AND CUDAMEMCPY2DASYNC
[Plot: H2D and D2H transfer time (ms, 0–250) vs. coalesced message size (6 KB to 1.5 MB); symbols - zero copy, dashed - cudaMemCpy2DAsync, compared against multiple cudaMemCpyAsync calls]
Zero copy preferred when coalesced message size is small (< 10 KB for 18432³)
16 thread blocks are sufficient to get maximum NVLink throughput.
GPU resources can be used for computation instead.
Ideally, use cudaMemCpy2DAsync to avoid using up GPU resources.
IS THERE ASYNCHRONOUS EXECUTION?
NVPROF timeline for async. GPU code using zero copy for one Runge-Kutta substep
[NVPROF timeline: GPU computation stream (bottom) and NVLINK transfer stream (top, almost continuous); gaps marked MPI_WAITALL where further operations must wait on the previous alltoall to complete]
Zero copy kernels (dark green) also incur significant cost.
Computations begin once the data transfer on the previous buffer is done.
PERF. OF MULTI-THREADED CPU & ASYNC. GPU CODES
Problem size and run configuration
2 nodes - 1536³
16 nodes - 3072³
128 nodes - 6144³
1024 nodes - 12288³
6 MPI tasks per node & 7 threads per task
[Plots: fractions of time per step vs. #nodes (2, 16, 128, 1024). CPU code: t_mpi/t_total and t_comp/t_total. GPU code: t_mpi/t_total and t_(comp+copy)/t_total.]
Computations are accelerated using GPUs, while the communication percentage increases
SCALING & SPEEDUP OVER MULTI-THREADED CPU CODE
Perf. of the async GPU and CPU DNS codes on Summit Phase 1 (Early Science Proposal (ESP))
[Plot: time per step (s) per node vs. # nodes, log-log, for CPU and GPU codes; dashed line denotes ideal weak scaling; GPU-to-CPU speedups labeled: 3.65X (2 nodes), 2.90X (16), 3.26X (128), 3.50X (1024)]
# nodes   Prob. size   (a) Time (s)   (b) Time (s)
      2        1536³          31.70           8.67
     16        3072³          35.69          12.32
    128        6144³          46.68          14.34
   1024       12288³          77.60          22.15

(a) - CPU ; (b) - Async GPU using zero copy
DOES CUDAMEMCPY2DASYNC IMPROVE PERFORMANCE?
[NVPROF timelines over a 60 ms window showing the asynchronous behavior of the GPU code; rows show NVLINK, Compute, Zerocopy-D2H, Zerocopy-H2D, Unpack, and cudaMemCpy2DAsync activity. Top - (a) using zerocopy kernels. Bottom - (b) using cudaMemCpy2DAsync instead of the D2H zerocopy kernel.]
(a) : GPU resources shared b/w zerocopy & compute kernels, slowing down both
(b) : GPU resources available for compute, speeding up data transfers and compute
(b) : Some zero copy kernels are still needed for transfers with a more complicated striding pattern, with minimal performance effect
PERFORMANCE IMPROVEMENTS & PROJECTIONS
Perf. of async. GPU using cudaMemCpy2DAsync instead of some zerocopy kernels.
[Plot: time per step (s) per node vs. # nodes, log-log, for CPU, GPU-ESP, GPU-Modified, and GPU-Projection; speedups over CPU labeled: 6.25X (2 nodes), 4.68X (16), 5.56X (128), 5.99X (1024)]
# nodes   Prob. size   (a) Time (s)   (b) Time (s)
      2        1536³           8.67           5.07
     16        3072³          12.32           7.63
    128        6144³          14.34           8.39
   1024       12288³          22.15          12.96

(a) - using zero copy ; (b) - using cudaMemCpy2DAsync
The projected performance of the code for 128 and 1024 nodes is estimated assuming the same weak scaling performance as seen in the earlier runs (GPU-ESP).
MPI ALLTOALL COLLECTIVE PERFORMANCE
Environment variables :
PAMI_ENABLE_STRIPING=0
– Disable splitting data to network devices
PAMI_IBV_ADAPTER_AFFINITY=1
– Use nearest network device
PAMI_IBV_DEVICE_NAME="mlx5_0:1"
– Use this device if on socket 0
PAMI_IBV_DEVICE_NAME_1="mlx5_3:1"
– Use this device if on socket 1
> 10% perf. improvement compared to default environment
Eff. BW = 2 × (size of sndbuf in GB) × tpn / time
        = [ 2 × N × (N/P) × P × (N/P) × 4 × tpn ] / ( time × 1024³ )

2 - read + write ; 4 - bytes per MPI_REAL word ; 1024³ - bytes per GB ; tpn - tasks per node
(A small sketch of this calculation follows the timing loop below.)
do istep=1,nsteps

   call MPI_BARRIER(MPI_COMM_WORLD,ierr)
   rtime1 = MPI_WTIME()
   call MPI_ALLTOALL(sndbuf, nx*my*mz, &
                     MPI_REAL, rcvbuf, nx*my*mz, &
                     MPI_REAL, MPI_COMM_WORLD, ierr)
   rtime2 = MPI_WTIME()
   time_step(istep) = rtime2 - rtime1

end do
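A small sketch (ours) of the effective-bandwidth calculation defined above, applied to the measured time; all values (n, p, tpn, t_med) are illustrative placeholders rather than measured data:

! Minimal sketch of the effective node bandwidth from the measured alltoall time.
program eff_bw_sketch
  implicit none
  integer :: n, p, tpn
  real(8) :: t_med, eff_bw
  real(8), parameter :: bytes_per_word = 4.0d0   ! MPI_REAL

  n = 12288                 ! global grid size N (example)
  p = 6144                  ! total MPI tasks P (example)
  tpn = 6                   ! tasks per node
  t_med = 1.0d0             ! measured alltoall time in seconds (e.g. median of time_step)

  ! Eff. BW = 2 * (local slab in bytes) * tpn / (time * 1024^3), in GB/s per node
  eff_bw = 2.0d0 * dble(n) * dble(n) * (dble(n)/dble(p)) * bytes_per_word * dble(tpn) &
           / (t_med * 1024.0d0**3)
  print '(a,f8.2,a)', 'Effective node bandwidth: ', eff_bw, ' GB/s'
end program eff_bw_sketch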
HOW TO IMPROVE ALLTOALL PERFORMANCE?
Tested different methods
Blocking MPI Alltoall
Non-blocking MPI Alltoall
Non-blocking sends and receives
One-sided MPI
Hierarchical method
Hierarchical Alltoall (A2A)
A2A on Node (Row communicator)
Pack to form contiguous messages
A2A across Node (Column comm.)
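A minimal sketch (ours, not the production code) of the hierarchical variant just described. It assumes the on-node (row) and cross-node (column) communicators were created with MPI_COMM_SPLIT, that global ranks are node-major with the on-node index fastest, and that all names (hierarchical_a2a, node_comm, cross_comm, stage_s, stage_r, packed) are illustrative:

! Hierarchical alltoall sketch: A2A within the node, repack, A2A across nodes.
subroutine hierarchical_a2a(sendbuf, recvbuf, count, node_comm, cross_comm, ppn, nnodes)
  use mpi
  implicit none
  integer, intent(in)  :: count, node_comm, cross_comm, ppn, nnodes
  ! sendbuf(:,g) holds the block for global rank g = (node-1)*ppn + local
  real(4), intent(in)  :: sendbuf(count, ppn*nnodes)
  real(4), intent(out) :: recvbuf(count, ppn*nnodes)
  real(4), allocatable :: stage_s(:,:,:), stage_r(:,:,:), packed(:,:,:)
  integer :: n, l, ierr

  allocate(stage_s(count, nnodes, ppn), stage_r(count, nnodes, ppn), packed(count, ppn, nnodes))

  ! Pre-pack: group blocks by destination local rank, so each on-node peer
  ! receives one contiguous message of nnodes blocks.
  do l = 1, ppn
     do n = 1, nnodes
        stage_s(:, n, l) = sendbuf(:, (n-1)*ppn + l)
     end do
  end do

  ! Step 1: alltoall among the ppn ranks on this node (row communicator)
  call MPI_ALLTOALL(stage_s, count*nnodes, MPI_REAL, &
                    stage_r, count*nnodes, MPI_REAL, node_comm, ierr)

  ! Step 2: pack so blocks bound for the same remote node are contiguous
  do n = 1, nnodes
     do l = 1, ppn
        packed(:, l, n) = stage_r(:, n, l)
     end do
  end do

  ! Step 3: alltoall across nodes (column communicator), one large message per node
  call MPI_ALLTOALL(packed,  count*ppn, MPI_REAL, &
                    recvbuf, count*ppn, MPI_REAL, cross_comm, ierr)

  deallocate(stage_s, stage_r, packed)
end subroutine hierarchical_a2a

Each rank then sends ppn − 1 on-node messages and nnodes − 1 cross-node messages of count × ppn words, instead of P − 1 small messages in the flat alltoall.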
Alltoall using One-sided MPI
call MPI_WIN_CREATE(sndbuf, size, sizeofreal, &
                    MPI_INFO_NULL, MPI_COMM_WORLD, sndwin, ierr)

call MPI_WIN_FENCE(0, sndwin, ierr)

call MPI_GET_ADDRESS(sndbuf(1,1,1), base, ierr)        ! window base address (needed below)
call MPI_GET_ADDRESS(sndbuf(1,1,taskid+1), disp, ierr)
disp = MPI_AINT_DIFF(disp, base)
disp = disp / sizeofreal

do zg = 1, numtasks
   call MPI_GET(rcvbuf(1,1,zg), nx*my, MPI_REAL, &
                zg-1, disp, nx*my, MPI_REAL, sndwin, ierr)
end do
call MPI_WIN_FENCE(0, sndwin, ierr)
None of these methods seem to perform better than simple MPI alltoall
CONCLUSIONS
Successful development of a highly scalable GPU-accelerated algorithm for turbulence and 3D FFT, exploiting unique features of Summit
— fine-grained asynchronism allows problem sizes beyond GPU memory
CUDA Fortran implementation gives GPU speedup > 4 at smaller problem sizes, likely to hold up to the largest problem size feasible on Summit (18432³)
— Host-to-device and device-to-host data copies optimized using the "zero-copy" approach and cudaMemCpy2DAsync, depending on message size
Communication costs still a major factor
— bandwidth measured rigorously and various approaches tested
Working towards OpenMP 4.5 or higher implementation
Towards extreme-scale simulations w/ INCITE 2019 + Early Science Program.
REFERENCES
[1] P. K. Yeung, X. M. Zhai, and K. R. Sreenivasan. Extreme events in computational turbulence. Proc.
Nat. Acad. Sci., 112:12633–12638, 2015.
[2] P. K. Yeung, K. R. Sreenivasan, and S. B. Pope. Effects of finite spatial and temporal resolution on
extreme events in direct numerical simulations of incompressible isotropic turbulence. Phys. Rev.
Fluids, 3:064603, 2018.
[3] M. P. Clay, D. Buaria, P. K. Yeung, and T. Gotoh. GPU acceleration of a petascale application for
turbulent mixing at high Schmidt number using OpenMP 4.5. Comput. Phys. Commun., 228:100–114,
2018.
[4] G. Ruetsch and M. Fatica. CUDA Fortran for Scientists and Engineers. Morgan Kaufmann Publishers,
2014.
[5] D. Appelhans. Tricks, Tips, and Timings: The Data Movement Strategies You Need to Know. In GPU
Technology Conference, 2018.
[6] K. Ravikumar, D. Appelhans, P. K. Yeung, and M. P. Clay. An asynchronous algorithm for massive
pseudo-spectral simulations of turbulence on Summit. In Poster presented at the OLCF Users Meeting,
2018.
STRIDED COPIES
real, allocatable, device :: devicebuf(:,:)
real, device, pointer :: d_hostbuf(:,:)    ! pointer (not allocatable) so C_F_POINTER can associate it
type(C_DEVPTR) :: d_temp

! get a device pointer that aliases the host array
! (hostbuf must be pinned, page-locked host memory)
ierr = cudaHostGetDevicePointer(d_temp, C_LOC(hostbuf(1,1)), 0)
! convert to a Fortran pointer
call C_F_POINTER(d_temp, d_hostbuf, [nx,nz])

call zc_h2d<<<grid,tblock,0,trans_stream>>>(devicebuf, d_hostbuf, nx, nxf, nz, ix)

attributes(global) subroutine zc_h2d(devicebuf, hostbuf, nx, nxf, nz, ix)

   real, device, intent(INOUT) :: devicebuf(nxf,nz), hostbuf(nx,nz)
   integer, value, intent(IN) :: nx, nxf, nz, ix
   integer :: z, x

   ! blocks stride over z-lines, threads stride over the nxf needed x-values,
   ! reading directly from host memory over NVLink (zero copy)
   do z = blockIdx%x, nz, gridDim%x
      do x = threadIdx%x, nxf, blockDim%x
         devicebuf(x,z) = hostbuf(ix+x-1, z)
      end do
   end do

end subroutine zc_h2d
SECTIONAL DATA COPIES USING OPENMP
How to use OpenMP to manage memory?
N/P planes of memory on the CPU
Use only 3 planes of memory on the GPU
Simple to process one plane at a time

do zp=1,mz
   !$OMP TARGET ENTER DATA MAP(to:u(1:N,1:N,zp))

   !$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO
   ....
   !$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO

   !$OMP TARGET EXIT DATA MAP(from:u(1:N,1:N,zp))
end do
3 planes on GPU using a structure of pointers

type ptr_struc
   real(kind=4), pointer :: ptr(:,:)
end type ptr_struc
type(ptr_struc) :: p_a(3)

p_a(1)%ptr => a(1:nx,1:ny,1)   ! similar for p_a(2:3)

!$OMP TARGET DATA MAP(to:p_a(:))
do zp=1,mz
   ! copy in next plane
   p_a(next)%ptr => a(:,:,zp+1)
   !$OMP TARGET ENTER DATA MAP(to:p_a(next)%ptr)

   ! Compute on current plane

   ! copy out previous plane
   !$OMP TARGET EXIT DATA MAP(from:p_a(prev)%ptr)
end do
!$OMP END TARGET DATA
More Related Content

What's hot

Scalable Parallel Computing on Clouds
Scalable Parallel Computing on CloudsScalable Parallel Computing on Clouds
Scalable Parallel Computing on CloudsThilina Gunarathne
 
Realtime 3D Visualization without GPU
Realtime 3D Visualization without GPURealtime 3D Visualization without GPU
Realtime 3D Visualization without GPUTobias G
 
FPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGAFPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGAHiroki Nakahara
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceDavid Gleich
 
Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666Tiago Sousa
 
Lecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural NetworksLecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural NetworksSang Jun Lee
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-renderingmistercteam
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex PerrierAlexis Perrier
 
Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...jaliyae
 
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...Hiroki Nakahara
 
PhD defense talk (portfolio of my expertise)
PhD defense talk (portfolio of my expertise)PhD defense talk (portfolio of my expertise)
PhD defense talk (portfolio of my expertise)Gernot Ziegler
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
 
End of Sprint 5
End of Sprint 5End of Sprint 5
End of Sprint 5dm_work
 
Secrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologySecrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologyTiago Sousa
 

What's hot (20)

Scalable Parallel Computing on Clouds
Scalable Parallel Computing on CloudsScalable Parallel Computing on Clouds
Scalable Parallel Computing on Clouds
 
Apache Nemo
Apache NemoApache Nemo
Apache Nemo
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
Realtime 3D Visualization without GPU
Realtime 3D Visualization without GPURealtime 3D Visualization without GPU
Realtime 3D Visualization without GPU
 
FPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGAFPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGA
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
 
Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666
 
Multidimensional RNN
Multidimensional RNNMultidimensional RNN
Multidimensional RNN
 
Lecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural NetworksLecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural Networks
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-rendering
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
 
Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...
 
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
 
PhD defense talk (portfolio of my expertise)
PhD defense talk (portfolio of my expertise)PhD defense talk (portfolio of my expertise)
PhD defense talk (portfolio of my expertise)
 
GPU - how can we use it?
GPU - how can we use it?GPU - how can we use it?
GPU - how can we use it?
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
End of Sprint 5
End of Sprint 5End of Sprint 5
End of Sprint 5
 
Europy17_dibernardo
Europy17_dibernardoEuropy17_dibernardo
Europy17_dibernardo
 
Secrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologySecrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics Technology
 

Similar to Fine grained asynchronism for pseudo-spectral codes - with application to turbulence

Slide 1
Slide 1Slide 1
Slide 1butest
 
Slide 1
Slide 1Slide 1
Slide 1butest
 
Jeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYCJeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYCMLconf
 
High Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud TechnologiesHigh Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud Technologiesjaliyae
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUsSri Ambati
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2Junli Gu
 
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...Larry Smarr
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
Nucleon TMD Contractions in Lattice QCD using QUDA
Nucleon TMD Contractions in Lattice QCD using QUDANucleon TMD Contractions in Lattice QCD using QUDA
Nucleon TMD Contractions in Lattice QCD using QUDAChristos Kallidonis
 
Programming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityProgramming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityLinaro
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationGeoffrey Fox
 
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...LEGATO project
 
Programming Distributed Collective Processes for Dynamic Ensembles and Collec...
Programming Distributed Collective Processes for Dynamic Ensembles and Collec...Programming Distributed Collective Processes for Dynamic Ensembles and Collec...
Programming Distributed Collective Processes for Dynamic Ensembles and Collec...Roberto Casadei
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 

Similar to Fine grained asynchronism for pseudo-spectral codes - with application to turbulence (20)

Slide 1
Slide 1Slide 1
Slide 1
 
Slide 1
Slide 1Slide 1
Slide 1
 
Jeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYCJeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYC
 
High Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud TechnologiesHigh Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud Technologies
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUs
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
20090720 smith
20090720 smith20090720 smith
20090720 smith
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Nucleon TMD Contractions in Lattice QCD using QUDA
Nucleon TMD Contractions in Lattice QCD using QUDANucleon TMD Contractions in Lattice QCD using QUDA
Nucleon TMD Contractions in Lattice QCD using QUDA
 
Programming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityProgramming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & Productivity
 
Hz2514321439
Hz2514321439Hz2514321439
Hz2514321439
 
Hz2514321439
Hz2514321439Hz2514321439
Hz2514321439
 
Hz2514321439
Hz2514321439Hz2514321439
Hz2514321439
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
 
Programming Distributed Collective Processes for Dynamic Ensembles and Collec...
Programming Distributed Collective Processes for Dynamic Ensembles and Collec...Programming Distributed Collective Processes for Dynamic Ensembles and Collec...
Programming Distributed Collective Processes for Dynamic Ensembles and Collec...
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 

More from Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency programGanesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and VerilogGanesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISAGanesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsGanesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsGanesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsGanesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Ganesan Narayanasamy
 

More from Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 

Recently uploaded

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Fine grained asynchronism for pseudo-spectral codes - with application to turbulence

  • 1. Fine grained asynchronism for pseudo-spectral codes - with application to turbulence Kiran Ravikumar1 , David Appelhans2 , P. K. Yeung1 kiran.r@gatech.edu 1Georgia Institute of Technology, 2IBM research 3rd OpenPOWER Academia Discussion Group Workshop Dallas, TX; November 10, 2018 K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 1 / 23
  • 2. INTRODUCTION AND MOTIVATION Petascale computing mainly via massive distributed parallelism Extreme core counts, w/ or w/o some shared memory (MPI+OpenMP) Scalability often ultimately limited by communication costs Now in pre-Exascale era, heterogeneous architectures are dominant Accelerators provide most of computing power, at cost of non-trivial data movements But communication may still be an issue (perhaps even more so) On the largest machines, adaptation to hardware (e.g. GPUs) is a must How do we do it, on Summit? [for turbulence, our domain science] Multidimensional FFT provides a case study of wide relevance Fine-grained asynchronism: while managing memory and data movements K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 2 / 23
  • 3. OUTLINE OF THIS TALK Turbulence and the need for extreme scale computing – pseudo-spectral codes and Fourier transforms How to make use of GPUs effectively (specifically, on Summit) – while taking full advantage of large CPU memory (fat nodes) – optimizing data movements between CPU and GPU – asynchronism at a fine-grained level, using CUDA Fortran Performance recorded on Summit Phase 1 (to date, up to 1024 nodes) Other thoughts: such as optimizing communication K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 3 / 23
  • 4. WHY DO WE NEED HUGE TURBULENCE SIMULATIONS? Turbulence is found everywhere in nature and engineering Disorderly fluctuations over a wide range of scales in 3D space and time To advance understanding, obtain numerical solutions of governing equations Capture wide range of scales, resolve small scales better than in the past, cover parameter regimes previously relatively unexplored, etc. K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 4 / 23
  • 5. DIRECT NUMERICAL SIMULATIONS (DNS) Compute all scales, according to (here incompressible) Navier-Stokes equations · u = 0 ∂u/∂t + u · u = − (p/ρ) + ν 2 u + f Study fundamental behavior of turbulent flows using simplified geometries — 3D Fourier decomposition on periodic domain appropriate for the small scales Pseudo-spectral in space for FT of nonlinear terms: instead of expensive convolution sums, perform multiplication in physical space; take FFTs back and forth Explicit second order Runge-Kutta time stepping scheme On Blue Waters at NCSA: production simulation at 81923 grid resolution (Yeung et al. PNAS 2015), with some short tests at 122883 and 163843 INCITE 2019 on Summit: our target is 184323 : i.e. O(6) trillion grid points K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 5 / 23
  • 6. HOW TO COMPUTE 3D FFTS? Transforms between physical (x, y, z) and wavenumber (kx, ky, kz) spaces, any variable φ φ(x) = k ˆφ(k) exp(ik · x) ; ˆφ(k) = x φ(x) exp(−ik · x) 1D FFTs, 1 direction at a time, but complete lines of data must be available — use FFTW (on CPUs) or cuFFT (on GPUs) Domain decomposition (1D or 2D) among MPI processes — up to N vs N2 MPI tasks allowed Collective communication (of “alltoall” type) needed to complete full 3D FFT cycle — protocols for faster communication are of broad interest Use of strided (non-contiguous) arrays may require packing and unpacking, which are local (serial) in memory but still time consuming K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 6 / 23
  • 7. DOMAIN DECOMPOSITION: 1D OR 2D 1D: Each MPI rank holds a slab of size N × N × N/P, i.e. a collection of N/P planes of data — one global transpose (e.g. x−y to x−z) 2D: Pencils, of size N × N/P1 × N/P2 where P1 × P2 = P define a 2D Cartesian process grid (whose shape matters) — two transposes, within row and column communicators Massive distributed memory parallelism favors Pencils Fatter node architectures : return to 1D decomposition? – Fewer MPI and associated pack and unpack operations – Fewer nodes (and MPI ranks) involved in communication – Large host memory allows for many slabs per node Slabs decomposition y x z P0 P1 P2 P3 Pencils decomposition y x z P0 P1 P2 P3 K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 7 / 23
  • 8. MULTI-THREADED SYNCHRONOUS CPU ALGORITHM FFT : (kx, ky, kz) → (kx, y, kz) Pack MPI Alltoall Unpack FFT : (kx, y, kz) → (x, y, z) Similar for (x, y, z) → (kx, ky, kz) FFTW library to carry out 1-D transforms OpenMP threads to speedup FFT, pack & unpack y x x-y slab z x x-z slab y x x-y slab A2A A2A K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 8 / 23
  • 9. SYNCHRONOUS GPU ALGORITHM H2D copy of complete slab FFT : (kx, ky, kz) → (kx, y, kz) Pack Device to Host MPI Alltoall Host to Device Unpack FFT : (kx, y, kz) → (x, y, z) Similar for (x, y, z) → (kx, ky, kz) Steps involved in transform from wavenumber to physical space shown Entire slab of data is copied from host to device Pack and unpack data on GPU – Additional buffers occupy limited GPU memory – Faster than on CPU Similar operations performed to transform to wavenumber from physical space with the last step being a device to host copy of the entire slab How to handle problem sizes that do not fit on the GPU? K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 9 / 23
  • 10. HOW CAN SUMMIT HELP? 200 PF : Fastest supercomputer in the world 4608 nodes each with 2 IBM POWER9 CPU’s & 6 NVIDIA Volta GPU’s 512 GB host memory per node & 16 GB GPU memory : large problem size 50 GB/s peak CPU-GPU NVLink B/W : faster copies 23 GB/s Node injection B/W with Non-blocking fat tree topology : faster communication Enables 184323 DNS using 3072 nodes Run massive problem sizes on fewer nodes than other machines currently available! K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 10 / 23
  • 11. ASYNCHRONOUS GPU ALGORITHM Operations in same row : asynchronous nplanes : 1 → mz H2D Compute D2H (Pack) A2A MPI_WAITALL nplanes : 1 → my H2D (Unpack) Compute D2H y x x-y Slab mz Colors match operations in chart Only 3 planes on GPU Problem size not limited by GPU memory (can also split planes) Overlap operation across 3 planes CUDA streams for async progress Compute (compute stream) & NVLink (transfer stream) overlapped by MPI K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 11 / 23
  • 12. ASYNCHRONOUS EXECUTION USING CUDAFORTRAN First plane is copied in before entering the following loop. 1 do zp=1,mz 2 next = mod(zp+1,3); current = mod(zp,3); previous = mod(zp−1,3); comm = mod(zp−2,3); 3 ! if any of above are zero, set to 3, negative cases are not considered 4 5 ! Host−>Device copy of plane zp+1 (NEXT) using cudaMemCpy or similar in transfer stream 6 ierr =cudaEventRecord(HtoD(next),trans_stream) 7 8 ierr =cudaStreamWaitEvent(comp_stream,HtoD(current),0) 9 ! Compute on zp plane (CURRENT) in compute stream 10 ierr =cudaEventRecord(compute(current),comp_stream) 11 12 ierr =cudaStreamWaitEvent(trans_stream,compute(previous),0) 13 ! Device−>Host copy of plane zp−1 (PREVIOUS) using cudaMemCpy or similar in transfer stream 14 ierr =cudaEventRecord(DtoH(previous),trans_stream) 15 16 ierr =cudaEventSynchronize(DtoH(comm)) 17 ! MPI Alltoall on plane zp−2 copied out in the previous iteration 18 end do Last plane is copied out after the loop. MPI on last 2 planes are performed. K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 12 / 23
  • 13. NON-CONTIGUOUS DATA COPIES Non-contiguous memory copy b/w host & device Pack and transfer Many instances of cudaMemCpy, each with small contiguous msg - high overhead Host - u Device d_u CoalescedMemory nx nxf nz u(1 : nxf, :) ←→ d_u(:, :) nxf = nx 4 Is there a faster way? Zero copy cudaMemCpy2DAsync Zero copy (D. Appelhans GTC 2018) - accessing host resident pinned memory directly from GPU without having to copy the data to the device beforehand (i.e. there are zero device copies) cudaMemCpy2DAsync - 1 ! call cudaMemCpy2DAsync (dst,dpitch,src,spitch,& 2 ! width, height ,kind,stream) 3 call cudaMemCpy2DAsync (d_u,nxf,u,nx,& 4 nxf,nz,kind,stream) K. Ravikumar, D. Appelhans, P. K. Yeung (GT, IBM) Fine grained asynchronism for pseudo-spectral codes November 10, 2018 13 / 23
ZERO COPY AND CUDAMEMCPY2DASYNC
[Plot: H2D and D2H transfer time (ms, 0-250) vs coalesced message size (6 KB to 1.5 MB), comparing multiple cudaMemcpyAsync calls, zero copy (symbols), and cudaMemcpy2DAsync (dashed)]
Zero copy preferred when the coalesced message size is small (< 10 KB for 18432³)
16 thread blocks are sufficient to reach maximum NVLink throughput; the remaining GPU resources can be used for computation instead
Ideally, use cudaMemcpy2DAsync to avoid using up GPU resources
IS THERE ASYNCHRONOUS EXECUTION?
[NVPROF timeline for the asynchronous GPU code using zero copy, over one Runge-Kutta substep: GPU computation stream (bottom) and NVLink transfer stream (top, almost continuous); MPI_WAITALL, compute, and NVLink transfers shown in different colors]
Zero-copy kernels (dark green) also incur significant cost
Further operations have to wait for the previous alltoall to complete
Computations begin once the data transfer on the previous buffer is done
PERF. OF MULTI-THREADED CPU & ASYNC. GPU CODES
Problem size and run configuration: 2 nodes - 1536³, 16 nodes - 3072³, 128 nodes - 6144³, 1024 nodes - 12288³
6 MPI tasks per node & 7 threads per task
[Plots: fraction of time per step vs node count (2 to 1024) for the CPU code (tmpi/ttotal, tcomp/ttotal) and the GPU code (tmpi/ttotal, tcomp+copy/ttotal)]
Computations are accelerated using GPUs, while the communication percentage increases
SCALING & SPEEDUP OVER MULTI-THREADED CPU CODE
Perf. of the async GPU and CPU DNS codes on Summit Phase-1 (Early Science Proposal (ESP))
[Plot: time per step (s) vs # nodes, log-log, for the CPU and GPU codes; dashed line denotes ideal weak scaling; GPU-to-CPU speedups of 3.65X, 2.90X, 3.26X, and 3.50X labeled in the plot]

Time per step (s):
# nodes   Prob. size   (a) CPU   (b) Async GPU (zero copy)
2         1536³        31.70     8.67
16        3072³        35.69     12.32
128       6144³        46.68     14.34
1024      12288³       77.60     22.15
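The speedup labels in the plot follow directly from the table: for example, 31.70 s / 8.67 s ≈ 3.65X at 2 nodes and 77.60 s / 22.15 s ≈ 3.50X at 1024 nodes.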
DOES CUDAMEMCPY2DASYNC IMPROVE PERFORMANCE?
[NVPROF timelines over a 60 ms period showing the asynchronous behavior of the GPU code; legend: NVLink, Compute, Zerocopy-D2H, Zerocopy-H2D, Unpack, cudaMemcpy2DAsync. Top: (a) using zero-copy kernels. Bottom: (b) using cudaMemcpy2DAsync instead of the D2H zero-copy kernel]
(a): GPU resources are shared between zero-copy and compute kernels, slowing down both
(b): GPU resources are available for compute, speeding up both data transfers and compute
(b): Some zero-copy kernels are still needed for transfers with a more complicated striding pattern, with minimal performance effect
PERFORMANCE IMPROVEMENTS & PROJECTIONS
Perf. of the async GPU code using cudaMemcpy2DAsync instead of some zero-copy kernels
[Plot: time per step (s) vs # nodes, log-log, for CPU, GPU-ESP, GPU-Modified, and GPU-Projection; speedups over CPU of 6.25X, 4.68X, 5.56X, and 5.99X labeled]

Time per step (s):
# nodes   Prob. size   (a) zero copy   (b) cudaMemcpy2DAsync
2         1536³        8.67            5.07
16        3072³        12.32           7.63
128       6144³        14.34           8.39
1024      12288³       22.15           12.96

The projected performance for 128 and 1024 nodes is estimated assuming the same weak-scaling behavior as seen in the earlier runs (GPU-ESP).
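As before, the labels are the ratio of the CPU time to the (modified) GPU time per step: 31.70 s / 5.07 s ≈ 6.25X at 2 nodes and, using the projection, 77.60 s / 12.96 s ≈ 5.99X at 1024 nodes.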
MPI ALLTOALL COLLECTIVE PERFORMANCE
Environment variables:
PAMI_ENABLE_STRIPING=0 – disable splitting data across network devices
PAMI_IBV_ADAPTER_AFFINITY=1 – use the nearest network device
PAMI_IBV_DEVICE_NAME="mlx5_0:1" – use this device if on socket 0
PAMI_IBV_DEVICE_NAME_1="mlx5_3:1" – use this device if on socket 1
> 10% perf. improvement compared to the default environment

Eff. BW = 2 × (size of sndbuf in GB) × tpn / time = [2 × N × (N/P) × P × (N/P) × 4 × tpn] / (time × 1024³)
(2 – read+write; 4 – bytes per MPI_REAL; 1024³ – bytes to GB; tpn – tasks per node; P – total MPI tasks)

do istep=1,nsteps

   call MPI_BARRIER(MPI_COMM_WORLD,ierr)
   rtime1=MPI_WTIME()
   call MPI_ALLTOALL(sndbuf,nx*my*mz,&
        MPI_REAL,rcvbuf,nx*my*mz,&
        MPI_REAL,MPI_COMM_WORLD,ierr)
   rtime2=MPI_WTIME()
   time_step(istep)=rtime2-rtime1

end do
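For illustration (ours, not on the slide), the effective per-node bandwidth can be evaluated from the timings collected in the loop above; tpn is the number of tasks per node, assumed to be 6 here:

real(8) :: avg_time, eff_bw
integer :: tpn

tpn = 6                                               ! tasks per node (assumed)
avg_time = sum(time_step(1:nsteps)) / dble(nsteps)    ! mean alltoall time per step
! 2 for read+write, 4 bytes per MPI_REAL, 1024**3 converts bytes to GB
eff_bw = 2.0d0 * dble(nx) * dble(my) * dble(mz) * dble(numtasks) * 4.0d0 * dble(tpn) &
         / (avg_time * 1024.0d0**3)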
HOW TO IMPROVE ALLTOALL PERFORMANCE?
Tested different methods:
Blocking MPI Alltoall
Non-blocking MPI Alltoall
Non-blocking sends and receives
One-sided MPI
Hierarchical method

Hierarchical Alltoall (A2A) (see the sketch below):
A2A on node (row communicator)
Pack to form contiguous messages
A2A across nodes (column communicator)

Alltoall using one-sided MPI:

call MPI_WIN_CREATE(sndbuf,size,sizeofreal,&
     MPI_INFO_NULL,MPI_COMM_WORLD,sndwin,ierr)

call MPI_WIN_FENCE(0,sndwin,ierr)

! base holds the starting address of sndbuf, obtained earlier with MPI_GET_ADDRESS
call MPI_GET_ADDRESS(sndbuf(1,1,taskid+1),disp,ierr)
disp=MPI_AINT_DIFF(disp,base)
disp=disp/sizeofreal

do zg=1,numtasks
   call MPI_GET(rcvbuf(1,1,zg),nx*my,MPI_REAL,&
        zg-1,disp,nx*my,MPI_REAL,sndwin,ierr)
end do
call MPI_WIN_FENCE(0,sndwin,ierr)

None of these methods performed better than the simple MPI alltoall
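A minimal sketch (our illustration, not the production code) of the hierarchical idea: split MPI_COMM_WORLD into an on-node "row" communicator and an across-node "column" communicator, then exchange in two stages. The tasks-per-node count and buffer sizes are assumptions, and the re-pack between the stages (which makes the per-node messages contiguous) is omitted.

program hier_a2a
  use mpi
  implicit none
  integer, parameter :: tpn = 6, chunk = 1024   ! tasks per node and elements per destination (assumed)
  integer :: ierr, rank, nproc, node, local, nnodes
  integer :: row_comm, col_comm
  real(4), allocatable :: sndbuf(:), midbuf(:), rcvbuf(:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)   ! nproc assumed divisible by tpn
  node   = rank / tpn            ! node index
  local  = mod(rank, tpn)        ! rank within the node
  nnodes = nproc / tpn

  ! row communicator: tasks sharing a node; column communicator: one task per node
  call MPI_COMM_SPLIT(MPI_COMM_WORLD, node,  local, row_comm, ierr)
  call MPI_COMM_SPLIT(MPI_COMM_WORLD, local, node,  col_comm, ierr)

  allocate(sndbuf(chunk*nproc), midbuf(chunk*nproc), rcvbuf(chunk*nproc))
  sndbuf = real(rank)

  ! stage 1: alltoall within the node (shared-memory transport)
  call MPI_ALLTOALL(sndbuf, chunk*nnodes, MPI_REAL, &
                    midbuf, chunk*nnodes, MPI_REAL, row_comm, ierr)

  ! stage 2: pack midbuf so the data for each remote node is contiguous (omitted here)

  ! stage 3: alltoall across nodes, with fewer and larger messages
  call MPI_ALLTOALL(midbuf, chunk*tpn, MPI_REAL, &
                    rcvbuf, chunk*tpn, MPI_REAL, col_comm, ierr)

  call MPI_FINALIZE(ierr)
end program hier_a2a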
CONCLUSIONS
Successful development of a highly scalable GPU-accelerated algorithm for turbulence and 3D FFT exploiting unique features of Summit
– fine-grained asynchronism allows problem sizes beyond GPU memory
CUDA Fortran implementation gives GPU speedup > 4 at smaller problem sizes, likely to hold up to the largest problem size feasible on Summit (18432³)
– host-to-device and device-to-host data copies optimized using the "zero-copy" approach and cudaMemcpy2DAsync, depending on message size
Communication costs are still a major factor
– bandwidth measured rigorously and various approaches tested
Working towards an OpenMP 4.5 (or higher) implementation
Towards extreme-scale simulations with INCITE 2019 + Early Science Program
REFERENCES
[1] P. K. Yeung, X. M. Zhai, and K. R. Sreenivasan. Extreme events in computational turbulence. Proc. Nat. Acad. Sci., 112:12633–12638, 2015.
[2] P. K. Yeung, K. R. Sreenivasan, and S. B. Pope. Effects of finite spatial and temporal resolution on extreme events in direct numerical simulations of incompressible isotropic turbulence. Phys. Rev. Fluids, 3:064603, 2018.
[3] M. P. Clay, D. Buaria, P. K. Yeung, and T. Gotoh. GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5. Comput. Phys. Commun., 228:100–114, 2018.
[4] G. Ruetsch and M. Fatica. CUDA Fortran for Scientists and Engineers. Morgan Kaufmann Publishers, 2014.
[5] D. Appelhans. Tricks, Tips, and Timings: The Data Movement Strategies You Need to Know. GPU Technology Conference, 2018.
[6] K. Ravikumar, D. Appelhans, P. K. Yeung, and M. P. Clay. An asynchronous algorithm for massive pseudo-spectral simulations of turbulence on Summit. Poster presented at the OLCF Users Meeting, 2018.
STRIDED COPIES

real, allocatable, device :: devicebuf(:,:)
real, device, pointer :: d_hostbuf(:,:)
type(C_DEVPTR) :: d_temp

! get a device pointer that points to the (pinned) host array
ierr = cudaHostGetDevicePointer(d_temp,&
       C_LOC(hostbuf(1,1)),0)
! convert to a Fortran device pointer
call C_F_POINTER(d_temp,d_hostbuf,[nx,nz])

! launch the zero-copy H2D kernel in the transfer stream
call zc_h2d<<<grid,tblock,0,trans_stream>>>&
     (devicebuf,d_hostbuf,nx,nxf,nz,ix)

attributes(global) subroutine zc_h2d(devicebuf,&
           hostbuf,nx,nxf,nz,ix)

  real, device, intent(INOUT) :: devicebuf(nxf,nz),&
                                 hostbuf(nx,nz)
  integer, value, intent(IN)  :: nx,nxf,nz,ix
  integer :: z,x

  ! blocks stride over z planes; threads stride over the x section
  do z=blockIdx%x,nz,gridDim%x
     do x=threadIdx%x,nxf,blockDim%x
        devicebuf(x,z)=hostbuf(ix+x-1,z)
     end do
  end do

end subroutine zc_h2d
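Declarations the snippet above relies on (a sketch; the host buffer is page-locked so that cudaHostGetDevicePointer can map it, and a small launch grid is used since, per the earlier zero-copy comparison, 16 thread blocks already saturate NVLink):

use cudafor
integer, parameter :: nx = 4096, nz = 1024, nxf = nx/4, ix = 1   ! illustrative sizes and offset
real, allocatable, pinned :: hostbuf(:,:)      ! page-locked host buffer
type(dim3) :: grid, tblock
integer :: ierr

allocate(hostbuf(nx,nz))
grid   = dim3(16,1,1)      ! a few blocks suffice to saturate NVLink
tblock = dim3(128,1,1)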
SECTIONAL DATA COPIES USING OPENMP
How to use OpenMP to manage memory?
N/P planes of memory on the CPU; use only 3 planes of memory on the GPU

Simple approach: process one plane at a time

do zp=1,mz
   !$OMP TARGET ENTER DATA MAP(to:u(1:N,1:N,zp))

   !$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO
   ....
   !$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO

   !$OMP TARGET EXIT DATA MAP(from:u(1:N,1:N,zp))
end do

3 planes on the GPU using a structure of pointers

type ptr_struc
   real(kind=4), pointer :: ptr(:,:)
end type ptr_struc
type(ptr_struc) :: p_a(3)

p_a(1)%ptr => a(1:nx,1:ny,1)   ! similar for p_a(2:3)

!$OMP TARGET DATA MAP(to:p_a(:))
do zp=1,mz
   ! copy in next plane (next/prev cycle through 1-3, as in the CUDA Fortran loop)
   p_a(next)%ptr => a(:,:,zp+1)
   !$OMP TARGET ENTER DATA MAP(to:p_a(next)%ptr)

   ! Compute on current plane

   ! copy out previous plane
   !$OMP TARGET EXIT DATA MAP(from:p_a(prev)%ptr)
end do
!$OMP END TARGET DATA