Balena User Group Meeting
3rd February 2017
vasp-gpu on Balena:
Usage and Some Benchmarks
Ø The	VASP	SCF	cycle	in	a	nutshell
Ø Parallelisation	in	VASP
o Workload	and	data	distribution
o Parallelisation	control	parameters
o Some	rules	of	thumb	for	optimising	parallel	scaling
Ø The	GPU	(CUDA)	port	of	VASP
o Compiling	and	running
o Features
o Some	initial	benchmarks
Ø Thoughts	and	discussion	points
Balena User	Group	Meeting,	February	2017	|	Slide	2
Overview
[Figure: the SCF cycle. Sources: http://www.iue.tuwien.ac.at/phd/goes/dissse14.html; S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011)]
Balena User	Group	Meeting,	February	2017	|	Slide	3
The	VASP	SCF	cycle	in	a	nutshell
Ø The	newest	versions	of	VASP	implement	four	levels	of	parallelism:
o k-point	parallelism:	KPAR
o Band	parallelism	and	data	distribution:	NCORE and	NPAR
o Parallelisation	and	data	distribution	over	plane-wave	coefficients	(=	FFTs;	done	over	
planes	along	NGZ):	LPLANE
o Parallelisation	of	some	linear-algebra	operations	using	ScaLAPACK (notionally	set	at	
compile	time,	but	can	be	controlled	at	runtime	using	LSCALAPACK)
Ø Effective	parallelisation	will…:
o …	minimise	(relatively	slow)	communication	between	MPI	processes,	…
o …	distribute	data	to	reduce	memory	requirements,	…
o …	and	make	sure	the	MPI	processes	have	enough	work	to	keep	them	busy
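Ø As a rough sketch (illustrative values only, not recommendations), the four levels above map onto INCAR tags as follows:
    KPAR = 2              # two k-point groups
    NCORE = 16            # 16 cores per band group (NPAR is then set implicitly)
    LPLANE = .TRUE.       # plane-wise distribution of the FFTs along NGZ
    LSCALAPACK = .TRUE.   # use ScaLAPACK for the linear-algebra steps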
Balena User	Group	Meeting,	February	2017	|	Slide	4
Parallelisation	in	VASP
[Diagram: MPI processes split into KPAR k-point groups, then NPAR band groups, then NGZ FFT groups (?)]
Ø Workload distribution over KPAR k-point groups, NPAR band groups and NGZ plane-wave coefficient (FFT) groups [not 100 % sure how this works…]
Balena User	Group	Meeting,	February	2017	|	Slide	5
Parallelisation:	Workload	distribution
[Diagram: data replicated across KPAR k-point groups, then distributed over NPAR band groups and NGZ FFT groups (?)]
Ø Data distribution over NPAR band groups and NGZ plane-wave coefficient (FFT) groups [also not 100 % sure how this works…]
Balena User	Group	Meeting,	February	2017	|	Slide	6
Parallelisation:	Data	distribution
Ø During	a	standard	DFT	calculation,	k-points	are	independent	->	k-point	parallelism	should
be	linearly	scaling,	although	perhaps	not	in	practice:	
https://www.nsc.liu.se/~pla/blog/2015/01/12/vasp-how-many-cores
Ø WARNING:	<#procs> must	be	divisible	by	KPAR,	but	the	parallelisation	is	via	a	round-
robin	algorithm,	so	<#k-points> does	not	need	to	be	divisible	by	KPAR ->	check	how	
many	irreducible k-points	you	have	(IBZKPT file)	and	set	KPAR accordingly
[Diagram: 3 k-points scheduled round-robin over KPAR groups.
 KPAR = 1: one group does k1, k2, k3 in rounds R1-R3 -> t = 3 [OK]
 KPAR = 2: round R1 does k1 + k2, round R2 does k3 with one group idle -> t = 2 [Bad]
 KPAR = 3: k1, k2 and k3 all in round R1 -> t = 1 [Good]]
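Ø Worked example (hypothetical numbers): 6 irreducible k-points in IBZKPT, 64 MPI processes
    KPAR = 2   # 64 is divisible by 2; ceil(6/2) = 3 fully-occupied rounds
    # KPAR = 4 also divides 64, but ceil(6/4) = 2 rounds, with 2 of the 4 groups idle in the last round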
Balena User	Group	Meeting,	February	2017	|	Slide	7
Parallelisation:	KPAR
NCORE :  number of cores per band group
NPAR :  number of bands treated simultaneously (i.e. the number of band groups)
NCORE = <#procs> / NPAR
Ø For the default NCORE = 1 (NPAR = <#procs>), the large number of band groups appears to increase memory pressure and to incur a substantial communication overhead
[Chart: measured speedups of 7.08×, 6.41× and 6.32×]
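Ø Example of the relation above (hypothetical job): 64 MPI processes split into 4 band groups
    NPAR = 4                                # 4 bands treated simultaneously
    NCORE = <#procs> / NPAR = 64 / 4 = 16   # 16 cores per band group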
Balena User	Group	Meeting,	February	2017	|	Slide	8
Parallelisation:	NCORE and	NPAR
Ø WARNING: VASP will increase the default NBANDS to the nearest multiple of the number of band groups
Ø Since the electronic minimisation scales as a power of NBANDS, this can backfire in calculations with a large NPAR (e.g. those requiring NPAR = <#procs>)
Cores | NBANDS (default) | NBANDS (adjusted)
 96   |       455        |       480
128   |       455        |       512
192   |       455        |       576
256   |       455        |       512
384   |       455        |       768
512   |       455        |       512
NBANDS = NELECT/2 + NIONS/2
Example system:
• 238 atoms w/ 672 electrons
• Default NBANDS = 672/2 + 238/2 = 455
NBANDS = (3/5) × NELECT + NMAG
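Ø Worked example from the table above (192 cores, default NPAR = 192):
    NBANDS = ceil(455 / 192) × 192 = 3 × 192 = 576   # ~27 % more bands than the 455 actually required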
Balena User	Group	Meeting,	February	2017	|	Slide	9
Parallelisation:	NCORE and	NPAR
Ø The	RMM-DIIS	(ALGO = VeryFast | Fast)	algorithm	involves	three	steps:
EDDIAG :			subspace	diagonalisation
RMM-DIIS :			electronic	minimisation
ORTHCH :			wavefunction orthogonalisation
Routine   | 312 atoms       | 624 atoms        | 1,248 atoms       | 1,872 atoms
EDDIAG    | 2.90 (18.64 %)  | 12.97 (22.24 %)  | 75.26 (26.38 %)   | 208.29 (31.31 %)
RMM-DIIS  | 12.39 (79.63 %) | 42.73 (73.27 %)  | 187.62 (65.78 %)  | 379.80 (57.10 %)
ORTHCH    | 0.27 (1.74 %)   | 2.62 (4.49 %)    | 22.36 (7.84 %)    | 77.11 (11.59 %)
Ø EDDIAG and ORTHCH formally scale as N³, and rapidly begin to dominate the SCF cycle time for large calculations
Ø A	good	ScaLAPACK library	can	improve	the	performance	of	these	routines	in	massively-
parallel	calculations
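Ø If the ScaLAPACK library itself misbehaves, it can also be switched off at runtime - a sketch, and whether this helps is very system-dependent:
    LSCALAPACK = .FALSE.   # fall back to the (serial) LAPACK routines for the diagonalisation/orthogonalisation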
See	also:	https://www.nsc.liu.se/~pla/blog/2014/01/30/vasp9k
Balena User	Group	Meeting,	February	2017	|	Slide	10
Parallelisation:	ScaLAPACK
Ø KPAR:	current	implementation	does	not	distribute	data	over	k-point	groups	->	KPAR =
N will	use	N× more	memory	than	KPAR = 1
Ø NPAR/NCORE:	data	is	distributed	over	band	groups	->	decreasing	NPAR/increasing	
NCORE will	considerably	reduce	memory	requirements
Ø NPAR takes	precedence	over	NCORE - if	you	use	“master”	INCAR files,	make	sure	you	
don’t	define	both
Ø The	defaults	for	NPAR/NCORE (NPAR = <#procs>,	NCORE = 1)	are	usually	a	poor	
choice	for	both	memory	requirements	and performance
Ø Band	parallelism	for	hybrid	functionals has	been	supported	since	VASP	5.3.5;	for	memory-
intensive	calculations,	it	is	a	good	alternative	to	underpopulating nodes
Ø LPLANE:	distributes	data	over	plane-wave	coefficients,	and	speeds	things	up	by	reducing	
communication	during	FFTs	- the	default	is	LPLANE = .TRUE.,	and	should	only	need	
to	be	changed	for	massively-parallel	architectures	(e.g.	BlueGene/Q)
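Ø A minimal sketch of the "NPAR takes precedence" point above (assuming 16-core nodes):
    NCORE = 16    # one band group per node; NPAR is derived from it
    # NPAR = 8    # leave unset - if both appear, NPAR wins and the NCORE line is ignored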
Balena User	Group	Meeting,	February	2017	|	Slide	11
Parallelisation:	Memory
Ø For	x86_64	IB	systems	(e.g.	Balena,	Archer…):
o Try	KPAR for	heavy	calculations	(e.g.	hybrids)
o Set	NPAR = (<#procs>/KPAR) or	NCORE = <#procs/node>
o 1	node/band	group	per	50	atoms;	may	want	to	use	2	nodes/50	atoms	for	hybrids,	or	
decrease	to	½	node	per	band	group	for	<	10	atoms
o Leave	LPLANE at	the	default	(.TRUE.)
o WARNING:	In	my	experience	of	Cray	systems	(Archer/XC30,	SiSu/XC40),	using	KPAR
sometimes	causes	VASP	to	hang	during	multistep	calculations	(e.g.	optimisations)
Ø For	the	IBM	BlueGene/Q	(STFC	Hartree Centre):
o Last	time	I	used	it,	the	Hartree machine	only	had	VASP	5.2.x	->	no	KPAR
o Try	to	choose	a	square	number	of	cores,	and	set	NPAR = sqrt(<#procs>)
o Consider	setting	LPLANE = .FALSE. if	<#procs> ≥	NGZ
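Ø Putting the x86_64 rules above together (an illustrative sketch, assuming 16-core nodes, a ~200-atom standard DFT calculation and a handful of irreducible k-points; all numbers are placeholders):
    #SBATCH --nodes=8   # 8 × 16 = 128 cores
    KPAR = 2            # 2 k-point groups of 64 cores each
    NCORE = 16          # 1 node per band group -> 4 band groups per k-point group (~1 per 50 atoms)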
Balena User	Group	Meeting,	February	2017	|	Slide	12
Parallelisation:	Some	rules	of	thumb
Ø GPU	computing	works	in	an	offload	model
Ø Programming	models	such	as	CUDA	and	OpenCL	provide	APIs	for:
o Copying	memory	to	and	from	the	GPU
o Compiling	kernel programs	to	run	on	the	GPU
o Setting	up	and	running	kernels	on	input	data
Ø Porting	codes	for	GPUs	involves	identifying	routines	that	can	be	efficiently	mapped	to	the	
GPU	architecture,	writing	kernels,	and	interfacing	them	to	the	CPU	code
Data
Data Program
Program
Run
Data
Data
CPU
GPU
Balena User	Group	Meeting,	February	2017	|	Slide	13
GPU	computing
Balena User	Group	Meeting,	February	2017	|	Slide	14
vasp-gpu
Ø Starting	from	the	February	2016	release	of	VASP	5.4.1,	the	distribution	includes	a	CUDA	
port	that	offloads	some	of	the	core	DFT	routines	onto	NVIDIA	GPUs
Ø A	culmination	of	research	at	the	University	of	Chicago,	Carnegie	Mellon and	ENS-Lyon,	and	
a	healthy	dose	of	optimisation	by	NVIDIA
Ø Three	papers	covering	the	implementation	and	testing:
o M.	Hacene et	al.,	J.	Comput.	Chem. 33,	2581	(2012),	10.1002/jcc.23096
o M. Hutchinson and M. Widom, Comput. Phys. Comm. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017
o S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010
Balena User	Group	Meeting,	February	2017	|	Slide	15
Because	sharing	is	caring...
https://github.com/JMSkelton/VASP-GPU-Benchmarking
Ø Easy(ish)	with	the	VASP	5.4.1	build	system:
o Load	cuda/toolkit (along	with	intel/compiler,	intel/mkl,	etc.)
o Modify	the	arch/makefile.include.linux_intel_cuda example
o Make	the	gpu and/or	gpu_ncl targets
intel/compiler/64/15.0.0.090
intel/mkl/64/11.2
openmpi/intel/1.8.4
cuda/toolkit/7.5.18
FC = mpif90
FCL = mpif90 -mkl -lstdc++
...
CUDA_ROOT := /cm/shared/apps/cuda75/toolkit/7.5.18
...
MPI_INC = /apps/openmpi/intel-2015/1.8.4/include/
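Ø The build itself is then roughly as follows (a sketch - module names as listed above; the gpu / gpu_ncl targets produce bin/vasp_gpu and bin/vasp_gpu_ncl):
    module load intel/compiler/64/15.0.0.090 intel/mkl/64/11.2 openmpi/intel/1.8.4 cuda/toolkit/7.5.18
    cp arch/makefile.include.linux_intel_cuda makefile.include   # edit CUDA_ROOT, MPI_INC, ... as above
    make gpu        # and/or: make gpu_ncl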
https://github.com/JMSkelton/VASP-GPU-Benchmarking/Compilation
Balena User	Group	Meeting,	February	2017	|	Slide	16
vasp-gpu:	Compilation
Ø Available	as	a	module	on	Balena:	module load untested vasp/intel/5.4.1
Ø To	use	vasp-gpu on	Balena,	you	need	to	request	a	GPU-equipped	node	and	perform	
some	basic	setup	tasks	in	your	SLURM	scripts
#SBATCH --partition=batch-acc
# Node w/ 1 k20x card.
#SBATCH --gres=gpu:1
#SBATCH --constraint=k20x
# Node w/ 4 k20x cards.
##SBATCH --gres=gpu:4
##SBATCH --constraint=k20x
if [ ! -d "/tmp/nvidia-mps" ] ; then
    mkdir "/tmp/nvidia-mps"
fi
export CUDA_MPS_PIPE_DIRECTORY="/tmp/nvidia-mps"

if [ ! -d "/tmp/nvidia-log" ] ; then
    mkdir "/tmp/nvidia-log"
fi
export CUDA_MPS_LOG_DIRECTORY="/tmp/nvidia-log"

nvidia-cuda-mps-control -d
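# Launch (a sketch, not from the original script): choose the number of MPI
# ranks as a multiple of the number of GPUs requested above; the binary name
# assumes the 'gpu' build target (vasp_gpu).
mpirun -np 4 vasp_gpu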
https://github.com/JMSkelton/VASP-GPU-Benchmarking/Scripts
Balena User	Group	Meeting,	February	2017	|	Slide	17
vasp-gpu:	Running	jobs
Ø Uses	cuFFT and	CUDA	ports	of	compute-heavy	parts	of	the	SCF	cycle
Ø ALGO = Normal | VeryFast (+	Fast)	w/	LREAL = Auto fully	supported,	along	
with	KPAR,	exact	exchange	and	non-collinear	spin
Ø ALGO = All | Damped and	the	GW routines	work,	but	are	not	optimised	(“passively	
supported”)
Ø LREAL = .FALSE., NCORE > 1 (i.e. NPAR != <#procs>) and electric fields are not supported (these will crash with an error)
Ø Currently	no	Gamma-only	version
Ø Future	roadmap:	Γ-point	optimisations	and	support	for	LREAL = .FALSE.,	vdW
functionals,	RPA/GW calculations	and	band	parallelism
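Ø A minimal INCAR sketch that stays within the fully-supported feature set (values illustrative):
    ALGO = VeryFast    # or Normal / Fast
    LREAL = Auto       # LREAL = .FALSE. is not yet supported
    NCORE = 1          # band parallelism is not yet supported
    KPAR = 2           # k-point parallelism is supported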
Balena User	Group	Meeting,	February	2017	|	Slide	18
vasp-gpu:	Features
Ø Each	MPI	process	allocates	its	own	set	of	cuFFT plans	and	CUDA	kernels,	distributing	
round-robin	among	the	available	GPUs
Ø The	size	of	the	CUDA	kernels	is	controlled	by	NSIM:	broadly,	NSIM ↑	=	better	GPU	
utilisation	but	higher	memory	requirements
Ø <#procs> should	be	a	multiple	of	<#GPUs>,	and	for	most	systems	you	will	probably	
end	up	underpopulating the	CPUs
[Diagram: round-robin mapping of MPI processes onto GPUs. Left: 4 processes on 2 GPUs - Procs 1 and 3 share GPU 1, Procs 2 and 4 share GPU 2. Right: 4 processes on 4 GPUs - one process per GPU]
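Ø Example (hypothetical 4-GPU node; binary name as before):
    mpirun -np 8 vasp_gpu   # 2 processes per GPU - balanced
    mpirun -np 6 vasp_gpu   # not a multiple of 4: two GPUs get 2 processes, two get only 1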
Balena User	Group	Meeting,	February	2017	|	Slide	19
vasp-gpu:	Load	balancing
Ø 64	to	1,024	atoms	in	a	random	cubic	arrangement;	ALGO = VeryFast w/	LREAL =
Auto,	k =	Γ;	1	GPU	node	w/	1	or	4	Tesla	K20x	cards	vs.	1	compute	node
Balena User	Group	Meeting,	February	2017	|	Slide	20
vasp-gpu:	Benchmarking
Ø 64	to	1,024	atoms	in	a	random	cubic	arrangement;	ALGO = VeryFast w/	LREAL =
Auto,	k =	Γ;	1	GPU	node	w/	1	or	4	Tesla	K20x	cards	vs.	1	compute	node
#MPI processes \ NSIM |   1   |  2   |  4   |  8   |  12  |  16  |  24  |  32  |  48  |  64
1                     | 13.52 | 8.88 | 8.15 | 7.82 | 7.77 | 7.76 | 7.72 | 7.74 | 7.81 | 7.89
2                     |  9.11 | 6.75 | 6.34 | 6.21 | 6.23 | 6.21 | 6.23 | 6.25 | 6.32 | OOM
4                     |  6.72 | 5.57 | 5.33 | 5.24 | 5.29 | 5.30 | OOM  | OOM  | OOM  | OOM
8                     |  6.01 | 5.26 | 5.14 | OOM  | OOM  | OOM  | OOM  | OOM  | OOM  | OOM
12                    |  OOM for all NSIM
16                    |  OOM for all NSIM
Balena User	Group	Meeting,	February	2017	|	Slide	21
vasp-gpu:	Benchmarking
Ø 64	to	1,024	atoms	in	a	random	cubic	arrangement;	ALGO = VeryFast w/	LREAL =
Auto,	k =	Γ;	1	GPU	node	w/	1	or	4	Tesla	K20x	cards	vs.	1	compute	node
Balena User	Group	Meeting,	February	2017	|	Slide	22
vasp-gpu:	Benchmarking
[Charts: speedup relative to vasp_gam (left) and vasp_std (right) as a function of # Atoms (64-512), for 1 GPU and 4 GPUs; speedup axis 0.0-5.0]
#MPI processes \ NSIM |     1     |    2    |    4    |    8    |   16
1                     | -14131.52 | -158.39 | -158.39 | -158.39 | -158.39
2                     | -14131.52 | -158.39 | -158.39 | -158.39 | -158.39
4                     | -14131.52 | -158.39 | -158.39 | -158.39 | -158.39
8                     | -14131.52 | -158.39 | -158.39 |    -    |    -
12                    |     -     |    -    |    -    |    -    |    -
16                    |     -     |    -    |    -    |    -    |    -
Ø 64	to	1,024	atoms	in	a	random	cubic	arrangement;	ALGO = VeryFast w/	LREAL =
Auto,	k =	Γ;	1	GPU	node	w/	1	or	4	Tesla	K20x	cards	vs.	1	compute	node
Balena User	Group	Meeting,	February	2017	|	Slide	23
vasp-gpu:	Benchmarking
Ø Three	papers	covering	the	implementation	and	testing…:
o M.	Hacene et	al.,	J.	Comput.	Chem. 33,	2581	(2012),	10.1002/jcc.23096
o M. Hutchinson and M. Widom, Comput. Phys. Comm. 183, 1422 (2012), 10.1016/j.cpc.2012.02.017
o S. Maintz et al., Comput. Phys. Comm. 182, 1421 (2011), 10.1016/j.cpc.2011.03.010
Ø …	and	a	couple	of	other	links:
o https://www.vasp.at/index.php/news/44-administrative/115-new-release-vasp-5-4-
1-with-gpu-support
o https://www.nsc.liu.se/~pla/blog/2015/11/16/vaspgpu/
o http://images.nvidia.com/events/sc15/SC5120-vasp-gpus.html
Balena User	Group	Meeting,	February	2017	|	Slide	24
Further	reading
Ø Understanding the parallelisation in VASP and applying a few simple rules of thumb can make your jobs scale better and use fewer resources (the default settings aren't great...)
Ø At	the	moment,	running	VASP	on	GPUs	is	mostly	for	interest:
o Does	not	benefit	all	types	of	job
o Requires	some	fiddly	testing	to	get	the	best	performance
o If	you	will	be	running	a	lot	of	a	suitable	workload	on	Balena (e.g.	large	MD	jobs),	it	
could	be	worth	the	effort
Ø Aims	for	further	benchmark	tests:
o What	types	of	job	benefit	from	GPU	acceleration?
o What	is	the	most	“balanced”	configuration	(1/2/4	GPUs/node)?
o Is	it	possible	to	run	over	multiple	GPU	nodes?
o Can	GPUs	be	a	cost/power	efficient	way	to	run	certain	VASP	jobs?
Balena User	Group	Meeting,	February	2017	|	Slide	25
Thoughts	and	discussion	points
Balena User	Group	Meeting,	February	2017	|	Slide	26
Acknowledgements
