Continuous integration, delivery, and deployment (CI/CD) is widely
used in DevOps communities, as it allows teams of all sizes to
deploy rapidly changing hardware and software resources quickly
and confidently.
CI/CD
Contents
Overview
Bluefield Run
OMB Tests
IMB Tests
MPICH Tests
NAS Tests
References
Overview
CI/CD falls under DevOps (the joining of development and operations) and combines the practices of continuous integration and continuous delivery. CI/CD automates much or all of the manual human intervention traditionally needed to get new code from a commit into production, such as build, test, and deploy, as well as infrastructure provisioning. With a CI/CD pipeline, developers can make changes to code that are then automatically tested and pushed out for delivery and deployment. With CI/CD, code releases happen faster.
Continuous integration is the practice of integrating
all code changes into the main branch of a shared
source code repository early and often, automatically
testing each change when a commit or merge happens,
and automatically kicking off a build. With
continuous integration, errors and security issues can
be identified and fixed more easily, and much earlier
in the software development lifecycle.
Continuous delivery is a software development
practice that works in conjunction with continuous
integration to automate the infrastructure
provisioning and application release process.
Once code has been tested and built as part of the CI process, continuous delivery takes over during the final stages to ensure it is packaged with everything it needs to be deployed to any environment at any time. Continuous delivery can cover everything from provisioning the infrastructure to deploying the application to the testing or production environment.
With continuous delivery, the software is built so that
it can be deployed to production at any time. Then
one can trigger the deployments manually or move to
continuous deployment where deployments are
automated as well.
Directory Tree
The pipeline lives in /global/home/users/rgopal/CITest/usr/bin. This is the base directory and where the GitLab runner is installed. From there, the directory structure looks like:
/global/home/users/rgopal/CITest/usr/bin
|-- src/
|   |-- ...          (git pull location)
|-- builds/
|   |-- <commit_hash>/
|       |-- gcc/
|       |-- install/
|-- logs/
|   |-- <commit_hash>/
|       |-- nas
|       |-- mpich
|       |-- imb
|       |-- omb
|-- tests/
|-- tmp/
    |-- <commit_hash>/
    |   |-- ...
    |-- mv2
src/
The location where the gitlab-runner clones the most recent commit. This directory is checked for changes at each new job. Do not change any files here in any of the jobs.

builds/
The install location of the built files from the most recent commit.

logs/
Where the testing logs live.

tests/
The binaries of all of the external tests (not built alongside mv2).

tmp/
Location of temporary files.
Pipeline Structure
Currently, the pipeline is divided into 4 phases:
build
test
verify
clean
Within each phase, separate jobs are executed at the same time. For each new job, the gitlab-runner makes a clone from master into a directory and tries to start from a clean slate.
build
For the build step, the entire repo is copied into $GITLAB_BASE_DIR/tmp/mv2. This is done because if we ran the build in the source directory, the following phases would notice that files have changed, try to do a reset, and eventually complain.
It then runs autogen, configure, make, and make install, installing to builds/<commit_hash>/gcc.
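A minimal sketch of the build job, assuming GitLab's predefined CI_PROJECT_DIR and CI_COMMIT_SHA variables and a standard autotools flow (the real script may differ):

# Copy the checkout out of the runner's working tree so later phases see a clean repo
cp -r "$CI_PROJECT_DIR" "$GITLAB_BASE_DIR/tmp/mv2"
cd "$GITLAB_BASE_DIR/tmp/mv2"
./autogen.sh
./configure --prefix="$GITLAB_BASE_DIR/builds/$CI_COMMIT_SHA/gcc"
make -j "$(nproc)"
make install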
test
This phase just submits batch jobs using sbatch. Currently, we're using the thor partition. The scripts save their output to $GITLAB_BASE_DIR/logs/<commit_hash> and, when done, generate a file in $GITLAB_BASE_DIR/tmp/<commit_hash>.
This phase should complete in a couple of seconds: we're just submitting jobs with sbatch and then checking on them in the verify phase.
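A hypothetical sketch of a test job, assuming an omb-refactor.sh batch script and a .done marker convention (the names are illustrative):

# Submit the benchmark batch job to the thor partition; sbatch returns immediately
sbatch --partition=thor \
       --output="$GITLAB_BASE_DIR/logs/$CI_COMMIT_SHA/omb/%j.out" \
       omb-refactor.sh
# Inside the batch script, the last step drops a marker for the verify phase, e.g.:
# touch "$GITLAB_BASE_DIR/tmp/$CI_COMMIT_SHA/omb.done"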
verify
Checks the status of the tests that were submitted to sbatch. The scripts in the test phase should generate a .done file in $GITLAB_BASE_DIR/tmp/<commit_hash>. The verify scripts loop and check for that file. Once it's found, they grep the output from the batch job for errors and log them in $GITLAB_BASE_DIR/logs/<commit_hash>/<test>/.
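A hypothetical sketch of a verify job, assuming the .done marker and log layout described above (the file names are illustrative):

DONE="$GITLAB_BASE_DIR/tmp/$CI_COMMIT_SHA/omb.done"
LOGS="$GITLAB_BASE_DIR/logs/$CI_COMMIT_SHA/omb"
# Wait for the batch job to signal completion
until [ -f "$DONE" ]; do sleep 30; done
# Fail the job if the batch output contains obvious errors
if grep -Ei "error|segmentation fault|abort" "$LOGS"/*.out; then
    echo "OMB tests reported errors"
    exit 1
fi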
clean
Cleans up any extra files, and removes the builds and any generated hostfiles.
Bluefield Run
Building
After cloning, run ./build.sh. This script will run
autogen, configure, make, and make install. If you
open the script and set ARMSRC_DIR to another
clone of this repo in a separate folder (make sure it's
on the same commit), it will launch a parallel build.
Note: Open the script and set LICENSE=0 before
running in order to build without the need for a
license file.
Note 2: Building on ARM takes a long time. Wait
around 20 minutes for it to complete. The host build
will finish much faster. Don't forget that both host
and ARM need to finish before you can run!
The script uses a separate set of configure flags in order to allow both SRC_DIR and ARMSRC_DIR to have the same --prefix. Basically, the ARMSRC_DIR flags build mpicc, mpispawn, proxy_program, etc. with an -arm suffix appended, to distinguish the ARM binaries from the host binaries.
If you see the error "cannot execute binary file", make sure that file ./install/bin/mpispawn reports an x86 executable and that file ./install/bin/proxy_program reports an aarch64 executable. If either is wrong, run either cp proxy_program-arm proxy_program or cp mpispawn-x86 mpispawn. If these files don't exist, rerun make && make install to regenerate them.
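A quick way to perform that check (assuming the install prefix shown above):

file ./install/bin/mpispawn        # should report an x86-64 executable
file ./install/bin/proxy_program   # should report an aarch64 executable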
Environment Setup
On HPCAC, the Thor hosts have two physical HCAs plugged in. One is the BF-2, and the other is a ConnectX-6:
[rgopal@thor011 xsc]$ ibstat
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.30.1004
    Hardware version: 0
    Node GUID: 0x98039b03008553e6
    System image GUID: 0x98039b03008553e6
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 60
        LMC: 0
        SM lid: 9
        Capability mask: 0x2651e848
        Port GUID: 0x98039b03008553e6
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.30.1004
    Hardware version: 0
    Node GUID: 0x98039b03008553e7
    System image GUID: 0x98039b03008553e6
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 41
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x98039b03008553e7
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT41686
    Number of ports: 1
    Firmware version: 24.30.1004
    Hardware version: 0
    Node GUID: 0x043f720300ec7f0e
    System image GUID: 0x043f720300ec7f0e
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 301
        LMC: 0
        SM lid: 9
        Capability mask: 0x2651e848
        Port GUID: 0x043f720300ec7f0e
        Link layer: InfiniBand
CA 'mlx5_3'
    CA type: MT41686
    Number of ports: 1
    Firmware version: 24.30.1004
    Hardware version: 0
    Node GUID: 0x043f720300ec7f0f
    System image GUID: 0x043f720300ec7f0e
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 4
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x043f720300ec7f0f
        Link layer: InfiniBand
On the ARM cores, only the BF-2 is visible:
[rgopal@thor-bf11 ~]$ ibstat
CA 'mlx5_0'
    CA type: MT41686
    Number of ports: 1
    Firmware version: 24.30.1004
    Hardware version: 0
    Node GUID: 0x043f720300ec7f12
    System image GUID: 0x043f720300ec7f0e
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 321
        LMC: 0
        SM lid: 9
        Capability mask: 0x2641e848
        Port GUID: 0x043f720300ec7f12
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT41686
    Number of ports: 1
    Firmware version: 24.30.1004
    Hardware version: 0
    Node GUID: 0x043f720300ec7f13
    System image GUID: 0x043f720300ec7f0e
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 51
        LMC: 0
        SM lid: 1
        Capability mask: 0x2641e848
        Port GUID: 0x043f720300ec7f13
        Link layer: InfiniBand
In order to run the offload, add the following snippet to ~/.bashrc so that the BF-2 is selected on the host (mlx5_2) and on the ARM cores (mlx5_0):

STR=`hostname`
SUB="bf"
if [[ "$STR" == *"$SUB"* ]]; then
    export MV2_IBA_HCA=mlx5_0
else
    export MV2_IBA_HCA=mlx5_2
fi
Running
Create a hostfile as usual. (A hostfile contains the list of hostnames of the nodes to launch the MPI job on.)
Create a file called dpufile and fill it with the individual hostnames of each BlueField. Don't write any WPN information (like thor-bf01:2, or listing thor-bf01 twice), since the launcher will launch 8 WPN automatically. For example, with a SLURM job allocation you can generate a dpufile like this: scontrol show hostnames | grep bf | tee ./dpufile (one way to generate both files is sketched below).
Set MV2_USE_DPU=1 as an environment variable in mpirun_rsh.
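One possible way (an assumption, not the only one) to generate both files from a SLURM allocation, putting the x86 hosts in the hostfile and the BlueFields in the dpufile:

scontrol show hostnames | grep -v bf | tee ./hostfile
scontrol show hostnames | grep bf    | tee ./dpufile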
Full run command example:

./bin/mpirun_rsh -np 128 -hostfile ./hostfile -dpufile ./dpufile MV2_USE_DPU=1 ./libexec/osu-micro-benchmarks/mpi/collective/osu_ialltoall
OMB tests
The OSU Micro-Benchmarks (OMB) suite is developed by NOWLAB and is included in every installation of MVAPICH2 (including MVAPICH2-DPU): as binaries in the install-prefix/libexec/osu-micro-benchmarks folder after building, and as source code in the osu_benchmarks folder. It can also be downloaded as a standalone package here: http://mvapich.cse.ohio-state.edu/benchmarks/.
The benchmark suite has tests for MPI point-to-point operations (sending and receiving between exactly two processes), collectives (communication among groups of processes), RMA (one-sided put and get, i.e. sending without a corresponding receive, or receiving without a corresponding send), and others.
As of this writing, the MVAPICH2-DPU package
supports MPI_Ialltoall, MPI_Ibcast, and
MPI_Iallgather collective offloads.
[Figure omitted: diagram of what these collectives do; each row is a buffer (sendbuf or recvbuf) on a single process.]
Also, it is important to note the I in front of the collective name: it means the collective is nonblocking. To demonstrate this, compare the usage of a blocking alltoall (MPI_Alltoall) with a nonblocking alltoall (MPI_Ialltoall).
MPI_Alltoall:

// Recvbuf empty
MPI_Alltoall(sendbuf, sendcount, sendtype,
             recvbuf, recvcount, recvtype, comm); // May take some time to complete
// Recvbuf guaranteed to be full

MPI_Ialltoall:

MPI_Request request;
// Recvbuf empty
MPI_Ialltoall(sendbuf, sendcount, sendtype,
              recvbuf, recvcount, recvtype, comm, &request); // Returns instantly
// Do another job while the nonblocking alltoall progresses
heavy_computation_which_does_not_depend_on_recvbuf();
// Recvbuf may or may not be filled
MPI_Wait(&request, MPI_STATUS_IGNORE); // May or may not take much time to complete,
                                       // depending on how long the computation took
// Recvbuf guaranteed to be filled
Since MVAPICH2-DPU supports MPI_Ialltoall, MPI_Ibcast, and MPI_Iallgather, we are mainly interested in the output of osu_ialltoall, osu_iallgather, and osu_ibcast. Binaries for these can be found in the install-prefix/libexec/osu-micro-benchmarks/mpi/collective folder.
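As a rough comparison sketch, the same benchmark can be run back to back without and with the DPU offload (the node count and paths follow the Bluefield run example above and are assumptions for your setup):

BIN=./libexec/osu-micro-benchmarks/mpi/collective/osu_ialltoall
./bin/mpirun_rsh -np 128 -hostfile ./hostfile MV2_USE_DPU=0 $BIN
./bin/mpirun_rsh -np 128 -hostfile ./hostfile -dpufile ./dpufile MV2_USE_DPU=1 $BIN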
The following MPI tests are included in the OMB package:

Point-to-Point MPI Benchmarks: latency, multi-threaded latency, multi-pair latency, multiple bandwidth / message rate test, bandwidth, bidirectional bandwidth.

Collective MPI Benchmarks: collective latency tests for various MPI collective operations such as MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_Scatter, MPI_Scatter and vector collectives.

Non-Blocking Collective (NBC) MPI Benchmarks: collective latency and overlap tests for various MPI collective operations such as MPI_Iallgather, MPI_Iallreduce, MPI_Ialltoall, MPI_Ibarrier, MPI_Ibcast, MPI_Igather, MPI_Ireduce, MPI_Iscatter and vector collectives.

One-sided MPI Benchmarks: one-sided put latency, one-sided put bandwidth, one-sided put bidirectional bandwidth, one-sided get latency, one-sided get bandwidth, one-sided accumulate latency, compare and swap latency, fetch and operate and get_accumulate latency for MVAPICH2 (MPI-2 and MPI-3).
omb-refactor.sh runs the OMB tests.
The tests are run on 1, 2, 4, 8, and 16 nodes with full subscription.
Each configuration is run with MV2_USE_DPU=0 and MV2_USE_DPU=1.
All tests are run with the options below:

COMMON="MV2_DEBUG_SHOW_BACKTRACE=2 MV2_ENABLE_AFFINITY=0"

MV2_DEBUG_SHOW_BACKTRACE
Show a backtrace when a process fails on errors like "Segmentation fault", "Bus error", "Illegal instruction", "Abort" or "Floating point exception".
MV2_ENABLE_AFFINITY
Enable CPU affinity by setting MV2_ENABLE_AFFINITY to 1 or disable it by setting MV2_ENABLE_AFFINITY to 0.

MPIEXEC_TIMEOUT
Set this to limit, in seconds, the execution time of the MPI application. This overwrites the MV2_MPIRUN_TIMEOUT parameter.
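For illustration, a collective benchmark might be launched with these options roughly as follows (the node count and binary path are assumptions, not the exact pipeline command):

COMMON="MV2_DEBUG_SHOW_BACKTRACE=2 MV2_ENABLE_AFFINITY=0"
./bin/mpirun_rsh -np 256 -hostfile ./hostfile $COMMON MV2_USE_DPU=0 \
    ./libexec/osu-micro-benchmarks/mpi/collective/osu_iallgather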
Point-to-Point MPI Benchmarks
osu_latency - Latency Test
osu_latency_mt - Multi-threaded Latency Test
osu_latency_mp - Multi-process Latency Test
osu_bw - Bandwidth Test
osu_bibw - Bidirectional Bandwidth Test
osu_mbw_mr - Multiple Bandwidth / Message
Rate Test
osu_multi_lat - Multi-pair Latency Test
Pt2Pt tests are run on 1 node (2 ppn) and on 2 nodes (1 ppn).
Total Tests: (7 tests * 2 scenarios[host/dpu]) = 14
Collective MPI Benchmarks
osu_allgather - MPI_Allgather Latency Test
osu_allgatherv - MPI_Allgatherv Latency Test
osu_allreduce - MPI_Allreduce Latency Test
osu_alltoall - MPI_Alltoall Latency Test
osu_alltoallv - MPI_Alltoallv Latency Test
osu_barrier - MPI_Barrier Latency Test
osu_bcast - MPI_Bcast Latency Test
osu_gather - MPI_Gather Latency Test
osu_gatherv - MPI_Gatherv Latency Test
osu_reduce - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter
Latency Test
osu_scatter - MPI_Scatter Latency Test
osu_scatterv - MPI_Scatterv Latency Test
Non-Blocking Collective (NBC) MPI Benchmarks
osu_iallgather - MPI_Iallgather Latency Test
osu_iallgatherv - MPI_Iallgatherv Latency Test
osu_iallreduce - MPI_Iallreduce Latency Test
osu_ialltoall - MPI_Ialltoall Latency Test
osu_ialltoallv - MPI_Ialltoallv Latency Test
osu_ialltoallw - MPI_Ialltoallw Latency Test
osu_ibarrier - MPI_Ibarrier Latency Test
osu_ibcast - MPI_Ibcast Latency Test
osu_igather - MPI_Igather Latency Test
osu_igatherv - MPI_Igatherv Latency Test
osu_ireduce - MPI_Ireduce Latency Test
osu_iscatter - MPI_Iscatter Latency Test
osu_iscatterv - MPI_Iscatterv Latency Test
Collective tests are run on 1, 2, 4, 8, and 16 nodes with full subscription (16 ppn).
Total Tests: (26 tests * 2 scenarios[host/dpu]) - 1 = 51
Note: ialltoall is a time-consuming test, hence it is run with a maximum message size of 32 KB; all the rest are run with the default maximum message size.
One-sided MPI Benchmarks
osu_put_latency - Latency Test for Put with
Active/Passive Synchronization
osu_get_latency - Latency Test for Get with
Active/Passive Synchronization
osu_put_bw - Bandwidth Test for Put with
Active/Passive Synchronization
osu_get_bw - Bandwidth Test for Get with
Active/Passive Synchronization
osu_put_bibw - Bi-directional Bandwidth Test
for Put with Active Synchronization
osu_acc_latency - Latency Test for Accumulate
with Active/Passive Synchronization
osu_cas_latency - Latency Test for Compare and
Swap with Active/Passive Synchronization
osu_fop_latency - Latency Test for Fetch and Op with Active/Passive Synchronization
osu_get_acc_latency - Latency Test for Get_accumulate with Active/Passive Synchronization
RMA tests are run on one or two nodes, with ppn = 2 or ppn = 1 respectively.
Total Tests: (9 tests * 2 scenarios[host/dpu]) = 18
IMB Tests
The objectives of the Intel® MPI Benchmarks are:
• Provide a concise set of benchmarks targeted at measuring the most important MPI functions.
• Set forth a precise benchmark methodology.
• Report bare timings rather than provide interpretation of the measured results. Show throughput values if and only if these values are well-defined.
Intel® MPI Benchmarks is developed using ANSI C plus standard MPI.

Intel® MPI Benchmarks performs a set of performance measurements for point-to-point and global MPI communication operations for a range of message sizes. The generated benchmark data fully characterizes:
• performance of a cluster system, including node performance, network latency, and throughput
• efficiency of the MPI implementation used
The Intel® MPI Benchmarks package consists of the following components:
• IMB-MPI1 - benchmarks for MPI-1 functions.
• Two components for MPI-2 functionality:
  • IMB-EXT - one-sided communications benchmarks.
  • IMB-IO - input/output (I/O) benchmarks.
• Two components for MPI-3 functionality:
  • IMB-NBC - benchmarks for nonblocking collective (NBC) operations.
  • IMB-RMA - one-sided communications benchmarks. These benchmarks measure the Remote Memory Access (RMA) functionality introduced in the MPI-3 standard.

Each component constitutes a separate executable file. You can run all of the supported benchmarks, or specify a single executable file in the command line to get results for a specific subset of benchmarks.
imb-refactor.sh runs IMB tests.
On a single node, IMB tests are run with full subscription, and on 16 nodes with 16 ppn (256 processes). These runs use MV2_USE_DPU=0 and MV2_USE_DPU=1.
The tests are then repeated on 2 nodes and 4 nodes. For two and four nodes, the maximum message size is limited to 64 KB and the number of iterations is kept at 500. Here too, tests are run with MV2_USE_DPU=0 and MV2_USE_DPU=1.
The number of iterations for all the tests is set to 500.
Tests are run with the "multi" option set to 1 and without it; this option defines whether the benchmark runs in multiple mode or not.
All tests are run with the options below:

COMMON="MV2_DEBUG_SHOW_BACKTRACE=2 MV2_ENABLE_AFFINITY=0 MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING=1 MPIEXEC_TIMEOUT=300"
MV2_DEBUG_SHOW_BACKTRACE
Show a backtrace when a process fails on errors like "Segmentation fault", "Bus error", "Illegal instruction", "Abort" or "Floating point exception".

MV2_ENABLE_AFFINITY
Enable CPU affinity by setting MV2_ENABLE_AFFINITY to 1 or disable it by setting MV2_ENABLE_AFFINITY to 0.

MPIEXEC_TIMEOUT
Set this to limit, in seconds, the execution time of the MPI application. This overwrites the MV2_MPIRUN_TIMEOUT parameter.
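For illustration, an IMB-NBC run in multi mode with these options might look roughly like this (the binary location under tests/ and the node count are assumptions):

COMMON="MV2_DEBUG_SHOW_BACKTRACE=2 MV2_ENABLE_AFFINITY=0 MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING=1 MPIEXEC_TIMEOUT=300"
./bin/mpirun_rsh -np 256 -hostfile ./hostfile -dpufile ./dpufile $COMMON MV2_USE_DPU=1 \
    ./tests/imb/IMB-NBC -multi 1 -iter 500 Ialltoall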
The following table lists all IMB-NBC benchmarks:
Total Tests: (19 tests * 2 scenarios[host/dpu] * 2 modes[multi/non-multi]) = 76
IMB-RMA Benchmarks
The table below lists all IMB-RMA benchmarks.
Total Tests: (19 tests * 2 scenarios[host/dpu] * 2 modes[multi/non-multi]) = 76
MPICH Tests
The MVAPICH2 MPI library (by the Network-Based Computing Laboratory at The Ohio State University) is a derivative of MPICH (by Argonne National Laboratory). Some tests originally from MPICH can be found in the ./test folder of a fresh clone of the MVAPICH2 code. There are folders with tests for multiple parts of the code: pt2pt, collectives, rma, etc.
Each folder has a testlist file that can be read by a script to know which tests to run (see the sketch below). Since the MVAPICH2-DPU library at this time only supports DPU-based collectives, we are mainly interested in passing the tests within the ./test/mpi/coll folder.
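A hypothetical sketch of driving the collective tests from the testlist file (testlist lines are of the form "<test> <nprocs> [options]"; mpirun_rsh on PATH and a prepared hostfile are assumptions):

HOSTFILE=$PWD/hostfile
cd ./test/mpi/coll
while read -r name np rest; do
    # Skip blank lines and comment lines
    [ -z "$name" ] && continue
    case "$name" in \#*) continue ;; esac
    echo "== $name (np=${np:-2}) =="
    mpirun_rsh -np "${np:-2}" -hostfile "$HOSTFILE" MV2_USE_DPU=1 "./$name"
done < testlist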
MPICH collective tests
Total Tests: (169 tests * 2 scenarios[host/dpu]) = 338
There are many dense tests here, such as "alltoall", "bcasttest", "redscat", "red_scat_block", "gather_big", "opprod", "nbicbcast", "nbicallreduce", and "nbic".
MPICH comm, pt2pt, and rma tests are yet to be investigated.
NAS Tests
The NAS Parallel Benchmarks (NPB) are a small set
of programs designed to help evaluate the
performance of parallel supercomputers. The
benchmarks are derived from computational fluid
dynamics (CFD) applications and consist of five
kernels and three pseudo-applications in the original
"pencil-and-paper" specification (NPB 1). The
benchmark suite has been extended to include new
benchmarks for unstructured adaptive meshes,
parallel I/O, multi-zone applications, and
computational grids. Problem sizes in NPB are
predefined and indicated as different classes.
Reference implementations of NPB are available in
commonly-used programming models like MPI and
OpenMP (NPB 2 and NPB 3).
NAS has 9 benchmarks. The following are the NAS benchmarks:
"cg.B.x", "ep.B.x", "ft.B.x", "is.B.x", "lu.B.x", "mg.B.x", "sp.B.x", "bt.B.x.ep_io", "bt.B.x.mpi_io_full"
All benchmarks run successfully on 2 nodes with 2 ppn.
They are yet to be scaled up to 4, 8, and 16 nodes with full subscription.
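As a rough example, a single NAS benchmark matching the 2-node, 2-ppn setup could be launched like this (the path to the NPB binaries is an assumption):

./bin/mpirun_rsh -np 4 -hostfile ./hostfile ./NPB/bin/cg.B.x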
References
1. OMB user guide
2. IMB user guide
3. Wiki
End of Document