uCluster (micro-Cluster) is a toy computer cluster composed of 3 Raspberry Pi boards, 2 NVIDIA Jetson Nano boards and 1 NVIDIA Jetson TX2 board.
The presentation shows how to build the uCluster and focuses on a few interesting technologies worth considering when building a cluster at any scale.
The project is for educational purposes and tinkering with various technologies.
6. Jetson family Specs
Jetson Nano
• AI Performance: 472 GFLOPS (FP16)
• GPU: 128-core NVIDIA Maxwell
• CPU: Quad-core ARM Cortex-A57
• Memory: 4 GB 64-bit LPDDR4, 25.6 GB/s
• Storage: 16 GB eMMC 5.1
• Power: 5W / 10W
• Networking: 10/100/1000 BASE-T Ethernet
Jetson TX2 Series (TX2 4GB / TX2 / TX2i)
• AI Performance: 1.33 TFLOPS (FP16); TX2i: 1.26 TFLOPS (FP16)
• GPU: 256-core NVIDIA Pascal
• CPU: Dual-core NVIDIA Denver 2 64-bit + Quad-core ARM Cortex-A57
• Memory: TX2 4GB: 4 GB 128-bit LPDDR4, 51.2 GB/s; TX2: 8 GB 128-bit LPDDR4, 59.7 GB/s; TX2i: 8 GB 128-bit LPDDR4 (ECC support), 51.2 GB/s
• Storage: TX2 4GB: 16 GB eMMC 5.1; TX2 / TX2i: 32 GB eMMC 5.1
• Power: 7.5W / 15W (TX2i: 10W / 20W)
• Networking: 10/100/1000 BASE-T Ethernet, WLAN
Jetson Xavier NX
• AI Performance: 21 TOPS (INT8)
• GPU: 384-core NVIDIA Volta with 48 Tensor Cores
• CPU: 6-core NVIDIA Carmel ARM v8.2 64-bit, 6 MB L2 + 4 MB L3
• Memory: 8 GB 128-bit LPDDR4x, 51.2 GB/s
• Storage: 16 GB eMMC 5.1
• Power: 10W / 15W
• Networking: 10/100/1000 BASE-T Ethernet
Jetson AGX Xavier Series
• AI Performance: 32 TOPS (INT8)
• GPU: 512-core NVIDIA Volta with 64 Tensor Cores
• CPU: 8-core NVIDIA Carmel ARM v8.2 64-bit, 8 MB L2 + 4 MB L3
• Memory: 32 GB 256-bit LPDDR4x, 136.5 GB/s
• Storage: 32 GB eMMC 5.1
• Power: 10W / 15W / 30W
• Networking: 10/100/1000 BASE-T Ethernet
Tesla P100-PCIE-16GB (for comparison): 16 GB RAM, 3584 CUDA cores
https://developer.nvidia.com/embedded/jetson-modules
TOPS: Tera-Operations per Second
7. Raspberry Pi 3 Model B
• Quad-core 1.2 GHz Broadcom BCM2837 64-bit CPU
• 1 GB RAM
• BCM43438 wireless LAN and Bluetooth Low Energy (BLE) on board
• 100 Base-T Ethernet
• Micro SD port for loading the operating system and storing data
μCluster components
https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/
8. Raspberry Pi, latest version (Raspberry Pi 4 Model B)
https://www.raspberrypi.org/products/raspberry-pi-4-model-b/
11. Steps to build the μCluster
1. Install the OS (on just one board)
2. Install the software (on just one board)
3. Copy the image (OS + software) to all the other boards
4. Set up static IPs
5. Set up NFS (Network File System)
6. Run your MPI applications using a hostfile (a minimal test program is sketched below)
mpirun -np # --hostfile myHostFile ./executable $(arguments)
j01 slots=4 max-slots=4 # Jetson TX2
j02 slots=4 max-slots=4 # Jetson Nano
j03 slots=4 max-slots=4 # Jetson Nano
j04 slots=4 max-slots=4 # Raspberry Pi
j05 slots=4 max-slots=4 # Raspberry Pi
j06 slots=4 max-slots=4 # Raspberry Pi
myHostFile lists the hostnames/IPs of the available infrastructure (the hostname-to-IP mapping lives in /etc/hosts, and each node's own name in /etc/hostname).
https://selkie-macalester.org/csinparallel/modules/RosieCluster/build/html/#
SLURM (a resource manager) can be used as an alternative.
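As a sanity check of the hostfile and the MPI installation, a minimal program like the following can be run across all boards (a sketch; the file name and the -np value are illustrative and not part of the original slides):

/* hello_mpi.c
 * Build: mpicc hello_mpi.c -o hello_mpi
 * Run:   mpirun -np 24 --hostfile myHostFile ./hello_mpi
 * (24 = 6 nodes x 4 slots, matching myHostFile above) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */
    MPI_Get_processor_name(host, &len);    /* node name, e.g. j01 .. j06 */

    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}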
12. NFS vs parallel FS
[Figure: with NFS, all compute nodes (n0, n1, …, nm) go through a single storage server, which becomes a hotspot; NOT SCALABLE. With a parallel file system (PFS), the compute nodes access multiple storage targets in parallel, so there is no stress on the network; SCALABLE.]
Experiences on File Systems: Which is the best file system for you? Jakob Blomer, CERN PH/SFT, 2015
https://developer.ibm.com/tutorials/l-network-filesystems/
13. parallel FS
• Metadata services store namespace metadata, such as filenames, directories, access permissions, and file layout.
• Object storage contains the actual file data.
• Clients pull the location of files and directories from the metadata services, then access the file storage directly (a toy sketch follows below).
Popular PFS:
Lustre
BeeGFS
GlusterFS
https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/
https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223
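To make that access pattern concrete, here is a toy C sketch (purely illustrative; the struct and function names are invented for this example and do not belong to any real PFS API). The client first asks the metadata service for the file layout, then reads each stripe directly from the object-storage server that holds it.

/* pfs_toy.c - toy model of the PFS access pattern described above */
#include <stdio.h>

#define STRIPES 4

struct layout {                 /* what the metadata service returns */
    int object_server[STRIPES]; /* which storage target holds each stripe */
    long stripe_size;
};

/* Step 1: metadata lookup (names, permissions, layout) */
struct layout metadata_lookup(const char *path) {
    struct layout l = { {0, 1, 2, 3}, 1 << 20 };
    printf("metadata service: layout for %s\n", path);
    return l;
}

/* Step 2: read one stripe straight from an object-storage server */
void read_stripe(int server, int stripe, long size) {
    printf("client -> object server %d: read stripe %d (%ld bytes) directly\n",
           server, stripe, size);
}

int main(void) {
    struct layout l = metadata_lookup("/scratch/results.dat");
    for (int s = 0; s < STRIPES; ++s)
        read_stripe(l.object_server[s], s, l.stripe_size);
    return 0;
}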
14. parallel FS
Lustre
[Figure: Lustre architecture, with separate metadata services and object storage; clients reach the object storage targets directly, so there is no stress on the network.]
IOPS: I/O operations per second
https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/
https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223
15. μCluster benchmarks: Palabos-npFEM
Red blood cells (red bodies): 272
Platelets (yellow bodies): 27
Reference case study: hematocrit 20%, box 50x50x50 μm3
https://gitlab.com/unigespc/palabos.git (coupledSimulators/npFEM)
25. Interconnect/Network Technologies
Absolute priorities: high bandwidth and low latency, but not only.
RDMA, combined with smartNICs/FPGAs, tries to take the stress off the CPUs.
RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access over an Ethernet network. It does this by encapsulating an IB transport packet over Ethernet.
https://prace-ri.eu/training-support/best-practice-guides/
Ethernet uses a hierarchical topology, which demands more computing power from the CPU, in contrast to the flat fabric topology of InfiniBand (IB), where the data is moved directly by the network card using RDMA requests, reducing CPU involvement (a minimal memory-registration sketch follows below).
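As a hedged illustration of the RDMA model (not part of the original slides), the libibverbs snippet below registers a buffer so that a remote peer could read or write it directly with one-sided RDMA operations. Queue-pair setup and the exchange of the buffer address and rkey with the peer are omitted for brevity; only the registration step is shown.

/* rdma_reg.c - build with: gcc rdma_reg.c -o rdma_reg -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */

    size_t len = 1 << 20;
    void *buf = malloc(len);
    /* Pin the buffer and allow the remote side to read/write it directly,
       without involving this node's CPU. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* The peer needs this rkey and the buffer address to issue RDMA ops. */
    printf("registered %zu bytes: addr=%p rkey=0x%x\n", len, buf, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}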
26. Interconnect/Network Technologies
GPU-aware MPI - NVIDIA GPUDirect (family of technologies)
GPU-aware MPI implementations can automatically handle MPI transactions with pointers to GPU memory (see the sketch below).
• GPUDirect Peer to Peer among GPUs on the same node (native support through drivers & CUDA toolkit, over PCIe or NVLink)
• GPUDirect RDMA among GPUs on different nodes (requires specialized hardware with RDMA support)
• Supported by MVAPICH, OpenMPI, Cray MPI, ...
How it works (@CSCS)
▪ Set the environment variable: export MPICH_RDMA_ENABLED_CUDA=1
▪ Each pointer passed to MPI is checked to see if it is in host or device memory. If the variable is not set, MPI assumes that all pointers point at host memory, and your application will probably crash with segfaults.
https://www.cscs.ch/
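A minimal sketch of what this looks like in practice, assuming a CUDA-aware MPI build (the file name, buffer size, and ranks are illustrative): the device pointer returned by cudaMalloc is handed straight to MPI_Send/MPI_Recv, and the library moves the data via GPUDirect P2P/RDMA or internal staging.

/* gpu_pingpong.c - build (paths are installation dependent):
 *   mpicc gpu_pingpong.c -o gpu_pingpong -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                       /* 1M floats */
    float *d_buf;                                /* device buffer */
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    /* With a CUDA-aware MPI, the device pointer is passed directly;
       the library uses GPUDirect P2P/RDMA (or staging) under the hood. */
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

With a non-GPU-aware MPI, the same data would first have to be copied to host buffers with cudaMemcpy before and after each MPI call.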
28. Interconnect/Network Technologies
Route to Exascale: Network Acceleration
sPIN: High-performance streaming Processing in the Network
Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, Ron Brightwell
• Today's network cards contain rather powerful processors optimized for data movement
• Offload simple packet-processing functions to the network card
• A portable packet-processing network acceleration model, similar to compute acceleration with CUDA or OpenCL
https://spcl.inf.ethz.ch/ (slide from Torsten Hoefler)
29. Interconnect/Network Technologies
Route to Exascale & HPC network topologies
https://prace-ri.eu/wp-content/uploads/Best-Practice-Guide_Modern-Interconnects.pdf
Fat trees
Torus
Dragonfly
Good luck with scalability on exascale machines without smart components/accelerators at the hotspots (compute, network, storage).
33. GPU Architecture: Tensor Cores
NVIDIA Ampere: Tensor Core precisions: FP64, TF32, bfloat16, FP16, INT8, INT4, INT1; CUDA Core precisions: FP64, FP32, FP16, bfloat16, INT8
NVIDIA Turing: Tensor Core precisions: FP16, INT8, INT4, INT1; CUDA Core precisions: FP64, FP32, FP16, INT8
NVIDIA Volta: Tensor Core precisions: FP16; CUDA Core precisions: FP64, FP32, FP16, INT8
Each Tensor Core operates on 4x4 matrices and performs the operation D = A × B + C, where A and B are FP16 matrices and the accumulators C and D can be FP16 or FP32.
In Volta, each Tensor Core performs 64 floating-point FMA* operations per clock, and the eight Tensor Cores in an SM perform a total of 512 FMA operations per clock (a rough peak-throughput calculation is sketched below).
*FMA: Fused Multiply-Accumulate
https://www.nvidia.com/
Vectorization like in CPUs
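A back-of-the-envelope check of those per-SM figures, assuming a Tesla V100 with 80 SMs and a boost clock of roughly 1.53 GHz (values assumed here, not stated in the slides):

/* tensor_peak.c - rough peak FP16 Tensor Core throughput of a V100 */
#include <stdio.h>

int main(void) {
    double sms = 80.0;               /* assumed SM count (V100) */
    double tc_per_sm = 8.0;          /* Tensor Cores per SM (from the slide) */
    double fma_per_tc_per_clk = 64.0;/* FMAs per Tensor Core per clock */
    double flops_per_fma = 2.0;      /* one multiply + one add */
    double clock_ghz = 1.53;         /* assumed boost clock */

    double tflops = sms * tc_per_sm * fma_per_tc_per_clk
                    * flops_per_fma * clock_ghz / 1000.0;
    printf("Peak FP16 Tensor throughput: ~%.0f TFLOPS\n", tflops); /* ~125 */
    return 0;
}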
34. GPU Architecture: Tensor Cores
How to use/activate them:
▪ The code simply needs to use a flag to tell the API and drivers that you want to use Tensor Cores, the data type needs to be one supported by the cores, and the dimensions of the matrices need to be multiples of 8. After that, the hardware handles everything else.
▪ The easiest way is through cuBLAS or other NVIDIA HPC libraries (see the example on the next slide).
▪ For now, avoid the low-level API for accessing them directly (it is as ugly as CPU vectorization).
Multiples of 8 for Tensor Cores & multiples of 32 for vanilla CUDA cores (warp scheduling)
35. GPU Architecture: Tensor Cores (cuBLAS example)
// First, create a cuBLAS handle:
cublasHandle_t handle;
cublasStatus_t cublasStat = cublasCreate(&handle);
// Set the math mode to allow cuBLAS to use Tensor Cores:
cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
// Allocate and initialize your matrices (only the A matrix is shown):
size_t matrixSizeA = (size_t)rowsA * colsA;
T_ELEM_IN *devPtrA = 0;                                          // device copy of A
cudaMalloc((void**)&devPtrA, matrixSizeA * sizeof(devPtrA[0]));
T_ELEM_IN *A = (T_ELEM_IN *)malloc(matrixSizeA * sizeof(A[0]));  // host copy of A
memset(A, 0xFF, matrixSizeA * sizeof(A[0]));
cublasStatus_t status1 = cublasSetMatrix(rowsA, colsA, sizeof(A[0]), A, rowsA, devPtrA, rowsA);
// ... allocate and initialize B and C matrices into devPtrB / devPtrC (not shown) ...
// Invoke the GEMM, ensuring k, lda, ldb, and ldc are all multiples of 8,
// and m is a multiple of 4:
cublasStat = cublasGemmEx(handle, transa, transb, m, n, k, alpha,
                          devPtrA, CUDA_R_16F, lda,
                          devPtrB, CUDA_R_16F, ldb,
                          beta, devPtrC, CUDA_R_16F, ldc, CUDA_R_32F, algo);
A few rules:
• The routine must be a GEMM; currently, only GEMMs support Tensor Core execution.
• GEMMs that do not satisfy the rules will fall back to a non-Tensor Core implementation.
https://www.nvidia.com/
https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/