Kotsalos Christos
Scientific and Parallel Computing Group (SPC)
μCluster +
Supercomputing/Cluster Technologies
μCluster
• 2 NVIDIA Jetson Nano boards
• 3 Raspberry Pi boards
• 1 NVIDIA Jetson TX2 board
• 1 network switch (16 ports, 1 Gbit/s)
• Cat. 6 network cables (1 Gbit/s)
GPU: 128-core NVIDIA Maxwell (CUDA cores)
CPU: Quad-core ARM A57 @ 1.43 GHz
Memory: 4 GB 64-bit LPDDR4, 25.6 GB/s
Storage: microSD
Connectivity: Gigabit Ethernet
Display: HDMI and DisplayPort
USB: 4x USB 3.0, USB 2.0 Micro-B
μCluster components
https://developer.nvidia.com/embedded/jetson-nano-developer-kit
Technical Specifications
JETSON TX2 MODULE
• NVIDIA Pascal Architecture GPU
• 2 Denver 64-bit CPUs + Quad-Core A57
• 8 GB 128-bit LPDDR4 Memory
• 32 GB eMMC 5.1 Flash Storage
• Connectivity to 802.11ac Wi-Fi and Bluetooth-Enabled Devices
• 10/100/1000BASE-T Ethernet
I/O
• USB 3.0 Type A
• USB 2.0 Micro AB
• HDMI
• Gigabit Ethernet
• Full-Size SD
• SATA Data and Power
POWER OPTIONS
• External 19V AC Adapter
μCluster components
https://developer.nvidia.com/embedded/jetson-tx2-developer-kit
Jetson family
Developer Kits & Modules
Marketed for AI apps (Tensor cores)
https://developer.nvidia.com/embedded/jetson-modules
Jetson Nano
AI Performance: 472 GFLOPs (FP16)
GPU: 128-core NVIDIA Maxwell GPU
CPU: Quad-Core ARM Cortex-A57
Memory: 4 GB 64-bit LPDDR4, 25.6 GB/s
Storage: 16 GB eMMC 5.1
Power: 5W / 10W
Networking: 10/100/1000 BASE-T Ethernet

Jetson TX2 Series (TX2 4GB / TX2 / TX2i)
AI Performance: 1.33 TFLOPs (FP16) for TX2 4GB and TX2; 1.26 TFLOPs (FP16) for TX2i
GPU: 256-core NVIDIA Pascal GPU
CPU: Dual-Core NVIDIA Denver 1.5 64-Bit CPU and Quad-Core ARM Cortex-A57
Memory: 4 GB 128-bit LPDDR4, 51.2 GB/s (TX2 4GB); 8 GB 128-bit LPDDR4, 59.7 GB/s (TX2); 8 GB 128-bit LPDDR4 with ECC support, 51.2 GB/s (TX2i)
Storage: 16 GB eMMC 5.1 (TX2 4GB); 32 GB eMMC 5.1 (TX2, TX2i)
Power: 7.5W / 15W (TX2 4GB, TX2); 10W / 20W (TX2i)
Networking: 10/100/1000 BASE-T Ethernet, WLAN

Jetson Xavier NX
AI Performance: 21 TOPs (INT8)
GPU: 384-core NVIDIA Volta GPU with 48 Tensor Cores
CPU: 6-core NVIDIA Carmel ARM v8.2 64-bit CPU, 6MB L2 + 4MB L3
Memory: 8 GB 128-bit LPDDR4x, 51.2 GB/s
Storage: 16 GB eMMC 5.1
Power: 10W / 15W
Networking: 10/100/1000 BASE-T Ethernet

Jetson AGX Xavier Series
AI Performance: 32 TOPs (INT8)
GPU: 512-core NVIDIA Volta GPU with 64 Tensor Cores
CPU: 8-core NVIDIA Carmel Arm v8.2 64-bit CPU, 8MB L2 + 4MB L3
Memory: 32 GB 256-bit LPDDR4x, 136.5 GB/s
Storage: 32 GB eMMC 5.1
Power: 10W / 15W / 30W
Networking: 10/100/1000 BASE-T Ethernet
Tesla P100-PCIE-16GB: 16GB RAM, 3584 CUDA Cores
Jetson family Specs
https://developer.nvidia.com/embedded/jetson-modules
TOPS: Tera-Operations per Second
Raspberry Pi 3 Model B
• Quad Core 1.2GHz Broadcom BCM2837 64bit CPU
• 1GB RAM
• BCM43438 wireless LAN and Bluetooth Low Energy (BLE) on board
• 100 Base Ethernet
• Micro SD port for loading your operating system and storing data
μCluster components
https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/
Raspberry Pi latest version
https://www.raspberrypi.org/products/raspberry-pi-4-model-b/
Turing Pi (alternative path)
https://turingpi.com/v1/
Raspberry Pi
Compute Module
Turing Pi (alternative path)
https://turingpi.com/v1/
Steps to build the μCluster
1. Install OS (in just one board)
2. Install software (in just one board)
3. Copy the image (OS+software) to all the other boards
4. Set up static IPs
5. Set up NFS (Network File System)
6. Run your MPI applications using a hostfile
mpirun -np # --hostfile myHostFile ./executable $(arguments)
j01 slots=4 max-slots=4 # Jetson TX2
j02 slots=4 max-slots=4 # Jetson Nano
j03 slots=4 max-slots=4 # Jetson Nano
j04 slots=4 max-slots=4 # Raspberry Pi
j05 slots=4 max-slots=4 # Raspberry Pi
j06 slots=4 max-slots=4 # Raspberry Pi
myHostFile
IPs/hostnames of the available boards
(hostname-to-IP mappings are kept in /etc/hosts; each board's name is set in /etc/hostname)
A minimal MPI test program to verify the setup is sketched below.
https://selkie-macalester.org/csinparallel/modules/RosieCluster/build/html/#
SLURM as an alternative (Resource Manager)
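As a quick sanity check (a generic MPI "hello" sketch, not part of the original material), each rank can report the board it runs on, confirming that mpirun really spreads processes over the hosts listed in myHostFile:

// Build with: mpicc hello.c -o hello
// Run with:   mpirun -np 24 --hostfile myHostFile ./hello
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, name_len;
    char node[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(node, &name_len);   // hostname of the board running this rank

    printf("rank %d of %d running on %s\n", rank, size, node);

    MPI_Finalize();
    return 0;
}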
NFS vs parallel FS
[Diagram: NFS — all compute nodes (n0, n1, …, nm) go through a single storage server, which becomes a hotspot; NOT SCALABLE. PFS — the nodes are served by many storage servers (…, nk, …); SCALABLE, no stress for the network.]
Experiences on File Systems Which is the best file system for you? Jakob Blomer CERN PH/SFT 2015
https://developer.ibm.com/tutorials/l-network-filesystems/
parallel FS
Metadata services store namespace
metadata, such as filenames, directories,
access permissions, and file layout
Object storage contains actual file data.
Clients pull the location of files and
directories from the metadata services,
then access file storage directly.
Popular PFS:
Lustre
BeeGFS
GlusterFS
https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/
https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223
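A conceptual sketch of this two-step access pattern (mocked types and functions, not a real PFS client API): the client asks the metadata service only for the file layout, then streams the bulk data straight from the object servers, so the metadata nodes never become bandwidth hotspots.

#include <stdio.h>

/* Hypothetical layout record returned by the metadata service. */
typedef struct {
    const char *path;
    int object_server;   /* which object storage server holds the data */
} FileLayout;

/* Step 1 (mock): ask the metadata service where the file lives. */
static FileLayout metadata_lookup(const char *path)
{
    FileLayout layout = { path, path[0] % 4 };   /* toy placement rule */
    return layout;
}

/* Step 2 (mock): read the data directly from the object server,
 * bypassing the metadata service entirely. */
static void object_read(const FileLayout *layout)
{
    printf("reading %s directly from object server %d\n",
           layout->path, layout->object_server);
}

int main(void)
{
    FileLayout layout = metadata_lookup("/scratch/sim/output.dat");
    object_read(&layout);
    return 0;
}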
parallel FS
Lustre
[Figure: Lustre architecture — clients talk to the metadata services for namespace operations and to the object storage for file data; IOPs: I/O operations per second]
https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/
https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223
No Stress for the Network!
μCluster benchmarks:
Palabos-npFEM
Reference case study:
• Red blood cells (red bodies): 272
• Platelets (yellow bodies): 27
• Hematocrit 20%, box 50×50×50 μm³
https://gitlab.com/unigespc/palabos.git (coupledSimulators/npFEM)
μCluster benchmarks:
Palabos-npFEM
[Plot: performance comparison, factors of x6.5 and x8.1]
Promising alternative considering:
• Energy consumption
• Low cost of components
μCluster benchmarks:
Palabos-npFEM
[Plot: intra-node communication vs. 2 nodes over Gbit Ethernet]
μCluster benchmarks:
Network Performance
OSU Micro-Benchmarks
Testing 2 nodes
osu_get_bw:
Bandwidth Test
MPI_Isend & MPI_Irecv
μCluster benchmarks:
Network Performance
OSU Micro-Benchmarks
Testing 2 nodes
osu_get_latency:
Latency Test
MPI_Send & MPI_Recv
ping-pong test
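The ping-pong pattern behind these tests can be reproduced in a few lines of MPI (a minimal sketch, not the actual OSU code): rank 0 sends a message to rank 1, which bounces it back, and half the averaged round-trip time approximates the one-way latency for that message size.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int msg_size = 1 << 10;   /* 1 KiB; vary this to sweep message sizes */
    const int iters = 1000;
    char *buf = malloc(msg_size);
    memset(buf, 0, msg_size);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {            /* ping */
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* pong */
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d B messages: one-way latency ~ %.2f us\n",
               msg_size, (t1 - t0) / (2.0 * iters) * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

To measure the Gbit Ethernet link rather than shared memory, make sure the two ranks land on different boards (e.g. a hostfile with one slot per node, or Open MPI's --map-by node).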
μCluster benchmarks: Network Performance mpiGraph
[Figure: mpiGraph bandwidth maps for Piz Daint, Baobab, and the μCluster]
[Plot: bandwidth vs. packet size]
μCluster benchmarks:
Network Performance
OSU Micro-Benchmarks
https://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/OSU_MicroBenchmarks/
Importance of:
• MPI implementation
• Network capabilities
Interconnect/Network Technologies
InfiniBand
Ethernet
Omni-Path
TOP500 June 2020
https://www.top500.org/
InfiniBand generations — theoretical effective throughput (Gbit/s) and adapter latency:

Generation | Year | 1 link | 4 links | 8 links | 12 links | Adapter latency (µs)
SDR | 2001/2003 | 2 | 8 | 16 | 24 | 5
DDR | 2005 | 4 | 16 | 32 | 48 | 2.5
QDR | 2007 | 8 | 32 | 64 | 96 | 1.3
FDR10 | 2011 | 10 | 40 | 80 | 120 | 0.7
FDR | 2011 | 13.64 | 54.54 | 109.08 | 163.64 | 0.7
EDR | 2014 | 25 | 100 | 200 | 300 | 0.5
HDR | 2017 | 50 | 200 | 400 | 600 | less?
NDR | after 2020 | 100 | 400 | 800 | 1200 | t.b.d.
XDR | after 2023? | 250 | 1000 | 2000 | 3000 | t.b.d.
Interconnect/Network Technologies
InfiniBand (IB)
Omni-Path (Intel)
Aries (Cray)
copper or fiber optic wires
Each interconnect technology
comes with its own hardware:
• NIC: Network Interface Card
• Switch: Create Subnets
• Router: Link Subnets
• Bridge/Gateway: Bridge different
networks (IB & Ethernet)
Ethernet
https://en.wikipedia.org/wiki/InfiniBand
Interconnect/Network Technologies
NVLink (NVIDIA)
GPU-to-GPU data transfers at up to 160 Gbytes/s of bidirectional bandwidth, 5x the bandwidth of PCIe
https://prace-ri.eu/training-support/best-practice-guides/
Interconnect/Network Technologies
Absolute Priorities
• High Bandwidth
• Low Latency
...but not only: RDMA, combined with smartNICs/FPGAs (offloading work to reduce the stress on the CPUs)
RDMA over Converged Ethernet (RoCE) is a
network protocol that allows remote direct
memory access over an Ethernet network. It
does this by encapsulating an IB transport
packet over Ethernet.
https://prace-ri.eu/training-support/best-practice-guides/
Ethernet uses a hierarchical topology that demands more computing power from the CPU,
in contrast to the flat fabric topology of IB, where the data is moved directly by the
network card via RDMA requests, reducing CPU involvement.
Interconnect/Network Technologies
GPU-aware MPI - NVIDIA GPUDirect (family of technologies)
GPU-aware MPI implementations can automatically handle MPI transactions with pointers to GPU memory
• GPUDirect Peer-to-Peer: among GPUs on the same node (native support through the drivers & CUDA toolkit,
over PCIe or NVLink)
• GPUDirect RDMA: among GPUs on different nodes (requires specialized hardware with RDMA support)
• MVAPICH, OpenMPI, Cray MPI, ...
How it works (@CSCS)
▪ Set the environment variable: export MPICH_RDMA_ENABLED_CUDA=1
▪ Each pointer passed to MPI is checked to see whether it points to host or device memory. If the variable is
not set, MPI assumes that all pointers point to host memory, and your application will probably crash with seg faults.
https://www.cscs.ch/
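A minimal sketch (generic, not CSCS-specific) of what GPU-aware MPI enables: device pointers obtained from cudaMalloc are handed to MPI directly, and the library moves the data via GPUDirect P2P/RDMA when the hardware supports it. It assumes a CUDA-aware MPI build (e.g. Cray MPI with MPICH_RDMA_ENABLED_CUDA=1, or OpenMPI/MVAPICH2 built with CUDA support).

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    double *d_buf;                                   /* device memory, not host */
    cudaMalloc((void **)&d_buf, N * sizeof(double));

    /* With a GPU-aware MPI, the device pointer goes straight into MPI calls;
     * without one, this would require explicit cudaMemcpy staging to the host. */
    if (rank == 0)
        MPI_Send(d_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}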
Interconnect/Network Technologies
https://spcl.inf.ethz.ch/ (slide from Torsten Hoefler)
Interconnect/Network Technologies
Route to Exascale: Network Acceleration
sPIN: High-performance streaming Processing in the Network
Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, Ron Brightwell
• Today’s network cards contain rather
powerful processors optimized for data
movement
• Offload simple packet processing
functions to the network card
• Portable packet-processing network
acceleration model similar to compute
acceleration with CUDA or OpenCL
https://spcl.inf.ethz.ch/ (slide from Torsten Hoefler)
Interconnect/Network Technologies
Route to Exascale & HPC network topologies
https://prace-ri.eu/wp-content/uploads/Best-Practice-Guide_Modern-Interconnects.pdf
Fat trees
Torus
Dragonfly
Good luck with the scalability on
exascale machines without smart
components/accelerators at the
hotspots (compute, network, storage)
NVIDIA GPU Micro-architecture
(chronologically)
1. Tesla
2. Fermi
3. Kepler
4. Maxwell
5. Pascal (CSCS)
6. Volta (Summit ORNL)
7. Turing
8. Ampere
GPU Architecture
GPU Architecture
https://www.nvidia.com/
GPU Architecture
Evolution
Pascal Streaming Multiprocessor
Ampere Streaming Multiprocessor
https://www.nvidia.com/
Supported Tensor Core precisions — NVIDIA Ampere: FP64, TF32, bfloat16, FP16, INT8, INT4, INT1; NVIDIA Turing: FP16, INT8, INT4, INT1; NVIDIA Volta: FP16
Supported CUDA Core precisions — NVIDIA Ampere: FP64, FP32, FP16, bfloat16, INT8; NVIDIA Turing: FP64, FP32, FP16, INT8; NVIDIA Volta: FP64, FP32, FP16, INT8
GPU Architecture: Tensor Cores
Each Tensor Core operates on 4x4 matrices and performs the fused multiply-add D = A × B + C (A and B in FP16; C and D in FP16 or FP32).
In Volta, each Tensor Core performs 64 floating point FMA* operations per clock, so the eight Tensor Cores in
an SM perform a total of 512 FMA operations per clock
*FMA (Fused Multiply-Accumulate)
https://www.nvidia.com/
Vectorization like in CPUs
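As a back-of-the-envelope check (the 80 SMs and ~1.53 GHz boost clock are V100 figures, not from the slide), the per-clock FMA count maps directly onto the advertised Tensor Core throughput:

\[
8~\tfrac{\text{Tensor Cores}}{\text{SM}} \times 64~\tfrac{\text{FMA}}{\text{clock}} \times 2~\tfrac{\text{FLOP}}{\text{FMA}}
= 1024~\tfrac{\text{FLOP}}{\text{clock} \cdot \text{SM}}
\]
\[
1024~\tfrac{\text{FLOP}}{\text{clock} \cdot \text{SM}} \times 80~\text{SMs} \times 1.53~\text{GHz}
\approx 125~\text{TFLOPS (FP16 Tensor, V100)}
\]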
GPU Architecture: Tensor Cores
How to use/activate them:
▪ The code simply needs to set a flag to tell the API and drivers that you want to
use Tensor Cores; the data type needs to be one supported by the cores, and
the dimensions of the matrices need to be multiples of 8. After that, the
hardware handles everything else.
▪ Easiest way is through cuBLAS or other NVIDIA HPC libraries
▪ For now, avoid the low-level API for accessing them directly (as ugly as CPU vectorization intrinsics)
Multiple of 8 for tensor cores
&
Multiple of 32 for vanilla CUDA cores
(warp scheduling)
// First, create a cuBLAS handle:
cublasHandle_t handle;
cublasStatus_t cublasStat = cublasCreate(&handle);
// Set the math mode to allow cuBLAS to use Tensor Cores:
cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
// Allocate and initialize your matrices (only the A matrix is shown; T_ELEM_IN is the input type, e.g. __half):
size_t matrixSizeA = (size_t)rowsA * colsA;
T_ELEM_IN *devPtrA = 0;                                          // device copy of A
cudaMalloc((void**)&devPtrA, matrixSizeA * sizeof(devPtrA[0]));
T_ELEM_IN *A = (T_ELEM_IN *)malloc(matrixSizeA * sizeof(A[0]));  // host copy of A
memset(A, 0xFF, matrixSizeA * sizeof(A[0]));
// Copy the host matrix to the device:
cublasStat = cublasSetMatrix(rowsA, colsA, sizeof(A[0]), A, rowsA, devPtrA, rowsA);
// ... allocate and initialize B and C matrices (not shown) ...
// Invoke the GEMM, ensuring k, lda, ldb, and ldc are all multiples of 8,
// and m is a multiple of 4 (alpha and beta are host pointers to float):
cublasStat = cublasGemmEx(handle, transa, transb, m, n, k, alpha,
                          devPtrA, CUDA_R_16F, lda,
                          devPtrB, CUDA_R_16F, ldb,
                          beta, devPtrC, CUDA_R_16F, ldc, CUDA_R_32F, algo);
A Few Rules
• The routine must be a GEMM; currently,
only GEMMs support Tensor Core
execution.
• GEMMs that do not satisfy the rules will
fall back to a non-Tensor Core
implementation.
https://www.nvidia.com/
https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/
GPU Architecture: Tensor Cores (cuBLAS example)
Demo Time!
Future considerations:
• Hardware Homogeneity
• Fast Network
• Parallel Storage
