Kotsalos Christos
Scientific and Parallel Computing Group (SPC)
μCluster +
Supercomputing/Cluster Technologies
μCluster
• 2 NVIDIA Jetson Nano boards
• 3 Raspberry Pi boards
• 1 NVIDIA Jetson TX2 board
• 1 network switch (16 ports, 1 Gbit/s)
• Cat. 6 network cables (1 Gbit/s)
μCluster components: Jetson Nano technical specifications
• GPU: 128-core NVIDIA Maxwell
• CPU: quad-core ARM Cortex-A57 @ 1.43 GHz
• Memory: 4 GB 64-bit LPDDR4, 25.6 GB/s
• Storage: microSD
• Connectivity: Gigabit Ethernet
• Display: HDMI and DisplayPort
• USB: 4x USB 3.0, USB 2.0 Micro-B
https://developer.nvidia.com/embedded/jetson-nano-developer-kit
JETSON TX2 MODULE
• NVIDIA Pascal Architecture GPU
• 2 Denver 64-bit CPUs + Quad-Core A57
• 8 GB 128-bit LPDDR4 Memory
• 32 GB eMMC 5.1 Flash Storage
• Connectivity to 802.11ac Wi-Fi and Bluetooth-Enabled Devices
• 10/100/1000BASE-T Ethernet
I/O
• USB 3.0 Type A
• USB 2.0 Micro AB
• HDMI
• Gigabit Ethernet
• Full-Size SD
• SATA Data and Power
POWER OPTIONS
• External 19V AC Adapter
μCluster components
https://developer.nvidia.com/embedded/jetson-tx2-developer-kit
Jetson family
Available as developer kits and modules; marketed for AI applications (Tensor Cores).
https://developer.nvidia.com/embedded/jetson-modules
Jetson family Specs (https://developer.nvidia.com/embedded/jetson-modules; TOPS: tera operations per second)

| | Jetson Nano | Jetson TX2 4GB | Jetson TX2 | Jetson TX2i | Jetson Xavier NX | Jetson AGX Xavier |
|---|---|---|---|---|---|---|
| AI Performance | 472 GFLOPS (FP16) | 1.33 TFLOPS (FP16) | 1.33 TFLOPS (FP16) | 1.26 TFLOPS (FP16) | 21 TOPS (INT8) | 32 TOPS (INT8) |
| GPU | 128-core NVIDIA Maxwell | 256-core NVIDIA Pascal | 256-core NVIDIA Pascal | 256-core NVIDIA Pascal | 384-core NVIDIA Volta with 48 Tensor Cores | 512-core NVIDIA Volta with 64 Tensor Cores |
| CPU | Quad-core ARM Cortex-A57 | Dual-core NVIDIA Denver 1.5 64-bit + quad-core ARM Cortex-A57 | Dual-core NVIDIA Denver 1.5 64-bit + quad-core ARM Cortex-A57 | Dual-core NVIDIA Denver 1.5 64-bit + quad-core ARM Cortex-A57 | 6-core NVIDIA Carmel ARM v8.2 64-bit, 6 MB L2 + 4 MB L3 | 8-core NVIDIA Carmel ARM v8.2 64-bit, 8 MB L2 + 4 MB L3 |
| Memory | 4 GB 64-bit LPDDR4, 25.6 GB/s | 4 GB 128-bit LPDDR4, 51.2 GB/s | 8 GB 128-bit LPDDR4, 59.7 GB/s | 8 GB 128-bit LPDDR4 (ECC support), 51.2 GB/s | 8 GB 128-bit LPDDR4x, 51.2 GB/s | 32 GB 256-bit LPDDR4x, 136.5 GB/s |
| Storage | 16 GB eMMC 5.1 | 16 GB eMMC 5.1 | 32 GB eMMC 5.1 | 32 GB eMMC 5.1 | 16 GB eMMC 5.1 | 32 GB eMMC 5.1 |
| Power | 5 W / 10 W | 7.5 W / 15 W | 7.5 W / 15 W | 10 W / 20 W | 10 W / 15 W | 10 W / 15 W / 30 W |
| Networking | 10/100/1000BASE-T Ethernet | 10/100/1000BASE-T Ethernet, WLAN | 10/100/1000BASE-T Ethernet, WLAN | 10/100/1000BASE-T Ethernet, WLAN | 10/100/1000BASE-T Ethernet | 10/100/1000BASE-T Ethernet |

For comparison, a Tesla P100-PCIE-16GB has 16 GB RAM and 3584 CUDA cores.
Raspberry Pi 3 Model B
• Quad Core 1.2GHz Broadcom BCM2837 64bit CPU
• 1GB RAM
• BCM43438 wireless LAN and Bluetooth Low Energy (BLE) on board
• 100BASE-T Ethernet
• Micro SD port for loading your operating system and storing data
μCluster components
https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/
Raspberry Pi latest version
https://www.raspberrypi.org/products/raspberry-pi-4-model-b/
Turing Pi (an alternative path, built around Raspberry Pi Compute Modules)
https://turingpi.com/v1/
Steps to build the μCluster
1. Install the OS (on just one board)
2. Install the software (on the same board)
3. Copy the image (OS + software) to all the other boards
4. Set up static IPs
5. Set up NFS (Network File System)
6. Run your MPI applications using a hostfile
mpirun -np # --hostfile myHostFile ./executable $(arguments)
j01 slots=4 max-slots=4 # Jetson TX2
j02 slots=4 max-slots=4 # Jetson Nano
j03 slots=4 max-slots=4 # Jetson Nano
j04 slots=4 max-slots=4 # Raspberry Pi
j05 slots=4 max-slots=4 # Raspberry Pi
j06 slots=4 max-slots=4 # Raspberry Pi
myHostFile lists the hostnames/IPs of the available infrastructure (configured in /etc/hosts & /etc/hostname)
https://selkie-macalester.org/csinparallel/modules/RosieCluster/build/html/#
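Steps 4 and 5 can be sketched as follows. This is a minimal sketch: the 192.168.1.x addresses, the j01/j02 hostnames, and the /home/shared export path are illustrative assumptions, not the cluster's actual configuration.

```shell
# Step 4: give every board a static IP via /etc/hosts entries, e.g.:
#   192.168.1.101  j01   # Jetson TX2
#   192.168.1.102  j02   # Jetson Nano

# Step 5: on the head node, export a shared directory over NFS:
echo "/home/shared 192.168.1.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# ...and on every other board, mount it:
sudo mount -t nfs j01:/home/shared /home/shared
```

With the share mounted everywhere, the executable and input files live in one place and every MPI rank sees the same paths.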
SLURM (a resource manager) as an alternative
NFS vs parallel FS
• NFS: a single storage server serves all compute nodes (n0 … nm). The server becomes a hotspot: NOT SCALABLE.
• PFS: the load is distributed across multiple storage servers, so there is no stress for the network: SCALABLE.
[Diagram: NFS single-server hotspot vs. PFS distributed storage]
Experiences on File Systems: Which is the best file system for you? Jakob Blomer, CERN PH/SFT, 2015
https://developer.ibm.com/tutorials/l-network-filesystems/
parallel FS
Metadata services store namespace
metadata, such as filenames, directories,
access permissions, and file layout
Object storage contains actual file data.
Clients pull the location of files and
directories from the metadata services,
then access file storage directly.
Popular PFS:
Lustre
BeeGFS
GlusterFS
https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/
https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223
parallel FS: Lustre
[Diagram: Lustre architecture, metadata services + object storage]
IOPS: I/O operations per second
μCluster benchmarks: Palabos-npFEM
Reference case study: hematocrit 20%, 50x50x50 μm³ box
• Red blood cells (red bodies): 272
• Platelets (yellow bodies): 27
https://gitlab.com/unigespc/palabos.git (coupledSimulators/npFEM)
μCluster benchmarks: Palabos-npFEM
[Chart: performance ratios of x6.5 and x8.1 versus the reference systems]
A promising alternative considering:
• Energy consumption
• Low cost of components
μCluster benchmarks: Palabos-npFEM
[Chart: intra-node communication vs. 2 nodes over Gbit Ethernet]
μCluster benchmarks: Network Performance
OSU Micro-Benchmarks, testing 2 nodes
osu_bw: bandwidth test (MPI_Isend & MPI_Irecv)
μCluster benchmarks: Network Performance
OSU Micro-Benchmarks, testing 2 nodes
osu_latency: latency test (MPI_Send & MPI_Recv, ping-pong)
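These two tests are commonly summarized with a simple alpha-beta cost model: sending s bytes costs roughly T(s) = alpha + s/beta, where alpha is the latency and beta the bandwidth, so small messages are latency-bound and large ones bandwidth-bound. A quick sketch; the alpha = 50 µs and beta = 1 Gbit/s figures are round assumptions in the ballpark of Gbit Ethernet, not the measured values:

```python
def transfer_time_us(size_bytes, alpha_us=50.0, beta_gbit_per_s=1.0):
    """Alpha-beta model: fixed latency plus size divided by bandwidth."""
    bytes_per_us = beta_gbit_per_s * 1e9 / 8 / 1e6  # 125 bytes/us at 1 Gbit/s
    return alpha_us + size_bytes / bytes_per_us

# Small messages pay mostly alpha; large ones are dominated by size/beta:
for size in (8, 1024, 65536, 1 << 20):
    print(f"{size:>8} B -> {transfer_time_us(size):10.1f} us")
```

The crossover where both terms contribute equally (here, alpha × beta ≈ 6 KB) marks where a benchmark curve bends from the latency plateau toward the bandwidth asymptote.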
μCluster benchmarks: Network Performance (mpiGraph)
[Charts: bandwidth vs. packet size on μCluster, Piz Daint and Baobab]
μCluster benchmarks:
Network Performance
OSU Micro-Benchmarks
https://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/OSU_MicroBenchmarks/
Importance of:
• MPI implementation
• Network capabilities
Interconnect/Network Technologies
InfiniBand
Ethernet
Omni-Path
TOP500 June 2020
https://www.top500.org/
| | SDR | DDR | QDR | FDR10 | FDR | EDR | HDR | NDR | XDR |
|---|---|---|---|---|---|---|---|---|---|
| Theoretical effective throughput, 1 link (Gbit/s) | 2 | 4 | 8 | 10 | 13.64 | 25 | 50 | 100 | 250 |
| 4 links | 8 | 16 | 32 | 40 | 54.54 | 100 | 200 | 400 | 1000 |
| 8 links | 16 | 32 | 64 | 80 | 109.08 | 200 | 400 | 800 | 2000 |
| 12 links | 24 | 48 | 96 | 120 | 163.64 | 300 | 600 | 1200 | 3000 |
| Adapter latency (µs) | 5 | 2.5 | 1.3 | 0.7 | 0.7 | 0.5 | less? | t.b.d. | t.b.d. |
| Year | 2001, 2003 | 2005 | 2007 | 2011 | 2011 | 2014 | 2017 | after 2020 | after 2023? |
Interconnect/Network Technologies
• InfiniBand (IB): https://en.wikipedia.org/wiki/InfiniBand
• Omni-Path (Intel)
• Aries (Cray)
• Ethernet
Links run over copper or fiber-optic wires. Each interconnect technology comes with its own hardware:
• NIC: Network Interface Card
• Switch: creates subnets
• Router: links subnets
• Bridge/Gateway: bridges different networks (e.g. IB & Ethernet)
Interconnect/Network Technologies
NVLink (NVIDIA)
GPU-to-GPU data transfers at up to 160 GB/s of bidirectional bandwidth, 5x the bandwidth of PCIe
https://prace-ri.eu/training-support/best-practice-guides/
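The "5x" figure can be checked against PCIe 3.0 x16, assuming 8 GT/s per lane with 128b/130b encoding; these PCIe parameters are my assumption for the comparison baseline, not stated on the slide:

```python
# PCIe 3.0 x16: 8 GT/s per lane, 128b/130b encoding, 16 lanes
pcie_per_dir = 8 * (128 / 130) * 16 / 8   # ~15.75 GB/s, one direction
pcie_bidir = 2 * pcie_per_dir             # ~31.5 GB/s bidirectional
nvlink_bidir = 160                        # GB/s (P100-class NVLink, all links)

print(f"PCIe 3.0 x16 bidirectional: {pcie_bidir:.1f} GB/s")
print(f"NVLink / PCIe ratio: {nvlink_bidir / pcie_bidir:.1f}x")  # ~5x
```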
Interconnect/Network Technologies
Absolute priorities:
• High bandwidth
• Low latency
But not only: RDMA, combined with smartNICs/FPGAs, tries to take the stress off the CPUs.
RDMA over Converged Ethernet (RoCE) is a
network protocol that allows remote direct
memory access over an Ethernet network. It
does this by encapsulating an IB transport
packet over Ethernet.
https://prace-ri.eu/training-support/best-practice-guides/
Ethernet uses a hierarchical topology that demands more computing power from the CPU. In contrast, IB uses a flat fabric topology in which data is moved directly by the network card via RDMA requests, reducing CPU involvement.
Interconnect/Network Technologies
GPU-aware MPI - NVIDIA GPUDirect (family of technologies)
GPU-aware MPI implementations can automatically handle MPI transactions with pointers to GPU memory
• GPUDirect Peer-to-Peer: among GPUs on the same node (native support through drivers & CUDA toolkit, over PCIe or NVLink)
• GPUDirect RDMA: among GPUs on different nodes (requires specialized hardware with RDMA support)
• MVAPICH, OpenMPI, Cray MPI, ...
How it works (@CSCS)
▪ Set the environment variable: export MPICH_RDMA_ENABLED_CUDA=1. Each pointer passed to MPI is then checked to see whether it points to host or device memory.
▪ If the variable is not set, MPI assumes that all pointers reference host memory, and passing device pointers will probably crash your application with segmentation faults.
https://www.cscs.ch/
Interconnect/Network Technologies
https://spcl.inf.ethz.ch/ (slide from Torsten Hoefler)
Interconnect/Network Technologies
Route to Exascale: Network Acceleration
sPIN: High-performance streaming Processing in the Network
Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, Ron Brightwell
• Today’s network cards contain rather
powerful processors optimized for data
movement
• Offload simple packet processing
functions to the network card
• Portable packet-processing network
acceleration model similar to compute
acceleration with CUDA or OpenCL
https://spcl.inf.ethz.ch/ (slide from Torsten Hoefler)
Interconnect/Network Technologies
Route to Exascale & HPC network topologies
https://prace-ri.eu/wp-content/uploads/Best-Practice-Guide_Modern-Interconnects.pdf
Fat trees
Torus
Dragonfly
Good luck with the scalability on
exascale machines without smart
components/accelerators at the
hotspots (compute, network, storage)
NVIDIA GPU Micro-architecture
(chronologically)
1. Tesla
2. Fermi
3. Kepler
4. Maxwell
5. Pascal (CSCS)
6. Volta (Summit ORNL)
7. Turing
8. Ampere
GPU Architecture
https://www.nvidia.com/
GPU Architecture
Evolution
Pascal Streaming Multiprocessor
Ampere Streaming Multiprocessor
https://www.nvidia.com/
| | NVIDIA Ampere | NVIDIA Turing | NVIDIA Volta |
|---|---|---|---|
| Supported Tensor Core precisions | FP64, TF32, bfloat16, FP16, INT8, INT4, INT1 | FP16, INT8, INT4, INT1 | FP16 |
| Supported CUDA Core precisions | FP64, FP32, FP16, bfloat16, INT8 | FP64, FP32, FP16, INT8 | FP64, FP32, FP16, INT8 |
GPU Architecture: Tensor Cores
Each Tensor Core operates on 4x4 matrices and performs the fused operation D = A x B + C.
In Volta, each Tensor Core performs 64 floating-point FMA* operations per clock, so the eight Tensor Cores in an SM perform a total of 512 FMA operations per clock.
*FMA: Fused Multiply-Accumulate
https://www.nvidia.com/
Vectorization, like in CPUs.
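The 4x4 fused multiply-accumulate that a single Tensor Core performs can be written out in plain Python. This scalar code is purely illustrative of the operation's semantics; the hardware does all 64 multiplies in a single clock:

```python
def tensor_core_op(A, B, C):
    """D = A x B + C on 4x4 matrices: the fused multiply-accumulate
    one Tensor Core performs (FP16 inputs, FP32 accumulate on Volta)."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) + C[i][j]
             for j in range(4)] for i in range(4)]

I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity
Z4 = [[0.0] * 4 for _ in range(4)]                                   # zeros
B = [[float(4 * i + j) for j in range(4)] for i in range(4)]
print(tensor_core_op(I4, B, Z4) == B)  # → True: I x B + 0 == B

# Throughput arithmetic from the slide: 64 FMAs/clock per core, 8 cores per SM
print(64 * 8)  # → 512 FMA operations per SM per clock
```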
GPU Architecture: Tensor Cores
How to use/activate them:
▪ The code needs to tell the API and drivers (via a flag) that you want to use Tensor Cores, the data type must be one supported by the cores, and the matrix dimensions must be multiples of 8. After that, the hardware handles everything else.
▪ The easiest way is through cuBLAS or other NVIDIA HPC libraries.
▪ For now, avoid the low-level API for accessing them (it is as ugly as CPU vectorization).
Rule of thumb: dimensions that are multiples of 8 for Tensor Cores, and multiples of 32 for vanilla CUDA cores (warp scheduling).
// First, create a cuBLAS handle:
cublasHandle_t handle;
cublasStatus_t cublasStat = cublasCreate(&handle);
// Set the math mode to allow cuBLAS to use Tensor Cores:
cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
// Allocate and initialize your matrices (only the A matrix is shown):
size_t matrixSizeA = (size_t)rowsA * colsA;
T_ELEM_IN *devPtrA = nullptr;
cudaMalloc((void **)&devPtrA, matrixSizeA * sizeof(devPtrA[0]));
T_ELEM_IN *A = (T_ELEM_IN *)malloc(matrixSizeA * sizeof(A[0]));
memset(A, 0xFF, matrixSizeA * sizeof(A[0]));
status1 = cublasSetMatrix(rowsA, colsA, sizeof(A[0]), A, rowsA, devPtrA, rowsA);
// ... allocate and initialize the B and C matrices (not shown) ...
// Invoke the GEMM, ensuring k, lda, ldb, and ldc are all multiples of 8,
// and m is a multiple of 4:
cublasStat = cublasGemmEx(handle, transa, transb, m, n, k, alpha,
                          devPtrA, CUDA_R_16F, lda,
                          devPtrB, CUDA_R_16F, ldb,
                          beta, devPtrC, CUDA_R_16F, ldc,
                          CUDA_R_32F, algo);
A few rules:
• The routine must be a GEMM; currently, only GEMMs support Tensor Core execution.
• GEMMs that do not satisfy the rules fall back to a non-Tensor Core implementation.
https://www.nvidia.com/
https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/
GPU Architecture: Tensor Cores (cuBLAS example)
Demo Time!
Future considerations:
• Hardware Homogeneity
• Fast Network
• Parallel Storage
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 

μCluster

  • 1. Kotsalos Christos Scientific and Parallel Computing Group (SPC) μCluster + Supercomputing/Cluster Technologies
  • 2. μCluster 2 NVIDIA Jetson Nano Boards 3 Raspberry Pi Boards 1 NVIDIA Jetson TX2 Board 1 Network Switch 16-ports Support 1 Gbit/s Cat. 6 Network Cables Support 1 Gbit/s
  • 3. GPU 128-Cuda Core Maxwell CPU Quad-core ARM A57 @ 1.43 GHz Memory 4 GB 64-bit LPDDR4 25.6 GB/s Storage microSD Connectivity Gigabit Ethernet Display HDMI and display port USB 4x USB 3.0, USB 2.0 Micro-B μCluster components https://developer.nvidia.com/embedded/jetson-nano-developer-kit
  • 4. Technical Specifications JETSON TX2 MODULE • NVIDIA Pascal Architecture GPU • 2 Denver 64-bit CPUs + Quad-Core A57 • 8 GB 128-bit LPDDR4 Memory • 32 GB eMMC 5.1 Flash Storage • Connectivity to 802.11ac Wi-Fi and Bluetooth-Enabled Devices • 10/100/1000BASE-T Ethernet I/O • USB 3.0 Type A • USB 2.0 Micro AB • HDMI • Gigabit Ethernet • Full-Size SD • SATA Data and Power POWER OPTIONS • External 19V AC Adapter μCluster components https://developer.nvidia.com/embedded/jetson-tx2-developer-kit
  • 5. Developer Kits Modules Jetson family Marketed for AI apps (Tensor cores) https://developer.nvidia.com/embedded/jetson-modules
  • 6. Jetson family Specs (https://developer.nvidia.com/embedded/jetson-modules). TOPS: Tera-Operations per Second.
      Jetson Nano: 472 GFLOPS (FP16); 128-core NVIDIA Maxwell GPU; Quad-Core ARM Cortex-A57; 4 GB 64-bit LPDDR4 @ 25.6 GB/s; 16 GB eMMC 5.1; 5 W / 10 W; 10/100/1000 BASE-T Ethernet
      Jetson TX2 4GB: 1.33 TFLOPS (FP16); 256-core NVIDIA Pascal GPU; Dual-Core NVIDIA Denver 1.5 64-bit CPU + Quad-Core ARM Cortex-A57; 4 GB 128-bit LPDDR4 @ 51.2 GB/s; 16 GB eMMC 5.1; 7.5 W / 15 W; 10/100/1000 BASE-T Ethernet, WLAN
      Jetson TX2: as TX2 4GB, but 8 GB 128-bit LPDDR4 @ 59.7 GB/s; 32 GB eMMC 5.1
      Jetson TX2i: 1.26 TFLOPS (FP16); 256-core NVIDIA Pascal GPU; 8 GB 128-bit LPDDR4 (ECC support) @ 51.2 GB/s; 32 GB eMMC 5.1; 10 W / 20 W
      Jetson Xavier NX: 21 TOPS (INT8); 384-core NVIDIA Volta GPU with 48 Tensor Cores; 6-core NVIDIA Carmel ARM v8.2 64-bit CPU (6 MB L2 + 4 MB L3); 8 GB 128-bit LPDDR4x @ 51.2 GB/s; 16 GB eMMC 5.1; 10 W / 15 W; 10/100/1000 BASE-T Ethernet
      Jetson AGX Xavier: 32 TOPS (INT8); 512-core NVIDIA Volta GPU with 64 Tensor Cores; 8-core NVIDIA Carmel Arm v8.2 64-bit CPU (8 MB L2 + 4 MB L3); 32 GB 256-bit LPDDR4x @ 136.5 GB/s; 32 GB eMMC 5.1; 10 W / 15 W / 30 W
      For scale, a Tesla P100-PCIE-16GB has 16 GB RAM and 3584 CUDA cores.
  • 7. Raspberry Pi 3 Model B • Quad-Core 1.2 GHz Broadcom BCM2837 64-bit CPU • 1 GB RAM • BCM43438 wireless LAN and Bluetooth Low Energy (BLE) on board • 10/100 BASE-T Ethernet • Micro SD port for loading your operating system and storing data μCluster components https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/
  • 8. Raspberry Pi latest version https://www.raspberrypi.org/products/raspberry-pi-4-model-b/
  • 9. Turing Pi (alternative path) https://turingpi.com/v1/ Raspberry Pi Compute Module
  • 10. Turing Pi (alternative path) https://turingpi.com/v1/
  • 11. Steps to build the μCluster:
      1. Install the OS (on just one board)
      2. Install the software (on just one board)
      3. Copy the image (OS + software) to all the other boards
      4. Set up static IPs
      5. Set up NFS (Network File System)
      6. Run your MPI applications using a hostfile:
         mpirun -np # --hostfile myHostFile ./executable $(arguments)
      myHostFile (IPs/hostnames of the available infrastructure; see /etc/hosts & /etc/hostname):
         j01 slots=4 max-slots=4 # Jetson TX2
         j02 slots=4 max-slots=4 # Jetson Nano
         j03 slots=4 max-slots=4 # Jetson Nano
         j04 slots=4 max-slots=4 # Raspberry Pi
         j05 slots=4 max-slots=4 # Raspberry Pi
         j06 slots=4 max-slots=4 # Raspberry Pi
      SLURM as an alternative (Resource Manager)
      https://selkie-macalester.org/csinparallel/modules/RosieCluster/build/html/#
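Steps 4 and 5 above can be sketched as configuration fragments; the hostnames (j01, ...), the 192.168.1.x address range, and the /home/shared export path are illustrative assumptions, not values from the deck:

```shell
# --- /etc/hosts (every board): static IPs for the nodes; addresses are examples ---
192.168.1.101 j01   # Jetson TX2, head node / NFS server
192.168.1.102 j02   # Jetson Nano
192.168.1.103 j03   # Jetson Nano
192.168.1.104 j04   # Raspberry Pi

# --- /etc/exports (head node only): share a directory over NFS ---
/home/shared 192.168.1.0/24(rw,sync,no_subtree_check)

# --- /etc/fstab (all other boards): mount the share at the same path ---
j01:/home/shared  /home/shared  nfs  defaults  0  0
```

Mounting the share at the same path on every node matters because mpirun expects the executable and its input files to resolve identically on all hosts in the hostfile.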
  • 12. NFS vs parallel FS. NFS: a single storage server serves all compute nodes (n0 ... nm); that server becomes a hotspot, so NFS is NOT SCALABLE. PFS: data is spread over many storage nodes (n0 ... nk); access scales with the number of servers and there is no stress on the network, so a PFS is SCALABLE. "Experiences on File Systems: Which is the best file system for you?", Jakob Blomer, CERN PH/SFT, 2015. https://developer.ibm.com/tutorials/l-network-filesystems/
  • 13. parallel FS Metadata services store namespace metadata, such as filenames, directories, access permissions, and file layout Object storage contains actual file data. Clients pull the location of files and directories from the metadata services, then access file storage directly. Popular PFS: Lustre BeeGFS GlusterFS https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/ https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223
  • 14. parallel FS: Lustre (metadata services + object storage). IOPS: I/O operations per second. No stress for the network! https://azure.microsoft.com/en-ca/resources/parallel-virtual-file-systems-on-microsoft-azure/ https://techcommunity.microsoft.com/t5/azure-global/parallel-file-systems-for-hpc-storage-on-azure/ba-p/306223
  • 15. μCluster benchmarks: Palabos-npFEM Red Blood Cells (red bodies) : 272 Platelets (yellow bodies) : 27 Reference case study Hematocrit 20%, box 50x50x50 μm3 https://gitlab.com/unigespc/palabos.git (coupledSimulators/npFEM)
  • 16. μCluster benchmarks: Palabos-npFEM x6.5 x8.1 Promising alternative considering: • Energy consumption • Low cost of components
  • 18. μCluster benchmarks: Network Performance. OSU Micro-Benchmarks, testing 2 nodes. osu_bw: Bandwidth Test (streaming MPI_Isend & MPI_Irecv)
  • 19. μCluster benchmarks: Network Performance. OSU Micro-Benchmarks, testing 2 nodes. osu_latency: Latency Test (MPI_Send & MPI_Recv, ping-pong test)
  • 20. μCluster benchmarks: Network Performance. mpiGraph, run on Piz Daint, Baobab, and the μCluster
  • 21. μCluster benchmarks: Network Performance. OSU Micro-Benchmarks: bandwidth vs. packet size. Importance of: • MPI implementation • Network capabilities. https://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/OSU_MicroBenchmarks/
  • 23. Interconnect/Network Technologies: InfiniBand (IB), Omni-Path (Intel), Aries (Cray), Ethernet; copper or fiber-optic wires. Each interconnect technology comes with its own hardware:
      • NIC: Network Interface Card
      • Switch: create subnets
      • Router: link subnets
      • Bridge/Gateway: bridge different networks (IB & Ethernet)
      InfiniBand generations, theoretical effective throughput (https://en.wikipedia.org/wiki/InfiniBand):
                                     SDR        DDR   QDR   FDR10  FDR     EDR   HDR    NDR         XDR
      1 link (Gbit/s)                2          4     8     10     13.64   25    50     100         250
      4 links (Gbit/s)               8          16    32    40     54.54   100   200    400         1000
      8 links (Gbit/s)               16         32    64    80     109.08  200   400    800         2000
      12 links (Gbit/s)              24         48    96    120    163.64  300   600    1200        3000
      Adapter latency (µs)           5          2.5   1.3   0.7    0.7     0.5   less?  t.b.d.      t.b.d.
      Year                           2001/2003  2005  2007  2011   2011    2014  2017   after 2020  after 2023?
  • 24. Interconnect/Network Technologies NVLink (NVIDIA) GPU-to-GPU data transfers at up to 160 Gbytes/s of bidirectional bandwidth, 5x the bandwidth of PCIe https://prace-ri.eu/training-support/best-practice-guides/
  • 25. Interconnect/Network Technologies. Absolute priorities: High Bandwidth and Low Latency, but not only: RDMA combined with smartNICs/FPGAs (to reduce the stress on CPUs). RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access over an Ethernet network; it does this by encapsulating an IB transport packet over Ethernet. Ethernet uses a hierarchical topology, which demands more computation from the CPU; in contrast, with the flat fabric topology of IB the data is moved directly by the network card using RDMA requests, reducing CPU involvement. https://prace-ri.eu/training-support/best-practice-guides/
  • 26. Interconnect/Network Technologies: GPU-aware MPI, NVIDIA GPUDirect (a family of technologies). GPU-aware MPI implementations can automatically handle MPI transactions with pointers to GPU memory. • GPUDirect Peer to Peer: among GPUs on the same node (native support through drivers & CUDA toolkit, over PCIe or NVLink) • GPUDirect RDMA: among GPUs on different nodes (specialized hardware with RDMA support) • Supported by MVAPICH, OpenMPI, Cray MPI, ... How it works (@CSCS): ▪ Set the environment variable: export MPICH_RDMA_ENABLED_CUDA=1 ▪ Each pointer passed to MPI is checked to see if it is in host or device memory. If the variable is not set, MPI assumes that all pointers point at host memory, and your application will probably crash with seg faults. https://www.cscs.ch/
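With a CUDA-aware MPI, device pointers can be passed straight to MPI calls. The halo-exchange fragment below is a non-runnable sketch (it needs CUDA hardware and a GPUDirect-enabled MPI build); the buffer name, neighbor ranks, and layout are made up for illustration:

```c
/* Sketch only: requires a CUDA-aware MPI build (e.g. MVAPICH2-GDR,
   Open MPI with UCX, or Cray MPI with MPICH_RDMA_ENABLED_CUDA=1). */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_halo(double *d_halo, int n, int left, int right, MPI_Comm comm) {
    /* d_halo points to GPU memory. No cudaMemcpy to a host staging buffer
       is needed: the MPI library detects the device pointer and moves the
       data via GPUDirect P2P/RDMA when the hardware supports it. */
    MPI_Sendrecv(d_halo,     n, MPI_DOUBLE, right, 0,   /* send my halo right  */
                 d_halo + n, n, MPI_DOUBLE, left,  0,   /* receive from left   */
                 comm, MPI_STATUS_IGNORE);
}
```

As the slide notes, if the CUDA-aware path is not enabled, the library assumes every pointer is host memory and a call like this typically segfaults.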
  • 28. Interconnect/Network Technologies Route to Exascale: Network Acceleration sPIN: High-performance streaming Processing in the Network Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, Ron Brightwell • Today’s network cards contain rather powerful processors optimized for data movement • Offload simple packet processing functions to the network card • Portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL https://spcl.inf.ethz.ch/ (slide from Torsten Hoefler)
  • 29. Interconnect/Network Technologies Route to Exascale & HPC network topologies https://prace-ri.eu/wp-content/uploads/Best-Practice-Guide_Modern-Interconnects.pdf Fat trees Torus Dragonfly Good luck with the scalability on exascale machines without smart components/accelerators at the hotspots (compute, network, storage)
  • 30. NVIDIA GPU Micro-architecture (chronologically) 1. Tesla 2. Fermi 3. Kepler 4. Maxwell 5. Pascal (CSCS) 6. Volta (Summit ORNL) 7. Turing 8. Ampere GPU Architecture
  • 32. GPU Architecture Evolution Pascal Streaming Multiprocessor Ampere Streaming Multiprocessor https://www.nvidia.com/
  • 33. GPU Architecture: Tensor Cores. Supported precisions (https://www.nvidia.com/):
      NVIDIA Ampere: Tensor Cores FP64, TF32, bfloat16, FP16, INT8, INT4, INT1; CUDA cores FP64, FP32, FP16, bfloat16, INT8
      NVIDIA Turing: Tensor Cores FP16, INT8, INT4, INT1; CUDA cores FP64, FP32, FP16, INT8
      NVIDIA Volta: Tensor Cores FP16; CUDA cores FP64, FP32, FP16, INT8
      Each Tensor Core operates on a 4x4 matrix and performs the operation D = A x B + C. In Volta, each Tensor Core performs 64 floating point FMA* operations per clock, and eight Tensor Cores in an SM perform a total of 512 FMA operations. *FMA: Fused Multiply-Accumulate. Vectorization, like in CPUs.
  • 34. GPU Architecture: Tensor Cores How to use/activate them: ▪ The code simply needs to use a flag to tell the API and drivers that you want to use tensor cores, the data type needs to be one supported by the cores, and the dimensions of the matrices need to be a multiple of 8. After that, that hardware will handle everything else. ▪ Easiest way is through cuBLAS or other NVIDIA HPC libraries ▪ For now avoid the API to access them (ugly like CPU-vectorization) Multiple of 8 for tensor cores & Multiple of 32 for vanilla CUDA cores (warp scheduling)
  • 35. GPU Architecture: Tensor Cores (cuBLAS example)
      // First, create a cuBLAS handle:
      cublasStatus_t cublasStat = cublasCreate(&handle);
      // Set the math mode to allow cuBLAS to use Tensor Cores:
      cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
      // Allocate and initialize your matrices (only the A matrix is shown):
      size_t matrixSizeA = (size_t)rowsA * colsA;
      T_ELEM_IN *devPtrA = 0;
      cudaMalloc((void**)&devPtrA, matrixSizeA * sizeof(devPtrA[0]));
      T_ELEM_IN *A = (T_ELEM_IN *)malloc(matrixSizeA * sizeof(A[0]));
      memset(A, 0xFF, matrixSizeA * sizeof(A[0]));
      status1 = cublasSetMatrix(rowsA, colsA, sizeof(A[0]), A, rowsA, devPtrA, rowsA);
      // ... allocate and initialize B and C matrices (not shown) ...
      // Invoke the GEMM, ensuring k, lda, ldb, and ldc are all multiples of 8,
      // and m is a multiple of 4; note that the device pointers are passed:
      cublasStat = cublasGemmEx(handle, transa, transb, m, n, k, alpha,
                                devPtrA, CUDA_R_16F, lda, devPtrB, CUDA_R_16F, ldb,
                                beta, devPtrC, CUDA_R_16F, ldc, CUDA_R_32F, algo);
      Few Rules: • The routine must be a GEMM; currently, only GEMMs support Tensor Core execution. • GEMMs that do not satisfy the rules will fall back to a non-Tensor Core implementation.
      https://www.nvidia.com/ https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/
  • 36. Demo Time! Future considerations: • Hardware Homogeneity • Fast Network • Parallel Storage