Craig Tierney, Solution Architect, NVIDIA
August 15, 2017
TUNING MVAPICH2 FOR MULTI-RAIL INFINIBAND AND NVLINK
Agenda
• Overview
• Multi-rail
• Intranode performance with NVLink
GPUDIRECT FAMILY [1]
Technologies, enabling products
• GPUDIRECT SHARED GPU-SYSMEM – GPU pinned memory shared with other RDMA-capable devices; avoids intermediate copies
• GPUDIRECT P2P – Accelerated GPU-GPU memory copies; inter-GPU direct load/store access
• GPUDIRECT RDMA [2] – Direct GPU to 3rd-party device transfers, e.g. direct I/O, optimized inter-node communication
• GPUDIRECT ASYNC – Direct GPU to 3rd-party device synchronizations, e.g. optimized inter-node communication
[1] https://developer.nvidia.com/gpudirect [2] http://docs.nvidia.com/cuda/gpudirect-rdma
Each platform has different performance characteristics; unique tuning is required.
OVERVIEW
• GPUs provide significantly more performance than CPUs for many HPC workloads

Application | Domain | CPU Node Equivalence
HOOMD-Blue | Particle Dynamics | 31
SPECFEM 3D | Seismic Wave Propagation | 78
VASP | Quantum Chemistry | 24
LSMS | Materials Code | 36
QUDA | Lattice Quantum Chromodynamics | 71

CPU node equivalence: number of CPU nodes necessary to achieve the same performance as one 8-way P100 server.
http://www.nvidia.com/object/application-performance-guide.html

Takeaway:
Network communications have to be highly efficient for applications to scale.
Single-rail networks may not be enough.
MULTI-RAIL APPLICATION PERFORMANCE
[Chart: HPL relative performance on 124 DGX servers vs. number of rails (1-4); relative-performance axis from 1.00 to 1.45, with labeled data points at 1.20 and 1.39]
TEST CONFIGURATION
• Servers
• 8 NVIDIA P100 GPUs with NVLink
• Intel Xeon E5-2697 v4 (Broadwell) CPUs
• Quad-rail Mellanox EDR InfiniBand
• Switch
• Mellanox SB7790, 36-port EDR switch
• MPI – MVAPICH2-GDR 2.2 with GCC
Results presented here are intended to show general trends, not to guarantee performance for any particular system or configuration.
NVLINK
• NVLink is an energy-efficient, high-bandwidth path between the GPUs and the CPU
• NVLink Gen 1 provides up to 160 GB/s
• NVLink Gen 2 provides up to 300 GB/s
• IBM POWER8 (and future) systems connect GPUs directly to the CPU, removing the need for PCIe to communicate with the GPU
SYSTEM ARCHITECTURE DIAGRAMS W/ NVLINK
[Diagrams: GPU/CPU topology of a Supermicro 1028GQ-TRX (four GPUs, two CPU sockets) and an IBM Power8 Minsky (GPUs attached directly to the CPUs via NVLink)]
[Diagram: NVIDIA DGX server - eight GPUs, GPU0-GPU3 under CPU0 and GPU4-GPU7 under CPU1, connected by NVLink]
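Before tuning affinity, it helps to confirm a node's actual topology. A quick check with standard NVIDIA tooling (output format varies by driver version):

# Print the GPU-GPU and GPU-HCA connectivity matrix (NVLink, PCIe, socket links)
nvidia-smi topo -m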
IB PERFORMANCE, OSU_BW
[Charts: osu_bw bandwidth results. The single-rail curve hits a limit at 256 KB, caused by GDRCopy being enabled.]
IB PERFORMANCE, OSU_BIBW
[Chart: osu_bibw bidirectional bandwidth results]
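For reference, a minimal launch sketch for a two-node device-to-device osu_bw run with MVAPICH2-GDR. The hostnames and benchmark path are placeholders, and the D D arguments assume a CUDA-enabled build of the OSU micro-benchmarks:

# osu_bw with GPU (device) buffers on both ends; MV2_USE_CUDA enables CUDA support
mpirun_rsh -np 2 node01 node02 MV2_USE_CUDA=1 ./osu_bw D D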
RANK LAYOUT ON MULTI-RAIL GPU SYSTEMS
• While multi-rail performance is impressive, in practice there will be multiple ranks per node, most likely one rank per GPU
• A better test is to measure bandwidth with one MPI rank per GPU
• This is the OSU_MBW_MR test
• The test has been modified to support device-to-device (DtD) as well as host-to-host (HtH) transfers (a launch sketch follows below)
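A hedged launch sketch for the multi-pair test across two 8-GPU nodes, one rank per GPU. The hostfile and benchmark path are placeholders, and the D D arguments assume the modified osu_mbw_mr follows the usual OSU device-buffer convention:

# 16 ranks, 8 per node: rank i on the first node pairs with rank i on the second
mpirun_rsh -np 16 -hostfile ./hosts MV2_USE_CUDA=1 ./osu_mbw_mr D D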
IB PERFORMANCE, OSU_MBW_MR
[Charts: aggregate osu_mbw_mr bandwidth, HtH vs. DtD. With default settings there is a dip in DtD performance caused by lack of tuning; with affinity and tuning, DtD performance nearly matches HtH.]
TUNABLE PARAMETERS

MVAPICH2 Tunable | Default Value | Proposed Value | Reason
MV2_CUDA_IPC_THRESHOLD | 32768 | 262144 | Improve DtD intranode transfers at 32K and above
MV2_USE_GPUDIRECT_GDRCOPY_LIMIT | 8192 | 32768 | Improve HtD intranode transfers between 16K and 32K
MV2_GPUDIRECT_LIMIT | 8192 | 4194304* | Improve internode transfers
MV2_USE_SMP_GDR | 1 | 0 | Offset performance degradation to intranode transfers when setting MV2_GPUDIRECT_LIMIT
MV2_GPUDIRECT_RECEIVE_LIMIT | 131072 | 4194304* | Improve all intranode transfers at 1M and above

* May need to be larger depending on your largest transfer.
Note: the search of this space was not exhaustive. More optimization can be done.
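A minimal sketch of applying the proposed values as environment variables; the 4194304 values marked * may need to be raised to cover the application's largest transfer:

# Proposed MVAPICH2-GDR tunables from the table above
export MV2_CUDA_IPC_THRESHOLD=262144
export MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=32768
export MV2_GPUDIRECT_LIMIT=4194304
export MV2_USE_SMP_GDR=0
export MV2_GPUDIRECT_RECEIVE_LIMIT=4194304
# With mpirun_rsh, these are typically passed as KEY=VALUE arguments on the launch command line.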
INTRANODE MPI PERFORMANCE
• With the tunables above, what is the performance of intranode transfers?
• Testing pairs:
• GPUs 0 and 1 – NVLink connected, Same PCIe switch
• GPUs 0 and 4 – NVLink connected, Different CPU socket
• GPUs 0 and 6 – Not NVLink connected, Different CPU socket
INTRANODE TRANSFERS
[Charts: intranode bandwidth and latency for each GPU pair (0-1, 0-4, 0-6), with default vs. tuned settings. Latency is similar for all cases as loopback IB is used.]
SUMMARY
• MVAPICH2 is already very efficient with dual-rail and quad-rail configurations
• Setting affinity properly for HCAs and GPUs is key to maximizing total node bandwidth
• NVLink bandwidth provides great potential for improving scalability over PCIe
• MPI point-to-point message passing can be greatly improved with the tunables provided in the MVAPICH2 stack
• More tuning will improve results
• More tuning may reveal that parameters need to be scoped differently for intranode and internode transfers
• Results should be rerun with MVAPICH2 2.3a to see what changes have been made
Thank you!
#!/bin/bash
# Launch wrapper for two-rank pair tests: maps each MPI rank to the HCA and
# CPU core list associated with a chosen local-rank "slot" (one slot per GPU).
# Usage: <wrapper> <slot-for-rank-0> <slot-for-rank-1> <application> [args...]
if [ "$MV2_COMM_WORLD_LOCAL_RANK" -eq 0 ]; then
  export LOCAL_RANK=$1
elif [ "$MV2_COMM_WORLD_LOCAL_RANK" -eq 1 ]; then
  export LOCAL_RANK=$2
else
  echo "THIS TOOL ONLY SUPPORTS 2 RANKS, EXITING"
  exit 1
fi
shift
shift

# Select the HCA and core range closest to the chosen slot.
case ${LOCAL_RANK} in
  [0])
    export MV2_IBA_HCA=mlx5_0
    export CORELIST=0-4
    ;;
  [1])
    export MV2_IBA_HCA=mlx5_0
    export CORELIST=5-9
    ;;
  [2])
    export MV2_IBA_HCA=mlx5_1
    export CORELIST=10-14
    ;;
  # … (slots 3-5 elided on the original slide) …
  [6])
    export MV2_IBA_HCA=mlx5_3
    export CORELIST=30-34
    ;;
  [7])
    export MV2_IBA_HCA=mlx5_3
    export CORELIST=35-39
    ;;
esac

# Disable MVAPICH2's built-in CPU affinity and pin the rank with numactl instead.
export MV2_ENABLE_AFFINITY=0
numactl --physcpubind=$CORELIST "$@"
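A hedged usage sketch for the wrapper above. The script name wrapper.sh, the hostname, and the benchmark path are placeholders; how each rank selects its GPU is left to the benchmark and MPI settings and is not handled by the wrapper:

# Intranode pair test using the GPU-0 and GPU-6 slots: both ranks run on node01,
# rank 0 gets mlx5_0 / cores 0-4 and rank 1 gets mlx5_3 / cores 30-34
mpirun_rsh -np 2 node01 node01 MV2_USE_CUDA=1 ./wrapper.sh 0 6 ./osu_bw D D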