Best Practices: Application Profiling at
the HPCAC High Performance Center
Pak Lui
2
157 Applications Best Practices Published
• Nekbone
• NEMO
• NWChem
• Octopus
• OpenAtom
• OpenFOAM
• OpenMX
• OptiStruct
• PAM-CRASH / VPS
• PARATEC
• Pretty Fast Analysis
• PFLOTRAN
• Quantum ESPRESSO
• RADIOSS
• HPCC
• HPCG
• HYCOM
• ICON
• Lattice QCD
• LAMMPS
• LS-DYNA
• miniFE
• MILC
• MSC Nastran
• MR Bayes
• MM5
• MPQC
• NAMD
• Abaqus
• ABySS
• AcuSolve
• Amber
• AMG
• AMR
• ANSYS CFX
• ANSYS Fluent
• ANSYS Mechanical
• BQCD
• BSMBench
• CAM-SE
• CCSM 4.0
• CESM
• COSMO
• CP2K
• CPMD
• Dacapo
• Desmond
• DL-POLY
• Eclipse
• FLOW-3D
• GADGET-2
• Graph500
• GROMACS
• Himeno
• HIT3D
• HOOMD-blue
• RFD tNavigator
• SNAP
• SPECFEM3D
• STAR-CCM+
• STAR-CD
• VASP
• WRF
For more information, visit: http://www.hpcadvisorycouncil.com/best_practices.php
3
35 Applications Installation Best Practices Published
• Adaptive Mesh Refinement (AMR)
• Amber (for GPU/CUDA)
• Amber (for CPU)
• ANSYS Fluent 15.0.7
• ANSYS Fluent 17.1
• BQCD
• CASTEP 16.1
• CESM
• CP2K
• CPMD
• DL-POLY 4
• ESI PAM-CRASH 2015.1
• ESI PAM-CRASH / VPS 2013.1
• GADGET-2
• GROMACS 5.1.2
• GROMACS 4.5.4
• GROMACS 5.0.4 (GPU/CUDA)
• Himeno
• HOOMD Blue
• LAMMPS
• LAMMPS-KOKKOS
• LS-DYNA
• MrBayes
• NAMD
• NEMO
• NWChem
• Octopus
• OpenFOAM
• OpenMX
• PyFR
• Quantum ESPRESSO 4.1.2
• Quantum ESPRESSO 5.1.1
• Quantum ESPRESSO 5.3.0
• WRF 3.2.1
• WRF 3.8
For more information, visit: http://www.hpcadvisorycouncil.com/subgroups_hpc_works.php
4
HPC Advisory Council HPC Center
• Dell™ PowerEdge™ R730 GPU 36-node cluster
• HPE Cluster Platform 3000SL 16-node cluster
• HPE ProLiant SL230s Gen8 4-node cluster
• Dell™ PowerEdge™ M610 38-node cluster
• Dell™ PowerEdge™ C6100 4-node cluster
• IBM POWER8 8-node cluster
• Dell™ PowerEdge™ R720xd/R720 32-node GPU cluster
• Dell PowerVault MD3420 / MD3460 InfiniBand Storage (Lustre)
• HPE Apollo 6000 10-node cluster
• Two 4-node GPU clusters
• Dell™ PowerEdge™ R815 11-node cluster
• Dell™ PowerEdge™ C6145 6-node cluster
5
Agenda
• Overview of HPC Applications Performance
• Ways to Inspect, Profile, and Optimize HPC Applications
– CPU, memory, file I/O, network
• System Configurations and Tuning
• Case Studies, Performance Comparisons, Optimizations and Highlights
• Conclusions
6
HPC Application Performance Overview
• To achieve scalable performance on HPC applications
– Requires understanding the workload by performing profile analysis
• Tune where the most time is spent (CPU, network, I/O, etc.)
– Underlying implicit requirement: each node must perform similarly
• Run CPU/memory/network tests or a cluster checker to identify bad node(s) (see the sketch below)
– Compare behavior when using different HW components
• This pinpoints bottlenecks in different areas of the HPC cluster
• A selection of HPC applications will be shown
– To demonstrate methods of profiling and analysis
– To determine the bottleneck in SW/HW
– To determine the effectiveness of tuning to improve performance
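One minimal way to realize the "identify bad node(s)" step above is to time an identical memory-bound kernel on every node and have one rank flag the outliers. The sketch below is an illustrative MPI/C harness in the spirit of STREAM; the file name, array size, and 80% threshold are assumptions made for this example, and it is not the HPCAC test harness or a replacement for a full cluster checker.

```c
/* node_check.c - minimal per-node memory-bandwidth sanity check (illustrative sketch only;
 * the file name, array size and 80% threshold are arbitrary choices for this example).
 * Build: mpicc -O2 -o node_check node_check.c
 * Run:   mpirun -np <num_nodes> --map-by node ./node_check
 *        (one rank per node; the exact mapping flag depends on the MPI library)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 25)   /* ~33M doubles per array, ~800 MB working set per rank */
#define NTRIALS 5

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b), *c = malloc(N * sizeof *c);
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double best = 0.0;
    for (int t = 0; t < NTRIALS; t++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (long i = 0; i < N; i++)              /* STREAM-like triad */
            a[i] = b[i] + 3.0 * c[i];
        double gbs = 3.0 * N * sizeof(double) / (MPI_Wtime() - t0) / 1e9;
        if (gbs > best) best = gbs;
    }
    if (a[N / 2] < 0.0) printf("unexpected\n");   /* keep the triad from being optimized away */

    /* Rank 0 collects each rank's best bandwidth and flags slow outliers */
    double *all = (rank == 0) ? malloc(size * sizeof *all) : NULL;
    MPI_Gather(&best, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        double avg = 0.0;
        for (int i = 0; i < size; i++) avg += all[i];
        avg /= size;
        for (int i = 0; i < size; i++)
            if (all[i] < 0.8 * avg)
                printf("rank %d: %.1f GB/s, below 80%% of the %.1f GB/s average\n",
                       i, all[i], avg);
        free(all);
    }
    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}
```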
7
Ways To Inspect and Profile Applications
• Computation (CPU/Accelerators)
– Tools: gprof, top, htop, perf top, pstack, Visual Profiler, etc
– Tests and Benchmarks: HPL, STREAM
• File I/O
– Bandwidth and Block Size: iostat, collectl, darshan, etc
– Characterization Tools and Benchmarks: iozone, ior, etc
• Network Interconnect and MPI communications
– Tools and Profilers: perfquery, MPI profilers (IPM, TAU, etc.); see the PMPI sketch below
– Characterization Tools and Benchmarks:
– Latency and Bandwidth: OSU benchmarks, IMB
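MPI profilers such as IPM and TAU rely largely on the standard PMPI profiling interface: every MPI_* routine has a PMPI_* twin, so a tool can interpose its own wrappers and accumulate time per call. The sketch below is a minimal illustration of that mechanism, not the actual IPM implementation; it wraps only MPI_Allreduce and MPI_Barrier and shows how the "% of wall time per MPI call" figures reported later in this deck can be produced.

```c
/* pmpi_time.c - illustrative PMPI interposition sketch; not IPM or TAU source code.
 * Each wrapper forwards to the real MPI library through its PMPI_ entry point and
 * accumulates the time spent, which is how "% of wall time per MPI call" is obtained.
 */
#include <mpi.h>
#include <stdio.h>

static double t_allreduce = 0.0, t_barrier = 0.0, t_start = 0.0;

int MPI_Init(int *argc, char ***argv) {
    int rc = PMPI_Init(argc, argv);
    t_start = PMPI_Wtime();
    return rc;
}

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) {
    double t0 = PMPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    t_allreduce += PMPI_Wtime() - t0;
    return rc;
}

int MPI_Barrier(MPI_Comm comm) {
    double t0 = PMPI_Wtime();
    int rc = PMPI_Barrier(comm);
    t_barrier += PMPI_Wtime() - t0;
    return rc;
}

int MPI_Finalize(void) {
    double wall = PMPI_Wtime() - t_start;
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_Allreduce: %.1f%% of wall time, MPI_Barrier: %.1f%% of wall time\n",
               100.0 * t_allreduce / wall, 100.0 * t_barrier / wall);
    return PMPI_Finalize();
}
```

Real profilers wrap every MPI routine in this way (typically via generated wrappers and link-time or LD_PRELOAD interposition) and also record call counts and message sizes.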
8
Note
• The following research was performed under the HPC Advisory Council activities
– Compute resource - HPC Advisory Council Cluster Center
• The following was done to provide best practices
– BSMBench performance overview
– Understanding BSMBench communication patterns
– Ways to increase BSMBench productivity
• For more info please refer to
– https://gitlab.com/edbennett/BSMBench
9
BSMBench
• Open source supercomputer benchmarking tool
• Based on simulation code used for studying strong interactions in particle physics
• Includes the ability to tune the ratio of communication over computation (illustrated in the sketch after this slide)
• Includes 3 examples that show the performance of the system for
– A problem that is computation dominated (marked as Compute)
– A problem that is communication dominated (marked as Communications)
– A problem in which communication and computational requirements are balanced (marked as Balance)
• Used to simulate workloads such as Lattice Quantum Chromodynamics (QCD) and, by extension, its parent field, Lattice Gauge Theory (LGT), which make up a significant fraction of supercomputing cycles worldwide
• For reference: technical paper published at the 2016 International Conference on High Performance Computing & Simulation (HPCS), Innsbruck, Austria, 2016, pp. 834-839
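To make the tunable communication-to-computation ratio concrete, the toy loop below combines a non-blocking halo exchange (MPI_Isend/MPI_Irecv completed by MPI_Waitall), a local kernel whose cost is scaled by a command-line work factor, and an 8-byte MPI_Allreduce plus MPI_Barrier per iteration; this is the same class of MPI pattern that appears in the profiles on the following slides. It is an illustrative sketch only, not BSMBench source code (BSMBench operates on a 4-D lattice gauge theory), and the names and sizes used here (ratio_sketch.c, NLOCAL, HALO, the work factor) are assumptions for the example.

```c
/* ratio_sketch.c - toy comm-vs-compute loop in the spirit of BSMBench (illustrative only).
 * The work factor controls flops per halo exchange: small -> communication-bound,
 * large -> compute-bound, in between -> balanced.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 100000     /* local sites per rank (toy 1-D decomposition) */
#define HALO   1024       /* doubles exchanged with each neighbour per iteration */
#define ITERS  200

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int work = (argc > 1) ? atoi(argv[1]) : 10;   /* compute multiplier per site */

    /* layout: [left halo | interior (NLOCAL) | right halo] */
    double *field = calloc(NLOCAL + 2 * HALO, sizeof *field);
    double *sendl = field + HALO, *sendr = field + NLOCAL;
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double t0 = MPI_Wtime(), gsum = 0.0;
    for (int it = 0; it < ITERS; it++) {
        MPI_Request req[4];
        /* halo exchange with both neighbours: non-blocking, completed with MPI_Waitall */
        MPI_Irecv(field,                 HALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(field + NLOCAL + HALO, HALO, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(sendr,                 HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(sendl,                 HALO, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* local "compute" kernel; 'work' sets the compute-to-communication ratio */
        double local = 0.0;
        for (long i = HALO; i < NLOCAL + HALO; i++)
            for (int w = 0; w < work; w++)
                local += 0.5 * (field[i - 1] + field[i + 1]) - field[i];

        /* 8-byte global reduction (one double), as seen in the message-size profile */
        MPI_Allreduce(&local, &gsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    if (rank == 0)
        printf("work=%d iters=%d global sum=%g elapsed=%.3f s\n",
               work, ITERS, gsum, MPI_Wtime() - t0);
    free(field);
    MPI_Finalize();
    return 0;
}
```

A small work factor makes the run communication-bound, a large one makes it compute-bound, and intermediate values approximate the balanced case.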
10
Objectives
• The presented research was done to provide best practices
– BSMBench performance benchmarking
• MPI Library performance comparison
• Interconnect performance comparison
• Compilers comparison
• Optimization tuning
• The presented results will demonstrate
– The scalability of the compute environment/application
– Considerations for higher productivity and efficiency
11
Test Cluster Configuration
• Dell PowerEdge R730 32-node (1024-core) “Thor” cluster
– Dual-Socket 16-Core Intel E5-2697Av4 @ 2.60 GHz CPUs (BIOS: Maximum Performance, Home Snoop)
– Memory: 256GB memory, DDR4 2400 MHz, Memory Snoop Mode in BIOS set to Home Snoop
– OS: RHEL 7.2, MLNX_OFED_LINUX-3.4-1.0.0.0 InfiniBand SW stack
• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand Adapters
• Mellanox Switch-IB SB7800 36-port EDR 100Gb/s InfiniBand Switch
• Intel® Omni-Path Host Fabric Interface (HFI) 100Gbps Adapter
• Intel® Omni-Path Edge Switch 100 Series
• Dell InfiniBand-Based Lustre Storage based on Dell PowerVault MD3460 and Dell PowerVault MD3420
• Compilers: Intel Compilers 2016.4.258
• MPI: Intel Parallel Studio XE 2016 Update 4, Mellanox HPC-X MPI Toolkit v1.8
• Application: BSMBench Version 1.0
• MPI Profiler: IPM (from Mellanox HPC-X)
12
BSMBench Profiling – % of MPI Calls
• Major MPI calls (as % of wall time):
– Balance: MPI_Barrier (26%), MPI_Allreduce (6%), MPI_Waitall (5%), MPI_Isend (4%)
– Communications: MPI_Barrier (14%), MPI_Allreduce (5%), MPI_Waitall (5%), MPI_Isend (2%)
– Compute: MPI_Barrier (14%), MPI_Allreduce (5%), MPI_Waitall (5%), MPI_Isend (1%)
[Charts: Balance, Communications, Compute; 32 nodes / 1024 processes]
13
BSMBench Profiling – MPI Message Size Distribution
• Similar communication pattern seen across all 3 examples (message-size attribution is sketched below):
– Balance: MPI_Barrier: 0-byte, 22% wall, MPI_Allreduce: 8-byte, 5% wall
– Communications: MPI_Barrier: 0-byte, 26% wall, MPI_Allreduce: 8-byte, 5% wall
– Compute: MPI_Barrier: 0-byte, 13% wall, MPI_Allreduce: 8-byte, 5% wall
[Charts: Balance, Communications, Compute; 32 nodes / 1024 processes]
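The 0-byte and 8-byte entries follow directly from the calls involved: MPI_Barrier carries no application payload, and an MPI_Allreduce of a single MPI_DOUBLE moves count * type_size = 8 bytes. Profilers bin message sizes the same way; the sketch below (illustrative only, in the same PMPI style as the earlier wrapper, with a toy two-bucket histogram) shows how the size of each MPI_Allreduce can be attributed from its datatype.

```c
/* msg_size.c - sketch of how an MPI profiler bins message sizes (illustrative only). */
#include <mpi.h>
#include <stdio.h>

static long small_msgs = 0, large_msgs = 0;   /* toy 2-bucket "histogram" */

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) {
    int type_size;
    PMPI_Type_size(datatype, &type_size);
    long bytes = (long)count * type_size;     /* 1 x MPI_DOUBLE -> 8 bytes, as in the profile above */
    if (bytes <= 64) small_msgs++; else large_msgs++;
    return PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
}

int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("Allreduce messages: %ld small (<=64 B), %ld larger\n", small_msgs, large_msgs);
    return PMPI_Finalize();
}
```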
14
BSMBench Profiling – Time Spent in MPI
• Communication time across the MPI processes is mostly balanced
– There do not appear to be any significant load imbalances in the communication layer
[Charts: Balance, Communications, Compute; 32 nodes / 1024 processes]
15
BSMBench Performance – Interconnects
• EDR InfiniBand delivers better scalability for BSMBench
– Similar performance between EDR InfiniBand and Omni-Path up to 8 nodes
– Close to 20% performance advantage for InfiniBand at 32 nodes
– Similar performance difference across the three different examples
[Chart: higher is better; 32 MPI processes per node; ~20% advantage at 32 nodes]
16
BSMBench Performance – MPI Libraries
• Comparison between two commercially available MPI libraries
• Intel MPI and HPC-X deliver similar performance
– HPC-X demonstrates a 5% advantage at 32 nodes
[Chart: higher is better; 32 MPI processes per node]
17
Test Cluster Configuration
• IBM OpenPOWER 8-node “Telesto” cluster
• IBM Power System S822LC (8335-GTA)
– IBM: Dual-Socket 10-Core @ 3.491 GHz CPUs, Memory: 256GB memory, DDR3 PC3-14900 MHz
• Wistron OpenPOWER servers
– Wistron: Dual-Socket 8-Core @ 3.867 GHz CPUs. Memory: 224GB memory, DDR3 PC3-14900 MHz
• OS: RHEL 7.2, MLNX_OFED_LINUX-3.4-1.0.0.0 InfiniBand SW stack
• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand Adapters
• Mellanox Switch-IB SB7800 36-port EDR 100Gb/s InfiniBand Switch
• Compilers: GNU compilers 4.8.5, IBM XL Compilers 13.1.3
• MPI: Open MPI 2.0.2, IBM Spectrum MPI 10.1.0.2
• Application: BSMBench Version 1.0
18
BSMBench Performance – SMT
• Simultaneous Multithreading (SMT) allows additional hardware threads for compute
• Additional performance gain is seen with SMT enabled
– Up to 23% performance gain is seen with 4 SMT threads versus no SMT
[Chart: higher is better]
19
BSMBench Performance – MPI Libraries
• Spectrum MPI with MXM support delivers higher performance
– Spectrum MPI provides MXM and PAMI protocols for transport/communications
– Up to 19% higher performance at 4 nodes / 64 cores using Spectrum MPI / MXM
[Chart: higher is better; up to 19% gain at 4 nodes]
20
BSMBench Performance – CPU Architecture
• IBM architecture demonstrates better performance compared to Intel architecture
– Performance gain on a single node is ~23% for Comms and Balance
– Additional gains are seen when more SMT hardware threads are used
– 32 cores per node used for Intel, versus 16 cores used per node for IBM
[Chart: higher is better]
21
BSMBench Summary
• Benchmark for BSM Lattice Physics
– Utilizes both compute and network communications
• MPI Profiling
– Most MPI time is spent on MPI collective operations and non-blocking communications
• Heavy use of MPI collective operations (MPI_Allreduce, MPI_Barrier)
– Similar communication patterns seen across all three examples
• Balance: MPI_Barrier: 0-byte, 22% wall, MPI_Allreduce: 8-byte, 5% wall
• Comms: MPI_Barrier: 0-byte, 26% wall, MPI_Allreduce: 8-byte, 5% wall
• Compute: MPI_Barrier: 0-byte, 13% wall, MPI_Allreduce: 8-byte, 5% wall
Fast network communication is important for scalability
22
BSMBench Summary
• Interconnect comparison
– EDR InfiniBand demonstrates higher scalability beyond 16 nodes as compared to Omni-Path
– EDR InfiniBand delivers nearly 20% higher performance at 32 nodes / 1024 cores
– Similar performance advantage across all three example cases
• Simultaneous Multithreading (SMT) provides additional benefit for compute
– Up to 23% performance gain is seen with 4 SMT threads versus no SMT
• IBM CPU outperforms Intel by approximately 23%
– 32 cores per node used for Intel, versus 16 cores used per node for IBM
• Spectrum MPI provides MXM and PAMI protocols for transport/communications
– Up to 19% higher performance at 4 nodes / 64 cores using Spectrum MPI / MXM
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information
contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein
Thank You