HPC Clusters: Best Practices and Performance Study
Agenda
 HPC at HPE
 System Configuration and Tuning
 Best Practices for Building Applications
 Intel Xeon Processors
 Efficient Methods in Executing Applications
 Tools and Techniques for Boosting Performance
 Application Performance Highlights
 Conclusions
HPC at HPE
HPE’s HPC Market and Share
[Pie chart: vendor share of HPC systems. HPE/HP 33.7%, Dell 17.4%, Lenovo 12.9%, Other 16.0%, Wuxi 5.4%, IBM 3.9%, Sugon (Dawning) 2.7%, Cray 2.4%, SGI 2.4%, Fujitsu 1.3%, NEC 1.2%, Bull Atos 0.9%.]
Sources: IDC HPC Market Share 2016; Top500 List (ISC2016, June 2016)
System Configuration and Tuning
Typical BIOS Settings: Processor Options
– Hyperthreading Options = Disabled : better scaling for most HPC workloads
– Processor Core Disable = 0 : enables all available cores
– Intel Turbo Boost Technology = Enabled : raises the clock frequency (the actual increase depends on workload, power and thermal headroom)
– ACPI SLIT Preferences = Enabled : lets the OS allocate resources efficiently across the processor, memory and I/O subsystems
– QPI Snoop Configuration = Home / Early / COD : experiment and set the right mode for your workload
  Home: high memory bandwidth for average NUMA workloads
  COD (Cluster On Die): increased memory bandwidth for optimized, aggressively NUMA-aware workloads
  Early: lower latency, but possibly lower memory bandwidth than the other two modes
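A quick way to confirm that these settings took effect on a running Linux node (a minimal sketch; exact output formats vary by distribution):

  lscpu | grep -E 'Thread|Core|Socket|NUMA'      # Thread(s) per core = 1 when Hyper-Threading is disabled
  numactl --hardware                             # COD mode typically exposes two NUMA nodes per socket
  grep "model name" /proc/cpuinfo | uniq         # confirms the processor model and base clock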
Typical BIOS Settings: Power Settings and Management
– HPE Power Profile should be set to Maximum Performance for the best performance (idle and average power will increase significantly).
– The Custom Power Profile reduces idle and average power at the expense of a 1-2% performance reduction.
– To get the highest Turbo clock speeds when only some of the cores are in use, use the Power Savings settings.
– For the Custom Power Profile, the individual power-management options must be set explicitly.
Best Practices for Building Applications
Building Applications: Intel Compiler Flags
-O2 : enable optimizations (= -O, the default)
-O1 : optimize for speed, but disable some optimizations that increase code size for a small speed benefit
-O3 : enable -O2 plus more aggressive optimizations that may or may not improve performance for all programs
-fast : enable -O3 -ipo -static
-xHOST : optimize code for the node used for compilation
-xAVX : enable the Advanced Vector Extensions instruction set (for Ivy Bridge performance)
-xCORE-AVX2 : enable AVX2 (key to Haswell/Broadwell performance)
-xMIC-AVX512 : enable AVX-512 (for future KNL/Skylake based systems)
-mp : maintain floating-point precision (disables some optimizations)
-parallel : enable the auto-parallelizer to generate multi-threaded code
-openmp : generate multi-threaded parallel code based on OpenMP directives
-ftz : enable/disable flushing denormalized results to zero
-opt-streaming-stores [always|auto|never] : control generation of streaming stores
-mcmodel=[small|medium|large] : control code and data memory allocation
-fp-model=[fast|precise|source|strict] : control the floating-point model
-mkl=[parallel|sequential|cluster] : link against the Intel MKL library to build optimized code
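For example, a typical Intel-compiler build combining several of these flags might look like the following sketch (the source file names and the exact flag set are placeholders to adapt to your application):

  ifort -O3 -xCORE-AVX2 -openmp -ftz -fp-model precise -mkl=parallel -o mycode.exe mycode.f90
  icc   -O3 -xCORE-AVX2 -openmp -ftz -o mycode.exe mycode.c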
Building Applications: Compiling Thread Parallel Codes
pgf90 -mp -O3 -Mextend -Mcache_align -k8-64 ftn.f
pathf90 -mp -O3 -extend_source -march=opteron ftn.f
ifort -openmp -O3 -132 -i_dynamic -ftz -IPF_fma ftn.f
pgcc -mp -O3 -Mcache_align -k8-64 code.c
opencc -mp -O3 -march=opteron code.c
icc -openmp -O3 -i_dynamic -ftz -IPF_fma code.c
Combination Flags
Intel: -fast => -O3 -ipo -static
PGI: -fast => -O2 -Munroll -Mnoframe
Open64: -Ofast => -O3 -ipa -OPT:Ofast -fno-math-errno
Notes:
• Must compile and link with -mp / -openmp
• Aggressive optimizations may compromise accuracy
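Once built with -openmp (or -mp), the thread count and placement are controlled at run time. A minimal sketch, assuming an Intel-built binary on a 2-socket node (the values are examples only):

  export OMP_NUM_THREADS=28                      # one thread per physical core
  export KMP_AFFINITY=compact,granularity=fine   # pin threads to cores (Intel OpenMP runtime)
  ./mycode.exe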
mpicc : C compiler wrapper to build parallel code
mpiCC : C++ compiler wrapper
mpif77 : Fortran 77 compiler wrapper
mpif90 : Fortran 90 compiler wrapper
mpirun : command to launch an MPI parallel job
Environment Variables to specify the Compilers to use:
export I_MPI_CC=icc
export I_MPI_CXX=icpc
export I_MPI_F90=ifort
export I_MPI_F77=ifort
Building Applications: Compiling MPI based Codes
Building Applications: Compiling MPI based Codes (Contd…)
mpif90 -O3 -Mextend -Mcache_align -k8-64 ftn.f
mpif90 -O3 -extend_source -march=opteron ftn.f
mpif90 -O2 -xHOST -fp-model strict -openmp ftn.f
mpicc -O3 -Mcache_align -k8-64 code.c
mpicc -O3 -march=opteron code.c
mpicc -O3 -xCORE-AVX2 -openmp -ftz -IPF_fma code.c
The compiler and interface chosen depend on:
what is defined in your PATH variable
what is defined by (for Intel MPI):
• I_MPI_CC, I_MPI_CXX
• I_MPI_F77, I_MPI_F90
A quick way to verify which compiler a wrapper will pick is shown below.
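A hedged sketch of that verification: the wrappers can echo the full command line they would run (the -show option is standard for Intel MPI and MPICH-derived wrappers):

  export I_MPI_CC=icc
  export I_MPI_F90=ifort
  mpicc -show     # prints the icc command line the wrapper would invoke
  mpif90 -show    # prints the ifort command line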
Intel Xeon Processors
Intel Xeon Processors: Turbo, AVX and more
[Processor specification tables shown on the original slides; complete specifications at:
http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-spec-update.html]
Intel publishes 4 different reference frequencies for every Xeon Processor:
1. Base Frequency 2. Non-AVX Turbo 3. AVX Base Frequency 4. AVX Turbo
• Turbo clock for a given model can vary as much as 5% from one processor to another
• Four possible scenarios exist:
• Turbo=OFF and AVX=NO => Clock is set to Base frequency
• Turbo=ON and AVX=NO => Clock range will be from Base to Non-AVX Turbo
• Turbo=OFF and AVX=YES => Clock range will be from AVX Base to Base Frequency
• Turbo=ON and AVX=YES => Clock range will be from AVX Base to AVX Turbo
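The frequency actually achieved under load can be observed from Linux. A minimal sketch, assuming the cpupower and turbostat utilities are installed (package names and option spellings vary by distribution and version):

  cpupower frequency-info                    # reports the driver, governor and available frequency range
  turbostat --interval 5                     # per-core actual clock (Bzy_MHz) while the workload runs
  grep MHz /proc/cpuinfo | sort | uniq -c    # coarse snapshot of current core clocks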
Efficient Methods in Executing Applications
Running Parallel Programs in a Cluster: Intel MPI
– Environment in general
  – export PATH
  – export LD_LIBRARY_PATH
  – export MPI_ROOT
  – export I_MPI_FABRICS=shm:dapl
  – export I_MPI_DAPL_PROVIDER=ofa-v2-mlx5_0-1u
  – export NPROCS=256
  – export PPN=16
  – export I_MPI_PIN_PROCESSOR_LIST=0-15
  – export OMP_NUM_THREADS=2
  – export KMP_STACKSIZE=400M
  – export KMP_SCHEDULE=static,balanced
– Example command using Intel MPI
  – time mpirun -np $NPROCS -hostfile ./hosts -genvall -ppn $PPN -genv I_MPI_PIN_DOMAIN=omp ./myprogram.exe
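To verify that ranks and threads land where intended, Intel MPI can print its pinning map at startup. A hedged sketch (the debug level shown is an assumption; higher levels print more detail):

  export I_MPI_DEBUG=4        # prints the rank-to-core pinning map when the job starts
  mpirun -np $NPROCS -ppn $PPN -hostfile ./hosts ./myprogram.exe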
Profiling a Parallel Program: Intel MPI
– Using Intel MPS (MPI Performance Snapshot)
  – Set all the environment variables needed to run your Intel MPI based application
  – Additionally source:
    source /opt/intel/16.0/itac/9.1.2.024/intel64/bin/mpsvars.sh -papi | vtune
  – Run your application as: mpirun -mps -np $NPROCS -hostfile ./hosts ….
  – Two files, app_stat_xxx.txt and stats_xxx.txt, will be available at the end of the job.
  – Analyze these *.txt files using the mps tool
  – Sample data you can gather:
    – Computation Time: 174.54 sec 51.93%
    – MPI Time: 161.58 sec 48.07%
    – MPI Imbalance: 147.27 sec 43.81%
    – OpenMP Time: 155.79 sec 46.35%
    – I/O wait time: 576.47 sec ( 0.08 %)
– Using Intel MPI built-in profiling capabilities
  – Native mode: mpirun -env I_MPI_STATS 1-4 -env I_MPI_STATS_FILE native_1to4.txt …
  – IPM mode: mpirun -env I_MPI_STATS ipm -env I_MPI_STATS_FILE ipm_full.txt
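Putting the MPS steps together, a profiling run might look like the following sketch (the paths are those from the slide; the final analysis invocation is an assumption, check mps --help on your version):

  source /opt/intel/16.0/itac/9.1.2.024/intel64/bin/mpsvars.sh -papi
  mpirun -mps -np $NPROCS -hostfile ./hosts ./myprogram.exe
  mps ./app_stat_*.txt ./stats_*.txt     # assumed invocation; produces the summary report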
Tools and Techniques for Boosting Performance
Tools, Techniques and Commands
– Check the Linux pseudo files and confirm the system details
  – cat /proc/cpuinfo : processor details (see also Intel's cpuinfo.x tool)
  – cat /proc/meminfo : memory details
  – /usr/sbin/ibstat : InfiniBand fabric details
  – /sbin/sysctl -a : system details (kernel, file system, etc.)
  – /usr/bin/lscpu : CPU details, including cache sizes
  – /usr/bin/lstopo : hardware topology
  – /bin/uname -a : system information
  – /bin/rpm -qa : list of installed packages, including versions
  – cat /etc/redhat-release : Red Hat release version
  – /usr/sbin/dmidecode : system hardware and other details (requires root)
  – /bin/dmesg : system boot-up messages
  – /usr/bin/numactl : checks or sets the NUMA policy for processes or shared memory
  – /usr/bin/taskset : sets or retrieves a process's CPU affinity (core binding)
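For example, explicit placement with the last two tools might look like the following sketch (the binary name and core ranges are placeholders):

  numactl --hardware                              # list NUMA nodes, their cores and memory
  numactl --cpunodebind=0 --membind=0 ./a.out     # run on NUMA node 0 with local memory only
  taskset -c 0-15 ./a.out                         # restrict the process to cores 0-15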
Top10 Practical Tips for Boosting Performance
– Check the system details thoroughly (never assume!)
– Choose a compiler and MPI to build your application (they are not all the same!)
– Start with some basic compiler flags and try additional flags one at a time (optimization is incremental!)
– Use the built-in libraries and tools to save time and improve performance (libraries and tools are your friends!)
– Change the compiler and MPI if your code fails to compile or run correctly (trying to fix things is futile!)
– Test your application at every level to arrive at an optimized code (remember the 80-20 rule!)
– Customize your runtime environment to achieve the desired goals (process-parallel, hybrid runs, etc.)
– Always place and bind processes and threads appropriately (a life saver!)
– Gather, check and correct your runtime environment (what you get may not be what you want!)
– Profile, then adjust the optimization and runtime environments accordingly (exercise caution!)
Application Performance Highlights
Application: High Performance Linpack (HPL)
Description:
Benchmark to measure floating point rates (and times) by solving a random dense linear system of
equations in double-precision.
• Originally developed by Jack Dongarra at the University of Tennessee.
• An Intel-optimized HPL binary was used for this study.
• The code was run in hybrid mode: one MPI process per processor (socket), each launching as many threads as there are cores on that socket (a launch sketch follows the link below).
• Threads were explicitly placed and bound.
• Various problem sizes and other parameters were tried to identify the best performance.
• The code provides a self-check to validate the results.
Additional details at: http://icl.eecs.utk.edu/hpl/
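A sketch of such a hybrid launch, assuming a 2-socket, 14-core-per-socket node and an Intel-optimized binary (the binary name, thread count and pinning settings are illustrative, not the exact configuration used in this study):

  export OMP_NUM_THREADS=14                          # threads per rank = cores per socket
  export KMP_AFFINITY=compact,granularity=fine       # bind threads within the socket
  mpirun -np 2 -ppn 2 -genv I_MPI_PIN_DOMAIN=socket ./xhpl
  # The problem size N in HPL.dat is typically chosen to use most of the node memory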
Proc. Type   Processor    Clock (GHz)  Cores/proc  Cores/node  TDP (W)  L3 (MB)  Rpeak (GFLOPS)  Rmax (GFLOPS)  % Peak
Ivy Bridge   E5-2695 v2   2.4          12          24          115      30       461             458            99.35
Ivy Bridge   E5-2670 v2   2.5          10          20          115      25       400             389            97.25
Haswell      E5-2697 v3   2.6          14          28          145      35       1160            934            80.52
Haswell      E5-2698 v3   2.3          16          32          135      40       1180            943            79.92
Broadwell    E5-2690 v4   2.6          14          28          135      35       1164            1073           92.18
Broadwell    E5-2697 v4   2.3          18          36          145      45       1324            1186           89.58
HPL Performance from a Single Node
[Chart: HPL on One Node, Rpeak (GFLOPS), Rmax (GFLOPS) and % Peak per processor; the values plotted are those in the table above.]
HPL Performance from a Haswell Cluster
[Charts: HPL on a Haswell cluster, Rpeak, Rmax (GFLOPS) and % Peak versus node count.]
HPL efficiency (% of peak) by node count:
8 nodes: 81.08, 16: 79.41, 40: 77.91, 80: 78.06, 120: 77.09
200 nodes: 76.21, 280: 75.95, 288: 75.95, 300: 75.74, 600: 75.65, 1000: 75.49, 1200: 75.55
System: BL460c Gen9, Intel E5-2680 v3, 2.5 GHz, 2P/24C, 128 GB (DDR4-2133 MHz) Memory, RHEL 6.5, IB/FDR 1:1,
Intel MPI 5.0.1, Intel Composer XE 15.0.0, Turbo ON, Hyperthreading OFF
Application: High Performance Conjugate Gradient (HPCG)
Description:
Benchmark designed to create a new metric for ranking HPC systems, complementing the current HPL
benchmark. HPCG is designed to exercise computational and data access patterns that more closely match a
different and broad set of important HPC applications.
• Supports various operations in a standalone, unified code.
• The reference implementation is written in C++ with MPI and OpenMP support.
• Driven by a multigrid-preconditioned conjugate gradient algorithm that exercises the key kernels on a nested set of coarse grids.
• Unlike HPL, HPCG can be run for a predetermined time (an input parameter).
• The local domain size (an input parameter) for one node is replicated to build the global domain, giving near-linear weak scaling (see the input sketch after the link below).
• Performance is measured by the GFLOP/s rating reported by the code.
• An Intel-optimized HPCG binary was used for this benchmark study.
• The HPCG binary was run in hybrid mode: MPI processes plus OpenMP threads.
Additional details at: http://hpcg-benchmark.org/
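The run length and local domain size are set in the hpcg.dat input file: two comment lines, then the local domain dimensions nx ny nz, then the target run time in seconds. A minimal sketch, assuming the standard input format and binary name (the values shown are illustrative, not the ones used in this study):

  HPCG benchmark input file
  Example run
  192 192 192
  1800

  mpirun -np $NPROCS -ppn $PPN ./xhpcg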
High Performance Conjugate Gradient (HPCG)
[Chart: HPCG on a Haswell cluster (Blade1, standard distribution), GFLOPS and scaling efficiency versus node count.]
Scaling efficiency (%) by node count: 1 node: 100.00, 10: 97.99, 20: 97.23, 40: 99.36, 50: 97.79, 80: 97.60, 100: 93.73, 200: 92.32, 300: 84.01, 400: 84.62, 500: 75.55
[Chart: HPCG, Haswell with FDR versus Broadwell with EDR (systems HSW1 and BDW1, Intel-optimized binary), GFLOPS versus node count.]
Broadwell/EDR improvement (%) over Haswell/FDR by node count: 1 node: 6.81, 4: 6.56, 8: 6.01, 16: 5.11, 24: 6.07, 32: 4.34, 48: 5.12, 64: 0.81
Blade1: BL460c Gen9, Intel E5-2680 v3, 2.5 GHz, 2P/24C, 128 GB (DDR4-2133 MHz) Memory, RHEL 6.5, IB/FDR 1:1, Intel MPI 5.0.1, Intel Composer XE 15.0.0, Turbo ON, Hyperthreading OFF
HSW1: XL170r Gen9, Intel E5-2698 v3, 2.3 GHz, 2P/32C, 128 GB (DDR4-2133 MHz) Memory, RHEL 6.5, IB/FDR 1:1, Intel MPI 5.0.1, Intel Composer XE 15.0.0, Turbo ON, Hyperthreading OFF
BDW1: XL170r Gen9, Intel E5-2698 v4, 2.2 GHz, 2P/40C, 128 GB (DDR4-2133 MHz) Memory, RHEL 6.6, IB/EDR 1:1, Intel MPI / Intel Compiler 2016.2.181, Turbo ON, Hyperthreading OFF
Application: Graph500
Description:
Benchmark designed to address the performance of data-intensive HPC applications using graph algorithms.
• The code generates the problem from a scale parameter (input), creating 2^scale vertices (e.g., scale 26 gives 2^26, about 67 million vertices).
• Performance is measured in TEPS (Traversed Edges Per Second).
• The median TEPS, in either GTEPS (giga-TEPS) or MTEPS (mega-TEPS), is reported.
• A source code optimized for scale-out (distributed-memory) systems by Kyushu University (Japan) was used.
• The application is written in C.
• Compiled with the GNU compiler, gcc.
• The code automatically detects the number of processors and cores and runs optimally.
• No external placement or binding by the user is needed.
• A large memory footprint is needed to run very large problems.
Additional details at: http://www.graph500.org/
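For reference, a run of the standard MPI reference implementation looks roughly like the sketch below; the Kyushu-optimized code used in this study has its own build and launch procedure, so the binary name and arguments here are assumptions:

  mpirun -np $NPROCS -ppn $PPN ./graph500_reference_bfs 26    # the argument is the scale (2^26 vertices)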
Graph500 Performance from a Haswell Cluster
[Chart: Graph500 (DMP version), median GTEPS versus cluster size.]
Median GTEPS by nodes/MPI ranks: 8/32: 39.87, 16/64: 74.27, 32/128: 133.39, 48/192: 179.14, 64/256: 212.42, 96/384: 264.15, 128/512: 281.57
System (Hestia): BL460c Gen9, Intel64/Haswell E5-2697 v3, 2.60 GHz, 2 processors/node, 28 cores/node, 128 GB memory (8x16 GB, 2R, DDR4 @ 2133 MHz), RHEL 6.6, IB/FDR (Connect-IB), Intel MPI 5.0.2.044, NFS file system
Application: Weather Research and Forecasting (WRF)
Description:
WRF is a Numerical Weather Prediction (NWP) model designed to serve both atmospheric research and
operational forecasting needs. NWP refers to the simulation and prediction of the atmosphere with a computer
model, and WRF is a set of software to accomplish this.
• The code was jointly developed by NCAR, NCEP, FSL, AFWA, NRL, the University of Oklahoma and the FAA.
• WRF is freely distributed and supported by NCAR.
• Offers two dynamical solvers: WRF-ARW (Advanced Research WRF) and WRF-NMM (Nonhydrostatic Mesoscale Model).
• Modules can be mixed and matched to simulate various atmospheric conditions and coupled with other NWP models (e.g., ocean modeling codes).
• Can accommodate simulations with nested data domains, from coarse to very fine grids, in a single run.
• Popular data sets to port and optimize against are CONUS 12 km and CONUS 2.5 km (available from NCAR).
• Options exist to dedicate processors to I/O (quilting) and to vary the processor layout (tiling); a sketch of both follows the link below.
Additional details at: http://www.wrf-model.org/index.php
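A hedged sketch of how quilting and tiling are typically enabled in the WRF namelist.input and how the run is launched (the namelist variables are standard WRF options, but the values and launch settings are placeholders, not the configuration used for the results below):

  &namelist_quilt
   nio_tasks_per_group = 2,     ! dedicated I/O tasks per group (quilting)
   nio_groups          = 1,
  /
  &domains
   numtiles = 4,                ! tiles per MPI patch (tiling)
  /

  mpirun -np $NPROCS -ppn $PPN ./wrf.exe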
WRF (v 3.8.1) Results with CONUS 2.5km Data Set
[Chart: WRF scaling with CONUS 2.5 km, average time per step (s) and speedup versus node count (4 to 128 nodes).]
WRF sustained GFLOPS with CONUS 2.5 km by node count:
4 nodes: 202, 8: 448, 16: 1017, 24: 1585, 32: 2034, 48: 3234, 64: 4065, 96: 5245, 128: 7964
System: XL170r Gen9, Intel E5-2699 v4, 2.2 GHz, 2P/44C, 256 GB (DDR4-2133 MHz) Memory, RHEL 6.7, IB/EDR 1:1,
Intel Composer XE and Intel MPI (2016.3.210), Turbo ON, Hyperthreading OFF
Application: Clover Leaf
Description:
Clover Leaf is an open-source Computational Fluid Dynamics (CFD) code developed and distributed by UK
Mini-Application Consortium (UK-MAC).
• Solves the compressible Euler equations on a Cartesian grid using an explicit, second-order-accurate method.
• Uses a 'kernel' (low-level building block) approach with minimal control logic to improve compiler optimization.
• Support for accelerators (both OpenACC and OpenCL) is available.
• Trades memory for speed (intermediate results are stored rather than recomputed) to improve performance.
• Available in two flavors: 2-dimensional (2D) and 3-dimensional (3D) modeling.
• Rugged, easy to port and run, and a good candidate for evaluating and comparing systems.
• A large number of data sets are available for the 2D and 3D models, with run times controllable from a few seconds to hours; a run sketch follows the link below.
Additional details at: http://uk-mac.github.io/CloverLeaf/
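A minimal run sketch, assuming the MPI build of CloverLeaf 3D and the bm256 benchmark deck (the input and output file names are assumptions; CloverLeaf reads its input from clover.in in the working directory):

  cp clover_bm256.in clover.in         # select the bm256 benchmark deck
  mpirun -np $NPROCS -ppn $PPN ./clover_leaf
  grep "Wall clock" clover.out         # the reported wall clock is the figure of merit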
Clover Leaf (3D) Results with bm256 Data Set
[Chart: Clover Leaf 3D scaling with the bm256 data set, wall clock (s) and speedup versus node count (1 to 128 nodes).]
System: XL230a Gen9, Intel E5-2697A v4, 2.6 GHz, 2P/32C, 128 GB (DDR4-2400 MHz) Memory, RHEL 7.2,
IB/EDR 1:1, Intel Composer XE and Intel MPI (2016.3.210), Turbo ON, Hyperthreading OFF
Conclusions
 HPE is the No. 1 vendor in HPC and cluster solutions
 Configure and tune the system first
 Check the system details (processor, clock, memory and BIOS settings)
 Investigate compiler and flags that best suit your application
 Profile the application and optimize further for boosting performance
 Explore and decide on the right interconnect and protocols
 Take advantage of tools and commands to improve performance
 Run the application the right way (environment, placement etc.)
 Choose the right file system (local disk, NFS, Lustre, IBRIX etc.)
 Settle on an environment that is best for your application, time and value
 Never assume and always check the cluster before benchmarking
Thank you
logan.sankaran@hpe.com

More Related Content

What's hot

Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
inside-BigData.com
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
HPCC Systems
 
ARM HPC Ecosystem
ARM HPC EcosystemARM HPC Ecosystem
ARM HPC Ecosystem
inside-BigData.com
 
BKK16-305B ILP32 Performance on AArch64
BKK16-305B ILP32 Performance on AArch64BKK16-305B ILP32 Performance on AArch64
BKK16-305B ILP32 Performance on AArch64
Linaro
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
Ganesan Narayanasamy
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
Yutaka Kawai
 
Introduction to architecture exploration
Introduction to architecture explorationIntroduction to architecture exploration
Introduction to architecture exploration
Deepak Shankar
 
Xilinx vs Intel (Altera) FPGA performance comparison
Xilinx vs Intel (Altera) FPGA performance comparison Xilinx vs Intel (Altera) FPGA performance comparison
Xilinx vs Intel (Altera) FPGA performance comparison
Roy Messinger
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
inside-BigData.com
 
DOME 64-bit μDataCenter
DOME 64-bit μDataCenterDOME 64-bit μDataCenter
DOME 64-bit μDataCenter
inside-BigData.com
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputing
inside-BigData.com
 
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Deepak Shankar
 
The Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft ProcessorThe Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft Processor
Deepak Tomar
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
Ganesan Narayanasamy
 
Lenovo HPC Strategy Update
Lenovo HPC Strategy UpdateLenovo HPC Strategy Update
Lenovo HPC Strategy Update
inside-BigData.com
 
OpenCAPI next generation accelerator
OpenCAPI next generation accelerator OpenCAPI next generation accelerator
OpenCAPI next generation accelerator
Ganesan Narayanasamy
 
Intel dpdk Tutorial
Intel dpdk TutorialIntel dpdk Tutorial
Intel dpdk Tutorial
Saifuddin Kaijar
 
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
AMD
 
IPMI is dead, Long live Redfish
IPMI is dead, Long live RedfishIPMI is dead, Long live Redfish
IPMI is dead, Long live Redfish
Bruno Cornec
 

What's hot (20)

Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
ARM HPC Ecosystem
ARM HPC EcosystemARM HPC Ecosystem
ARM HPC Ecosystem
 
BKK16-305B ILP32 Performance on AArch64
BKK16-305B ILP32 Performance on AArch64BKK16-305B ILP32 Performance on AArch64
BKK16-305B ILP32 Performance on AArch64
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
 
Introduction to architecture exploration
Introduction to architecture explorationIntroduction to architecture exploration
Introduction to architecture exploration
 
Xilinx vs Intel (Altera) FPGA performance comparison
Xilinx vs Intel (Altera) FPGA performance comparison Xilinx vs Intel (Altera) FPGA performance comparison
Xilinx vs Intel (Altera) FPGA performance comparison
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
DOME 64-bit μDataCenter
DOME 64-bit μDataCenterDOME 64-bit μDataCenter
DOME 64-bit μDataCenter
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputing
 
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
 
The Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft ProcessorThe Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft Processor
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
Lenovo HPC Strategy Update
Lenovo HPC Strategy UpdateLenovo HPC Strategy Update
Lenovo HPC Strategy Update
 
OpenCAPI next generation accelerator
OpenCAPI next generation accelerator OpenCAPI next generation accelerator
OpenCAPI next generation accelerator
 
Intel dpdk Tutorial
Intel dpdk TutorialIntel dpdk Tutorial
Intel dpdk Tutorial
 
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
 
IPMI is dead, Long live Redfish
IPMI is dead, Long live RedfishIPMI is dead, Long live Redfish
IPMI is dead, Long live Redfish
 

Similar to Best Practices and Performance Studies for High-Performance Computing Clusters

Multicore
MulticoreMulticore
Balancing Power & Performance Webinar
Balancing Power & Performance WebinarBalancing Power & Performance Webinar
Balancing Power & Performance Webinar
Qualcomm Developer Network
 
Creating an Embedded System Lab
Creating an Embedded System LabCreating an Embedded System Lab
Creating an Embedded System Lab
Nonamepro
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Intel® Software
 
AIX Advanced Administration Knowledge Share
AIX Advanced Administration Knowledge ShareAIX Advanced Administration Knowledge Share
AIX Advanced Administration Knowledge Share
.Gastón. .Bx.
 
Software Variability Management
Software Variability ManagementSoftware Variability Management
Software Variability Management
XavierDevroey
 
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Talal Khaliq
 
Intel Distribution for Python - Scaling for HPC and Big Data
Intel Distribution for Python - Scaling for HPC and Big DataIntel Distribution for Python - Scaling for HPC and Big Data
Intel Distribution for Python - Scaling for HPC and Big Data
DESMOND YUEN
 
Measurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNetMeasurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNet
Vasyl Senko
 
php & performance
 php & performance php & performance
php & performance
simon8410
 
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Cheer Chain Enterprise Co., Ltd.
 
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CanSecWest
 
Emulation Error Recovery
Emulation Error RecoveryEmulation Error Recovery
Emulation Error Recovery
somnathb1
 
PHP & Performance
PHP & PerformancePHP & Performance
PHP & Performance
毅 吕
 
Computer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerComputer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and Microcontroller
AmrutaMehata
 
Using the Cypress PSoC Processor
Using the Cypress PSoC ProcessorUsing the Cypress PSoC Processor
Using the Cypress PSoC Processor
LloydMoore
 
TechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching ProgrammabilityTechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching Programmability
Robb Boyd
 
Presentation aix performance updates & issues
Presentation   aix performance updates & issuesPresentation   aix performance updates & issues
Presentation aix performance updates & issues
solarisyougood
 

Similar to Best Practices and Performance Studies for High-Performance Computing Clusters (20)

Multicore
MulticoreMulticore
Multicore
 
Balancing Power & Performance Webinar
Balancing Power & Performance WebinarBalancing Power & Performance Webinar
Balancing Power & Performance Webinar
 
Creating an Embedded System Lab
Creating an Embedded System LabCreating an Embedded System Lab
Creating an Embedded System Lab
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
AIX Advanced Administration Knowledge Share
AIX Advanced Administration Knowledge ShareAIX Advanced Administration Knowledge Share
AIX Advanced Administration Knowledge Share
 
Software Variability Management
Software Variability ManagementSoftware Variability Management
Software Variability Management
 
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
 
Intel Distribution for Python - Scaling for HPC and Big Data
Intel Distribution for Python - Scaling for HPC and Big DataIntel Distribution for Python - Scaling for HPC and Big Data
Intel Distribution for Python - Scaling for HPC and Big Data
 
Measurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNetMeasurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNet
 
Hyper thread technology
Hyper thread technologyHyper thread technology
Hyper thread technology
 
php & performance
 php & performance php & performance
php & performance
 
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
 
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
 
Emulation Error Recovery
Emulation Error RecoveryEmulation Error Recovery
Emulation Error Recovery
 
PHP & Performance
PHP & PerformancePHP & Performance
PHP & Performance
 
Computer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerComputer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and Microcontroller
 
Agnostic Device Drivers
Agnostic Device DriversAgnostic Device Drivers
Agnostic Device Drivers
 
Using the Cypress PSoC Processor
Using the Cypress PSoC ProcessorUsing the Cypress PSoC Processor
Using the Cypress PSoC Processor
 
TechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching ProgrammabilityTechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching Programmability
 
Presentation aix performance updates & issues
Presentation   aix performance updates & issuesPresentation   aix performance updates & issues
Presentation aix performance updates & issues
 

More from Intel® Software

AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology
Intel® Software
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Intel® Software
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Intel® Software
 
AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.
Intel® Software
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Intel® Software
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Intel® Software
 
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Intel® Software
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI Research
Intel® Software
 
Intel Developer Program
Intel Developer ProgramIntel Developer Program
Intel Developer Program
Intel® Software
 
Intel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview Slides
Intel® Software
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
Intel® Software
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
Intel® Software
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Intel® Software
 
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Intel® Software
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Intel® Software
 
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
Intel® Software
 
AIDC India - AI on IA
AIDC India  - AI on IAAIDC India  - AI on IA
AIDC India - AI on IA
Intel® Software
 
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesAIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino Slides
Intel® Software
 
AIDC India - AI Vision Slides
AIDC India - AI Vision SlidesAIDC India - AI Vision Slides
AIDC India - AI Vision Slides
Intel® Software
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Intel® Software
 

More from Intel® Software (20)

AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and Anaconda
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
 
AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
 
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI Research
 
Intel Developer Program
Intel Developer ProgramIntel Developer Program
Intel Developer Program
 
Intel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview Slides
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
 
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
 
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
 
AIDC India - AI on IA
AIDC India  - AI on IAAIDC India  - AI on IA
AIDC India - AI on IA
 
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesAIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino Slides
 
AIDC India - AI Vision Slides
AIDC India - AI Vision SlidesAIDC India - AI Vision Slides
AIDC India - AI Vision Slides
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
 

Recently uploaded

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 

Recently uploaded (20)

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 

Best Practices and Performance Studies for High-Performance Computing Clusters

  • 1. HPC Clusters: Best Practices and Performance Study
  • 2. Agenda  HPC at HPE  System Configuration and Tuning  Best Practices for Building Applications  Intel Xeon Processors  Efficient Methods in Executing Applications  Tools and Techniques for Boosting Performance  Application Performance Highlights  Conclusions 2
  • 4. HPE’s HPC Market and Share 4 HPE/HP, 33.7% Dell, 17.4%Lenovo, 12.9% IBM, 3.9% Cray, 2.4% SGI, 2.4%Sugon (Dawning) , 2.7% Fujitsu, 1.3% NEC, 1.2% Bull Atos, 0.9% Other, 16.0% Wuxi, 5.4% IDC HPC Market Share 2016 Top500 List (ISC2016, June 2016)
  • 6. Typical BIOS Settings: Processor Options – Hyperthreading Options Disabled : Better scaling for HPC workloads – Processor Core Disable 0 : Enables all available cores – Intel Turbo Boost Technology Enabled : Increases clock frequency (increase affected by factors). – ACPI SLIT Preferences Enabled : OS can improve performance by efficient allocation of resources among processor, memory and I/O subsystems. – QPI Snoop Configuration Home/Early/COD : Experiment and set the right configuration for your workload. Home: High Memory Bandwidth for average NUMA workloads. COD: Cluster On Die, Increased Memory Bandwidth for optimized and aggressive NUMA workloads. Early: Decreases latency but may also decrease memory bandwidth compared to other two modes. 6
  • 7. Typical BIOS Settings: Power Settings and Management – HPE Power Profile should be set to Maximum Performance to get best performance (idle and average power will increase significantly). – Custom Power Profile will reduce idle and average power at the expense of 1-2% performance reduction. – To get highest Turbo clock speeds (when partial cores are used), use Power Savings Settings. – For Custom Power Profile, you will have to set the following additional settings: 7
  • 8. Best Practices for Building Applications 8
  • 9. 9 -O2 enable optimizations ( = -O, Default) -O1 optimize for maximum speed, but disable some optimizations which increases code size for small speed benefit -O3 enable –O2 plus more aggressive optimizations that may or may not improve performance for all programs. -fast enable –O3 –ipo –static -xHOST optimize code based on the native node used for compilation -xAVX enable advanced vector instructions set (for Ivy Bridge performance) -xCORE-AVX2 enable advanced vector instructions set 2 (key Haswell/Broadwell performance) -xMIC-AVX512 enable advanced vector instructions set 512 (for future KNL/SkyLake based systems) -mp maintain floating-point precision (disables some optimizations) -parallel enable the auto parallelizer to generate multi-threaded code -openmp generate multi-threaded parallel code based on OpenMP directives -ftz enable/disable flushing denormalized results to zero -opt-streaming-stores [always auto never] generates streaming stores -mcmodel=[small medium large] controls the code and data memory allocation -fp-model=[fast precise source strict] controls floating point model variation -mkl=[parallel sequential cluster] link to Intel MKL Lib. to build optimized code. Building Applications: Intel Compiler Flags
Open64: -Ofast => -O3 -ipa -OPT:Ofast -fno-math-errno

Notes:
• Must compile and link with –mp / –openmp
• Aggressive optimizations may compromise accuracy
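As a concrete illustration of that accuracy note (file names are placeholders, not from this study): the two builds below differ only in the floating-point model, and comparing their results on a short validation case is a quick way to decide whether the aggressive setting is acceptable.

ifort -openmp -O3 -xCORE-AVX2 -fp-model fast=2 -o app_fast ftn.f      # fastest; may reassociate and fuse FP operations
ifort -openmp -O3 -xCORE-AVX2 -fp-model precise -o app_precise ftn.f  # value-safe optimizations only, usually a few percent slower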
Building Applications: Compiling MPI based Codes

mpicc    C compiler wrapper to build parallel code
mpiCC    C++ compiler wrapper
mpif77   Fortran77 compiler wrapper
mpif90   Fortran90 compiler wrapper
mpirun   command to launch an MPI parallel job

Environment variables to specify the compilers to use:
export I_MPI_CC=icc
export I_MPI_CXX=icpc
export I_MPI_F90=ifort
export I_MPI_F77=ifort
Building Applications: Compiling MPI based Codes (Contd.)

mpif90 –O3 –Mextend –Mcache_align –k8-64 ftn.f
mpif90 –O3 –extend_source –march=opteron ftn.f
mpif90 –O2 -xHOST –fp-model strict -openmp ftn.f
mpicc –O3 –Mcache_align –k8-64 code.c
mpicc –O3 –march=opteron code.c
mpicc –O3 -xAVX2 -openmp –ftz –IPF_fma code.c

The compilers and interface chosen depend on:
• what is defined in your PATH variable
• what is defined by (for Intel MPI):
  • I_MPI_CC, I_MPI_CXX
  • I_MPI_F77, I_MPI_F90
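As a quick sanity check that the wrappers, compilers and launcher line up, a short session like the one below can be run before building a real application (the source file and host file names are placeholders; the flags assume the Intel compiler under Intel MPI, as above):

export I_MPI_CC=icc                        # make mpicc wrap the Intel C compiler
mpicc -O2 -xHOST -o hello_mpi hello_mpi.c  # hello_mpi.c: any trivial MPI_Init/MPI_Finalize program
mpirun -np 4 -hostfile ./hosts ./hello_mpi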
Intel Xeon Processors: Turbo, AVX and more

(Processor specification tables not reproduced here.)
Complete specifications at: http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-spec-update.html
Intel Xeon Processors: Turbo, AVX and more (Contd.)

Intel publishes four different reference frequencies for every Xeon processor:
1. Base Frequency
2. Non-AVX Turbo
3. AVX Base Frequency
4. AVX Turbo

• The Turbo clock for a given model can vary as much as 5% from one processor to another.
• Four possible scenarios exist:
  • Turbo=OFF and AVX=NO  => Clock is set to the Base frequency
  • Turbo=ON  and AVX=NO  => Clock ranges from Base to Non-AVX Turbo
  • Turbo=OFF and AVX=YES => Clock ranges from AVX Base to Base frequency
  • Turbo=ON  and AVX=YES => Clock ranges from AVX Base to AVX Turbo
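To see which of these ranges a node is actually running in, it helps to sample the effective clock while the workload is active; the commands below are one common way to do this on a Linux node (turbostat ships with the kernel tools on most distributions and typically needs root):

grep "model name" /proc/cpuinfo | head -1    # the nominal (base) frequency appears in the model string
grep MHz /proc/cpuinfo | sort -n | tail -4   # instantaneous per-core clocks, highest values last
turbostat sleep 30                           # average busy/idle frequencies sampled over a 30-second window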
Efficient Methods in Executing Applications
Running Parallel Programs in a Cluster: Intel MPI

– Environment in general:
  export PATH
  export LD_LIBRARY_PATH
  export MPI_ROOT
  export I_MPI_FABRICS=shm:dapl
  export I_MPI_DAPL_PROVIDER=ofa-v2-mlx5_0-1u
  export NPROCS=256
  export PPN=16
  export I_MPI_PIN_PROCESSOR_LIST=0-15
  export OMP_NUM_THREADS=2
  export KMP_STACKSIZE=400M
  export KMP_SCHEDULE=static,balanced
– Example command using Intel MPI:
  time mpirun -np $NPROCS -hostfile ./hosts -genvall –ppn $PPN –genv I_MPI_PIN_DOMAIN=omp ./myprogram.exe
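To make the relationship between these variables concrete, here is one plausible sizing for the example values above (the node and core counts are assumptions for illustration, not the configuration used in this study):

export NPROCS=256          # total MPI ranks: 256 ranks / 16 ranks per node = 16 nodes
export PPN=16              # ranks per node
export OMP_NUM_THREADS=2   # 16 ranks x 2 threads = 32 threads, filling an (assumed) 32-core node
time mpirun -np $NPROCS -hostfile ./hosts -genvall -ppn $PPN -genv I_MPI_PIN_DOMAIN=omp ./myprogram.exe
# I_MPI_PIN_DOMAIN=omp gives each rank a domain of OMP_NUM_THREADS cores, so a rank's threads stay on its own cores.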
Profiling a Parallel Program: Intel MPI

– Using Intel MPS (MPI Performance Snapshot)
  – Set all env variables to run the Intel MPI based application
  – Additionally source: source /opt/intel/16.0/itac/9.1.2.024/intel64/bin/mpsvars.sh –papi | vtune
  – Run your application as: mpirun –mps -np $NPROCS -hostfile ./hosts ….
  – Two files, app_stat_xxx.txt and stats_xxx.txt, will be available at the end of the job.
  – Analyze these *.txt files using the mps tool.
  – Sample data you can gather:
      Computation Time: 174.54 sec 51.93%
      MPI Time: 161.58 sec 48.07%
      MPI Imbalance: 147.27 sec 43.81%
      OpenMP Time: 155.79 sec 46.35%
      I/O wait time: 576.47 sec (0.08 %)
– Using Intel MPI built-in profiling capabilities
  – Native mode: mpirun -env I_MPI_STATS 1-4 -env I_MPI_STATS_FILE native_1to4.txt …
  – IPM mode: mpirun -env I_MPI_STATS ipm -env I_MPI_STATS_FILE ipm_full.txt
Tools and Techniques for Boosting Performance
Tools, Techniques and Commands

– Check Linux pseudo files and commands to confirm the system details (a combined example follows below):
  – cat /proc/cpuinfo        >> provides processor details (Intel’s tool cpuinfo.x)
  – cat /proc/meminfo        >> shows the memory details
  – /usr/sbin/ibstat         >> shows the Interconnect IB fabric details
  – /sbin/sysctl –a          >> shows details of the system (kernel, file system etc.)
  – /usr/bin/lscpu           >> shows cpu details including cache sizes
  – /usr/bin/lstopo          >> shows the hardware topology
  – /bin/uname -a            >> shows the system information
  – /bin/rpm –qa             >> shows the list of installed products including versions
  – cat /etc/redhat-release  >> shows the redhat version
  – /usr/sbin/dmidecode      >> shows system hardware and other details (need to be root)
  – /bin/dmesg               >> shows system boot-up messages
  – /usr/bin/numactl         >> checks or sets NUMA policy for processes or shared memory
  – /usr/bin/taskset         >> sets or retrieves the CPU affinity of a process (binds it to specific cores)
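As one illustration of how these commands fit together when checking a node and pinning a run (the core and node numbers are only an example for a hypothetical two-socket node, and app.exe is a placeholder):

/usr/bin/lstopo | head -40                                # inspect the socket/core/cache layout (use lstopo-no-graphics on headless nodes)
/usr/bin/numactl --hardware                               # list the NUMA nodes, their CPUs and free memory
/usr/bin/numactl --cpunodebind=0 --membind=0 ./app.exe    # run on socket 0 using only its local memory
/usr/bin/taskset -c 0-7 ./app.exe                         # or bind the process to cores 0-7 explicitly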
Top 10 Practical Tips for Boosting Performance

– Check the system details thoroughly (Never assume!)
– Choose a compiler and MPI to build your application (All are not the same!)
– Start with some basic compiler flags and try additional flags one at a time (Optimization is incremental!)
– Use the built-in libraries and tools to save time and improve performance (Libraries and tools are your friends!)
– Change compiler and MPI if your code fails to compile or run correctly (Trying to fix things is futile!)
– Test your application at every level to arrive at an optimized code (Remember the 80-20 rule!)
– Customize your runtime environment to achieve desired goals (process parallel, hybrid run etc.)
– Always place and bind the processes and threads appropriately (Life saver!)
– Gather, check and correct your runtime environment (what you get may not be what you want! See the example below.)
– Profile and adjust optimization and runtime environments accordingly (Exercise caution!)
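One cheap way to gather and check the runtime environment, as suggested in the list above, is to raise Intel MPI's debug level for a single short run (the application name is a placeholder):

mpirun -genv I_MPI_DEBUG=4 -np $NPROCS -ppn $PPN -hostfile ./hosts ./myprogram.exe 2>&1 | tee pinning.log
# At I_MPI_DEBUG=4 and above, Intel MPI prints the rank-to-host and rank-to-core pinning map at startup,
# which makes misplaced ranks or accidental oversubscription visible before a long production run.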
Application: High Performance Linpack (HPL)

Description: Benchmark to measure floating point rates (and times) by solving a random dense linear system of equations in double precision.
• Originally developed by Jack Dongarra at Univ. of Tennessee.
• Used the Intel optimized HPL binary for this study.
• Ran the code in hybrid mode: one MPI process per processor, each launching as many threads as there are cores on that processor (a sketch of such a launch follows below).
• Used explicit placing and binding of threads.
• Attempted various choices of array sizes and other parameters to identify the best performance.
• The code provides a self-check to validate the results.

Additional details at: http://icl.eecs.utk.edu/hpl/
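A minimal hybrid launch along those lines might look like the following; the binary name (xhpl), thread count and pinning are assumptions for a hypothetical two-socket, 14-core-per-socket node, not the exact setup used in this study, and HPL.dat must already be sized for the available memory:

export OMP_NUM_THREADS=14    # one thread per core of each (assumed) 14-core socket
export KMP_AFFINITY=compact
mpirun -np 2 -ppn 2 -hostfile ./hosts -genv I_MPI_PIN_DOMAIN=socket ./xhpl   # one MPI rank per socket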
HPL Performance from a Single Node

Proc. Type   Processor    Clock (GHz)  Cores/proc  Cores/node  TDP (W)  L3 Cache (MB)  Rpeak (GFLOPS)  Rmax (GFLOPS)  % Peak
IvyBridge    E5-2695 v2   2.4          12          24          115      30             461             458            99.35
IvyBridge    E5-2670 v2   2.5          10          20          115      25             400             389            97.25
Haswell      E5-2697 v3   2.6          14          28          145      35             1160            934            80.52
Haswell      E5-2698 v3   2.3          16          32          135      40             1180            943            79.92
Broadwell    E5-2690 v4   2.6          14          28          135      35             1164            1073           92.18
Broadwell    E5-2697 v4   2.3          18          36          145      45             1324            1186           89.58

(Chart: "HPL on One Node" – Rpeak, Rmax and % Peak per processor, plotting the values in the table above.)
HPL Performance from a Haswell Cluster

HPL efficiency (% of peak) by node count:
Nodes:       8      16     40     80     120    200    280    288    300    600    1000   1200
% of peak:   81.08  79.41  77.91  78.06  77.09  76.21  75.95  75.95  75.74  75.65  75.49  75.55
(Charts: "HPL on a Haswell Cluster" – Rpeak, Rmax and % Peak vs. number of nodes.)

System: BL460c Gen9, Intel E5-2680 v3, 2.5 GHz, 2P/24C, 128 GB (DDR4-2133 MHz) memory, RHEL 6.5, IB/FDR 1:1, Intel MPI 5.0.1, Intel Composer XE 15.0.0, Turbo ON, Hyperthreading OFF
Application: High Performance Conjugate Gradient (HPCG)

Description: Benchmark designed to create a new metric for ranking HPC systems, complementing the current HPL benchmark. HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important HPC applications.
• Supports various operations in a standalone, unified code.
• The reference implementation is written in C++ with MPI and OpenMP support.
• Driven by a multigrid-preconditioned conjugate gradient algorithm that exercises the key kernels on a nested set of coarse grids.
• Unlike HPL, HPCG can be run for a predetermined time (an input parameter).
• The local domain size (an input) for a node is replicated to form the global domain, resulting in near-linear speed-up.
• Performance is measured by the GFLOP/s rating reported by the code.
• An Intel optimized HPCG binary was used for this benchmark study.
• Ran the HPCG binary in hybrid mode, MPI processes + OpenMP threads (an example invocation follows below).

Additional details at: http://hpcg-benchmark.org/
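A typical run in that hybrid style could be set up as sketched below; the binary name and local domain size are assumptions (Intel-optimized builds are usually named xhpcg or a variant), and hpcg.dat follows the reference input format of two title lines, the local domain dimensions and the run time in seconds:

cat > hpcg.dat << 'EOF'
HPCG benchmark input file
Sample hybrid run
192 192 192
1800
EOF
export OMP_NUM_THREADS=4
mpirun -np $NPROCS -ppn $PPN -hostfile ./hosts -genv I_MPI_PIN_DOMAIN=omp ./xhpcg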
High Performance Conjugate Gradient (HPCG)

HPCG scaling on a Haswell cluster (Blade1, standard distribution) – parallel efficiency by node count:
Nodes:       1       10     20     40     50     80     100    200    300    400    500
% speedup:   100.00  97.99  97.23  99.36  97.79  97.60  93.73  92.32  84.01  84.62  75.55
(Chart: "HPCG on a Haswell Cluster" – GFLOPS and % speedup vs. number of nodes.)

Haswell (FDR) vs. Broadwell (EDR), HSW1 and BDW1, Intel optimized binary – Broadwell improvement over Haswell by node count:
Nodes:           1     4     8     16    24    32    48    64
% improvement:   6.81  6.56  6.01  5.11  6.07  4.34  5.12  0.81
(Chart: "HPCG – Haswell vs Broadwell" – GFLOPS and % improvement vs. number of nodes.)

Blade1: BL460c Gen9, Intel E5-2680 v3, 2.5 GHz, 2P/24C, 128 GB (DDR4-2133 MHz) memory, RHEL 6.5, IB/FDR 1:1, Intel MPI 5.0.1, Intel Composer XE 15.0.0, Turbo ON, Hyperthreading OFF
HSW1: XL170r Gen9, Intel E5-2698 v3, 2.3 GHz, 2P/32C, 128 GB (DDR4-2133 MHz) memory, RHEL 6.5, IB/FDR 1:1, Intel MPI 5.0.1, Intel Composer XE 15.0.0, Turbo ON, Hyperthreading OFF
BDW1: XL170r Gen9, Intel E5-2698 v4, 2.2 GHz, 2P/40C, 128 GB (DDR4-2133 MHz) memory, RHEL 6.6, IB/EDR 1:1, Intel MPI / Intel Compiler 2016.2.181, Turbo ON, Hyperthreading OFF
Application: Graph500

Description: Benchmark designed to address the performance of data-intensive HPC applications using graph algorithms.
• The code generates the problem size from a scale parameter (input), creating 2^scale vertices.
• Performance is measured in TEPS (Traversed Edges Per Second).
• The median TEPS, in either GTEPS (giga-TEPS) or MTEPS (mega-TEPS), is reported.
• Used a source code optimized for a scale-out (DMP) system by Kyushu University (Japan).
• The application is written in the C language.
• Compiled using the GNU compiler, gcc (a build-and-run sketch follows below).
• The code automatically detects the number of processors and cores and runs optimally.
• No external placement and binding by the user is needed.
• Needs a large memory footprint to run very large problems.

Additional details at: http://www.graph500.org/
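A build-and-run sketch following the reference implementation's command-line convention, where the first argument is the scale (the makefile variables and binary name are assumptions; the Kyushu DMP version may use different ones):

make CC=mpicc CFLAGS="-O3 -fopenmp"                       # build the MPI code with gcc through the wrapper
mpirun -np $NPROCS -hostfile ./hosts ./graph500_bfs 30    # scale 30 => 2^30 (about 1.07 billion) vertices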
Graph500 Performance from a Haswell Cluster

Graph500 (DMP version) – median GTEPS by cluster size:
Nodes / Ranks:   8/32   16/64  32/128  48/192  64/256  96/384  128/512
median GTEPS:    39.87  74.27  133.39  179.14  212.42  264.15  281.57
(Chart: "Graph500 – DMP Version" – median_GTEPS vs. number of nodes / ranks.)

System "Hestia": BL460c Gen9, Intel64 / Haswell E5-2697 v3, 2.60 GHz, 2 processors/node, 28 cores/node, 128 GB memory (8x16 GB, 2R, DDR4 @ 2133 MHz), RHEL 6.6, IB/FDR (Connect-IB), Intel MPI 5.0.2.044, NFS file system
Application: Weather Research and Forecasting (WRF)

Description: WRF is a Numerical Weather Prediction (NWP) model designed to serve both atmospheric research and operational forecasting needs. NWP refers to the simulation and prediction of the atmosphere with a computer model, and WRF is a set of software to accomplish this.
• The code was jointly developed by NCAR, NCEP, FSL, AFWA, NRL, Univ. of Oklahoma and FAA.
• WRF is freely distributed and supported by NCAR.
• Offers two dynamical solvers: WRF-ARW (Advanced Research WRF) and WRF-NMM (Nonhydrostatic Mesoscale Model).
• Provides capabilities to mix and match modules to simulate various atmospheric conditions and to couple with other NWP models (e.g., ocean modeling codes).
• Can accommodate nested data domains, from coarse to very fine grids, in a single simulation.
• Popular data sets to port and optimize are CONUS 12km and CONUS 2.5km (available at NCAR).
• Options exist to use dedicated processors for I/O (quilting) and various processor layouts (tiling); a build-and-run sketch follows below.

Additional details at: http://www.wrf-model.org/index.php
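A condensed build-and-run sketch for the ARW solver (the configure option, data staging and rank counts vary by system; the "Timing for main" lines in the rsl output are where the average time per step comes from):

./configure                        # choose an Intel dm+sm (MPI + OpenMP) option when prompted
./compile em_real >& compile.log   # builds wrf.exe (and real.exe) under main/
cd test/em_real                    # stage the CONUS 2.5km input files here, then:
mpirun -np $NPROCS -ppn $PPN -hostfile ./hosts ./wrf.exe
grep "Timing for main" rsl.out.0000 | tail    # per-step timings used to compute the average time/step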
WRF (v 3.8.1) Results with CONUS 2.5km Data Set

(Chart: "WRF Scaling with CONUS 2.5km" – average time/step (s) and speedup vs. number of nodes, 4 to 128 nodes.)

WRF GFLOPS with CONUS 2.5km by node count:
Nodes:    4    8    16    24    32    48    64    96    128
GFLOPS:   202  448  1017  1585  2034  3234  4065  5245  7964

System: XL170r Gen9, Intel E5-2699 v4, 2.2 GHz, 2P/44C, 256 GB (DDR4-2133 MHz) memory, RHEL 6.7, IB/EDR 1:1, Intel Composer XE and Intel MPI (2016.3.210), Turbo ON, Hyperthreading OFF
Application: Clover Leaf

Description: Clover Leaf is an open-source Computational Fluid Dynamics (CFD) code developed and distributed by the UK Mini-Application Consortium (UK-MAC).
• Solves the compressible Euler equations on a Cartesian grid using an explicit, second-order accurate method.
• Uses a ‘kernel’ (low level building block) approach with minimal control logic to increase compiler optimization.
• Support for accelerators (using both OpenACC and OpenCL) is available.
• Sacrifices memory (saving intermediate results rather than re-computing them) to improve performance.
• Available in two flavors: 2-Dimensional (2D) and 3-Dimensional (3D) modeling.
• Rugged, easy to port and run, and a good candidate for evaluating and comparing systems (a build-and-run sketch follows below).
• A large number of data sets are available for the 2D and 3D models, with run times ranging from a few seconds to hours.

Additional details at: http://uk-mac.github.io/CloverLeaf/
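A build-and-run sketch for the 3D flavor (the repository name, makefile variable and input-deck path are assumptions; check the UK-MAC GitHub organization for the exact layout of the 3D code):

git clone https://github.com/UK-MAC/CloverLeaf3D.git
cd CloverLeaf3D
make COMPILER=INTEL                          # the makefiles select flags per compiler family
cp InputDecks/clover_bm256.in clover.in      # the code reads its input deck from clover.in
mpirun -np $NPROCS -ppn $PPN -hostfile ./hosts ./clover_leaf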
Clover Leaf (3D) Results with bm256 Data Set

(Chart: "Clover Leaf Scaling with bm256 Data" – wall clock (s) and speedup vs. number of nodes, 1 to 128 nodes.)

System: XL230a Gen9, Intel E5-2697A v4, 2.6 GHz, 2P/32C, 128 GB (DDR4-2400 MHz) memory, RHEL 7.2, IB/EDR 1:1, Intel Composer XE and Intel MPI (2016.3.210), Turbo ON, Hyperthreading OFF
Conclusions

 HP is the No. 1 vendor in HPC and Cluster Solutions
 Configure and tune the system first
 Check the system details (processor, clock, memory and BIOS settings)
 Investigate the compiler and flags that best suit your application
 Profile the application and optimize further to boost performance
 Explore and decide on the right interconnect and protocols
 Take advantage of tools and commands to improve performance
 Run the application the right way (environment, placement etc.)
 Choose the right file system (local disk, NFS, Lustre, IBRIX etc.)
 Settle on an environment that is best for your application, time and value
 Never assume, and always check the cluster before benchmarking