Current R&D hands-on work for pre-Exascale HPC systems
Joshua Mora, PhD 
Agenda 
•Heterogeneous HW flexibility on SW request. 
•PRACE 2011 prototype, 1TF/kW efficiency ratio. 
•Performance debugging of HW and SW on multi-socket, multi-chipset, multi-GPUs, multi-rail (IB). 
•Affinity/binding. 
•In background: [c/nc] HT topologies, I/O performance implications, performance counters, NIC/GPU device interaction with processors, SW stacks. 
If (HPC==customization) {Heterogeneous HW/SW flexibility;} else {Enterprise solutions;} 
On the HW side, customization is all about 
•Multi-socket systems, to provide CPU computing density (i.e. aggregated G[FLOP/Byte]/s) and memory capacity.
•Multi-chipset systems, to hook up multiple GPUs (to accelerate computation) and NICs (to tightly couple processors/GPUs across computing nodes).
But 
•Pay attention, among other things, to [nc/c]HT topologies to avoid running out of bandwidth (i.e. resource limitations) between devices when the HPC application starts pushing to the limits in all directions.
Example: RAM NFS (RNA Networks) over 4 QDR IB cards
•QDR IB switch, 36 ports
•FAT NODE in 4U: 8x six-core @ 2.6 GHz, 256 GB RAM @ DDR2-667, 4x PCIe gen2, 4x QDR IB with 2x IB ports, RNAcache appliance
•CLUSTER NODE 01 in 1U: 2x six-core @ 2.6 GHz, 16 GB RAM @ DDR2-800, 1x PCIe gen2, 1x QDR IB with 1 IB port, RNAcache client
•CLUSTER NODE 08 in 1U: 2x six-core @ 2.6 GHz, 16 GB RAM @ DDR2-800, 1x PCIe gen2, 1x QDR IB with 1 IB port, RNAcache client
[Diagram link labels: 6 GB/s bidir (x6), 2x 10 GB/s (x2), 8x 10 GB/s]
Ultra fast local/swap/shared cluster file system: reads/writes @ 20 GB/s aggregated.
Example: OpenMP application on a NUMA system
Is it better to configure the system with memory node interleaving disabled?
Or with memory node interleaving enabled? (Interleaving causes huge cHT traffic, a limitation.)
Answer: modify the application to exploit the NUMA system by allocating local arrays after the threads have started, as sketched below.
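A minimal sketch of that pattern (hypothetical array and size names; compile with e.g. gcc -fopenmp): each OpenMP thread allocates and first-touches its own working array inside the parallel region, so the pages fault in on that thread's local memory node instead of on the node where the master thread ran. Threads should also be pinned (e.g. OMP_PROC_BIND or GOMP_CPU_AFFINITY) for the placement to stick.

#include <omp.h>
#include <stdlib.h>

#define N_PER_THREAD (1L << 22)              /* hypothetical per-thread working set */

int main(void)
{
    int nthreads = omp_get_max_threads();
    double *local[nthreads];                 /* one private array per thread */

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        local[t] = malloc(N_PER_THREAD * sizeof(double));
        for (long i = 0; i < N_PER_THREAD; i++)
            local[t][i] = 0.0;               /* first touch => pages land on the local node */
        /* ... thread-local computation on local[t] ... */
    }

    for (int t = 0; t < nthreads; t++)
        free(local[t]);
    return 0;
}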
[Charts: DRAM bandwidth (GB/s) over ~9 s, non-interleaved vs. interleaved, from per-node MEMORY CONTROLLER counters (Total DRAM accesses, MB/s) for threads t00/t06/t12/t18 of the gsmpsrch run.]
PRACE 2011 prototype, HW description 
Multi-rail QDR IB, fat-tree topology.
PRACE 2011 prototype, HW description (cont.) 
•6U sandwich unit replicated 6 times within a 42U rack.
•Compute nodes connected with multi-rail IB.
•Single 36-port IB switch (fat tree).
•Management network: 1 Gb/s.
PRACE 2011 prototype, SW description 
•“Plain” C/C++/Fortran (application)
•Pragma directives for OpenMP (multi-threading)
•Pragma directives for HMPP (CAPS enterprise) (GPUs)
•MPI (OpenMPI, MVAPICH) (communications)
•GNU/Open64/CAPS (compilers)
•GNU debuggers
•ACML, LAPACK, ... (math libraries)
•“Plain” OS (e.g. SLES11 SP1, with support for Multi-Chip Modules)
•No need to develop kernels in
•OpenCL (for ATI cards) or
•CUDA (for NVIDIA cards).
PRACE 2011 prototype : exceeding all requirements 
Requirements and Metrics 
•Budget of 400k EUR 
•20 TFLOP/s peak 
•Contained in 1 square meter (ie. single rack) 
•0.6 TFLOP/kW 
•This system delivers 10 TFLOP/s of real double precision performance in a rack within 10 kW, assuming the multi-cores only pump data in and out of the GPUs.
•100 racks can deliver 1 PFLOP/s within 1 MW, for well over 20 million EUR.
•Reaching the affordability point for the private sector (e.g. O&G).
Metric      Requirement   Proposal        Exceeded
peak TF/s   20            22              yes
m^2         1             1 (42U rack)    met
TF/kW       0.6           1.1             yes
EUR         400k          250k            yes
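As a quick sanity check of the scale-out claim above, a sketch that uses only the per-rack figures from this slide (10 TFLOP/s sustained double precision, ~10 kW, 250k EUR per rack); it is a back-of-the-envelope calculation, not data from the proposal.

#include <stdio.h>

int main(void)
{
    const double tflops_per_rack = 10.0;    /* sustained double precision per rack */
    const double kw_per_rack     = 10.0;
    const double eur_per_rack    = 250e3;

    const int racks = 100;
    printf("%d racks: %.0f TFLOP/s, %.1f MW, %.1f M EUR\n",
           racks,
           racks * tflops_per_rack,          /* 1000 TFLOP/s = 1 PFLOP/s */
           racks * kw_per_rack / 1000.0,     /* 1.0 MW */
           racks * eur_per_rack / 1e6);      /* 25 M EUR */
    return 0;
}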
PRACE 2011 prototype : HW flexibility on SW request 
NextIO box:
•PCIe gen2 switch box/appliance (GPUs, IB, FusionIO, ...)
•Connection of x8 and x16 PCIe gen1/2 devices
•Hot-swappable PCI devices
•Dynamically reconfigurable through SW without having to reboot the system
•Virtualization capabilities through the IOMMU
•Applications can be “tuned” to use more or fewer GPUs on request: some GPU kernels are very efficient and their PCIe bw requirements are low, so more GPUs can be dynamically allocated to fewer compute nodes, increasing Performance/Watt.
•Targeting SC10 for a demo/cluster challenge.
Performance debugging of HW/SW on multi…you name it. 
Example: 2 processors + 2 chipsets + 2 GPUs
PCIeSpeedTest available at developer.amd.com
[Diagram: CPU 0-3 with Mem0-3, Chipset 0 and Chipset 1 attached via ncHT links, GPU 0 on Chipset 0 and GPU 1 on Chipset 1. Link labels: b-12 GB/s (bidirectional) and u-6 GB/s (unidirectional) on the chipset/GPU paths; cHT links labeled 5-6 GB/s.]
Pay attention to the cHT topology and I/O balancing, since cHT links can restrict CPU↔GPU bandwidth (u-5 GB/s restricted bw).
Performance debugging of HW on multi…you name it. 
Problem: CPU→GPU good, GPU→CPU bad. Why?
[Chart: bandwidth (GB/s) core0_mem0 to/from GPU over ~52 s, from performance counters for GPU_00t: MEMORY CONTROLLER Total DRAM accesses / writes / reads inc pf (MB/s) and HYPERTRANSPORT LINKS HT3 xmit (MB/s).]
First phase: CPU → GPU at ~6.4 GB/s; the memory controller does reads.
Second phase: GPU → CPU at 3.5 GB/s; something is wrong.
In the second phase the memory controller is doing both writes and reads. The reads do not make sense: the memory controller should only be doing writes.
While we figure out why we see reads on the MCT during GPU→CPU transfers, let's look at the affinity/binding:
•On Node 0 → GPU 0: GPU 0 → CPU 0, u-3.5 GB/s.
•On Node 1 → GPU 0: GPU 0 → CPU 1, u-2.1 GB/s.
•On Node 1 (process) but memory/GART on Node 0: GPU 0 → CPU 1, u-3.5 GB/s.
•We cannot pin the buffers of all Nodes to the Node closest to the GPU, since that overloads the MCT of that Node.
[Charts: CPU->GPU and GPU->CPU bandwidth (GB/s) vs. transfer size (1 B to 128 MB) for core0_node0, core4_node1, and core4_node0.]
Linux/Windows driver fix got rid of the reads on GPU→CPU transfers
[Charts: left, bandwidth (GB/s) core0_mem0 to/from GPU over ~52 s from MEMORY CONTROLLER counters (Total DRAM accesses / writes / reads inc pf, MB/s), now showing only reads during CPU→GPU and only writes during GPU→CPU; right, % improvement of CPU->GPU and GPU->CPU bandwidth vs. transfer size (256 KB to 256 MB) for core0_mem0.]
Linux affinity/binding for CPU and GPU 
It is important to set process/thread affinity/binding as well as memory affinity/binding.
•Externally, using numactl. Caution: it is getting obsolete on Multi-Chip Modules (MCM), since it sees all the cores on the socket as a single chip with a single memory controller and a single shared L3 cache.
•New tool: hwloc (www.open-mpi.org/projects/hwloc); memory binding is not yet implemented. It is being used in OpenMPI and MVAPICH.
•hwloc also replaces PLPA (obsolete), since PLPA cannot see MCM processors properly (e.g. Magny-Cours, Beckton).
•For CPU and GPU work, set the process/thread affinity and bind memory locally to it. Do not rely on first touch for the GPU.
•The GPU driver can put the GART on any node and incur remote memory accesses when sending data to/from the GPU.
•Enforcing memory binding creates the GART buffers on the same memory node, maximizing I/O throughput (see the sketch below).
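A minimal Linux sketch of that binding pattern, assuming libnuma is installed (link with -lnuma); the node number and buffer size are illustrative. The process is pinned to the cores and memory of one node before the GPU runtime is initialized, so host staging buffers (and the driver's GART pages) stay on that node. Externally, numactl --cpunodebind=0 --membind=0 ./app achieves the same, subject to the MCM caveat above.

#include <numa.h>
#include <stdio.h>

int main(void)
{
    int node = 0;                         /* illustrative: node hosting GPU 0 */

    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }
    numa_run_on_node(node);               /* CPU affinity: cores of this node           */
    numa_set_preferred(node);             /* memory affinity for new pages              */
                                          /* (numa_set_membind for strict binding)      */

    /* Allocate a host buffer explicitly on that node. */
    size_t sz = 64UL << 20;
    double *buf = numa_alloc_onnode(sz, node);
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    /* ... initialize GPU runtime and do CPU<->GPU transfers using buf ... */

    numa_free(buf, sz);
    return 0;
}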
Linux results with processor and memory binding 
[Charts: CPU->GPU and GPU->CPU bandwidth (GB/s) vs. transfer size (1 B to 128 MB) for core0_mem0 and core4_mem1 with processor and memory binding.]
Windows affinity/binding for CPU and GPU 
•There is no numactl command on Windows.
•There is a start /affinity command at the command prompt, but it only sets process affinity.
•The hwloc tool is also available on Windows, but again memory node binding is mandatory.
•Huge performance penalties if memory node binding is not done.
•It requires using the API provided by Microsoft for Windows HPC and Windows 7. Main function call: VirtualAllocExNuma.
•There is a debate among the SW development teams on where the fix needs to be introduced. Possible SW layers: application, OpenCL, CAL, user driver, kernel driver.
•Ideally, OpenCL/CAL should read the affinity of the process thread and set the NUMA node binding before creating GART resources.
•Running out of time, a quick fix was implemented at the application level (i.e. the microbenchmark). There is a concern about complexity, since the application developer needs to be aware of the cHT topology and NUMA nodes (see the sketch below).
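A hedged sketch of what that application-level fix involves on Windows (the core number is illustrative, error handling is minimal): pin the current thread to a core, ask the OS for that core's NUMA node, and pass the node number to the VirtualAllocExNuma call shown on the next slide.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD core = 4;                               /* illustrative core id */
    UCHAR node = 0;

    /* Pin the current thread to the chosen core. */
    if (!SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << core)) {
        fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }

    /* Ask which NUMA node that core belongs to. */
    if (!GetNumaProcessorNode((UCHAR)core, &node)) {
        fprintf(stderr, "GetNumaProcessorNode failed: %lu\n", GetLastError());
        return 1;
    }
    printf("core %lu is on NUMA node %u\n", core, (unsigned)node);

    /* 'node' is then passed as NodeNumber to VirtualAllocExNuma (next slide)
       so GART/staging buffers land on the local memory node. */
    return 0;
}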
Portion of code changed and results 
Example of allocating a simple array with memory node binding:
#include <windows.h>   /* VirtualAllocExNuma, VirtualFreeEx */
/* e.g. hProcess = GetCurrentProcess(); NodeNumber = target NUMA node */
a = (double *) VirtualAllocExNuma( hProcess, NULL, array_sz_bytes,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, NodeNumber );
/* a = (double *) malloc(array_sz_bytes); */
for (i = 0; i < array_sz; i++) a[i] = function(i);
VirtualFreeEx( hProcess, (void *) a, 0, MEM_RELEASE );   /* dwSize must be 0 with MEM_RELEASE */
/* free( (void *) a ); */
For PCIeSpeedTest, the fix was introduced at USER LEVEL with VirtualAllocExNuma + calResCreate2D instead of calResAllocRemote2D, which does the allocations inside CAL without giving the USER the possibility to set the affinity.
[Chart: CPU->GPU and GPU->CPU bandwidth (GB/s) vs. transfer size (1 B to 128 MB) on Windows with memory node 0 binding.]
Q&A 
How many QDR IB cards can a single chipset handle? Same performance as the Gemini interconnect.
THANK YOU 
[Chart: 3 QDR IB cards on a single PCIe gen2 chipset, RDMA write bandwidth (GB/s) on the client side over ~31 s, from MEMORY CONTROLLER Total DRAM accesses and HYPERTRANSPORT LINKS HT3 xmit / HT3 CRC counters (MB/s).]
How to assess Application Efficiency with performance counters 
–Efficiency questions (theoretical vs. maximum achievable vs. real applications) on the I/O subsystems: cHT, ncHT, (multi) chipsets, (multi) HCAs, (multi) GPUs, HDs, SSDs.
–Most of those efficiency questions can be represented graphically. Example: the roofline model (Williams), which allows comparing architectures and how efficiently the workloads exploit them. Values are obtained through performance counters.
Appendix: Understanding workload requirements
[Roofline chart: GF/s vs. arithmetic intensity (GF/s)/(GB/s), log-log scale, for a 2P G34 Magny-Cours 2.2 GHz system, with the peak GFLOP/s roof and points for Stream Triad (OP=0.082), DGEMM (OP=14.15), and GROMACS (OP=5, GF=35).]
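As a worked sketch of the roofline bound in the chart above: attainable GF/s = min(peak GF/s, sustained GB/s x arithmetic intensity). The operational intensities are the ones quoted on this slide; the peak FLOP rate and sustained bandwidth below are illustrative assumptions for a 2P Magny-Cours 2.2 GHz node, not measured values.

#include <stdio.h>

static double roofline(double peak_gflops, double peak_gbs, double op)
{
    double bw_bound = peak_gbs * op;            /* memory-bandwidth roof */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void)
{
    const double peak_gflops = 211.2;   /* assumption: 24 cores x 2.2 GHz x 4 DP FLOP/cycle */
    const double peak_gbs    = 42.0;    /* assumption: sustained Stream-class bandwidth     */

    const char  *name[] = { "Stream Triad", "GROMACS", "DGEMM" };
    const double op[]   = { 0.082, 5.0, 14.15 };   /* operational intensities from the slide */

    for (int i = 0; i < 3; i++)
        printf("%-12s OP=%6.3f -> attainable %.1f GF/s\n",
               name[i], op[i], roofline(peak_gflops, peak_gbs, op[i]));
    return 0;
}

The measured GROMACS point on the slide (GF=35) sits well below its roofline bound, which is exactly the gap the model makes visible.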
