ISUM 2012, Guanajuato, Mexico

Hands-on work on AMD technologies for HPC solutions
Joshua.Mora@amd.com

ABSTRACT: The goal of this talk is to present in a practical way (through a hands-on session) how the latest AMD technology works and meets current high performance computing requirements. The session reviews concepts such as the performance metrics of GFLOPs and GB/s; the performance efficiency of the FPU and of the memory controllers/channels; the scalability of multi-socket platforms; tuning tips such as process/thread affinity and multi-InfiniBand/GPU setups with their I/O affinity; the impact of appropriate math libraries and compilers; and the power consumption characteristics of a system when heavily stressed with different HPC workloads. By the end of the talk/session you should walk away with a good foundation on which building-block technologies matter for you and how to design and exploit your own HPC solutions.

Probe filter

Necessary for scaling memory-bound applications, since it keeps track (through a cache directory held in L3) of where data is and on which memory bank when cores request that data again.

Aggregated memory bandwidth (GB/s)

# numanodes    SHANGHAI   ISTANBUL   MAGNYCOURS   INTERLAGOS
Probe filter   No         Yes        Yes          Yes
1              8          10         13           18.5
2              16         20         26           37
4              21         40         52           74
8              22         80         104          148

Aggregated FLOPs (GF/s), assuming 2.3 GHz core frequency and 80% HPL efficiency

# numanodes    SHANGHAI   ISTANBUL   MAGNYCOURS   INTERLAGOS
Probe filter   No         Yes        Yes          Yes
1              29.44      44.16      44.16        58.88
2              58.88      88.32      88.32        117.76
4              117.76     176.64     176.64       235.52
8              235.52     353.28     353.28       471.04
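
The FLOPs figures above follow from a simple product: cores per numanode x DP FLOPs per clock per core x core frequency x HPL efficiency. A minimal Python sketch that reproduces them is given below. The 2.3 GHz frequency and the 80% efficiency come from the slide; the per-numanode core counts (4, 6, 6, 8) and the 4 DP FLOPs/clk per core are assumptions added here because they match the table, not figures stated on the slide. The function name is illustrative.

```python
# Assumed cores per numanode (not stated on the slide; chosen to match the table).
CORES_PER_NUMANODE = {
    "SHANGHAI": 4,
    "ISTANBUL": 6,
    "MAGNYCOURS": 6,
    "INTERLAGOS": 8,
}
DP_FLOPS_PER_CLK_PER_CORE = 4  # assumption, see lead-in
FREQ_GHZ = 2.3                 # from the slide
HPL_EFFICIENCY = 0.80          # from the slide


def hpl_gflops(processor, numanodes):
    """Estimated aggregated HPL performance in GF/s."""
    cores = CORES_PER_NUMANODE[processor] * numanodes
    peak = cores * DP_FLOPS_PER_CLK_PER_CORE * FREQ_GHZ
    return round(peak * HPL_EFFICIENCY, 2)


if __name__ == "__main__":
    # One row per numanode count, one column per processor.
    for n in (1, 2, 4, 8):
        print(n, [hpl_gflops(p, n) for p in CORES_PER_NUMANODE])
```

Note that the compute estimate scales linearly with the number of numanodes for every processor, whereas the memory bandwidth table only does so for the processors with a probe filter.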

Bulldozer architecture
• Bulldozer compute unit
  – Core pair
• Core shared resources
  – L2 cache
  – Floating Point Unit
  – Instruction scheduler
  – Power management
• Core independent resources
  – L1 Data cache
  – Integer Unit

Bulldozer block diagram
• HPC workloads use all the cores for the same kind of computation, and in a synchronized way.
• High workload flexibility, such as in Cloud under a power budget.
Example: Cloud workloads can use 1 core for integer work while the other core uses the whole FPU for number crunching.

Socket block diagram
16 cores are grouped by core pairs into 8 compute units, which are in turn grouped into 2 numanodes. Each numanode has 2 memory channels. The numanodes are interconnected through cHT (coherent HyperTransport). The socket delivers 18.5 GB/s x 2 and 60 DP GF/s x 2 under 130 W.

Bulldozer architecture (cont)
• Flexible Floating Point Unit
  – Work that 1 core can do when it has the FPU to itself: 8 DP FLOPs/clk
  – Work that each of the 2 cores can do when they share the FPU: 4 DP FLOPs/clk
  – Example of DGEMM from ACML.
• FMA4 and FMA3 instructions
  – FMA4 on Interlagos: d = a (+/-) b*c
  – FMA3 on Abu Dhabi: c = a (+/-) b*c
• AVX instructions
  – Increase IPC by compacting instructions

Operating systems for Interlagos
• Basic list of OSes providing proper performance
  – Windows Server 2008 R2
  – RHEL 6.2
  – CentOS 6.2
  – SLES 11 SP2
  – Scientific Linux 6.2
Older versions need specific patches in order to perform.

Compiler flags
• Open64 version >= 4.2.5
• GCC version >= 4.6
• PGI version >= 11.9
• Open64 and GCC
  – Compile/link flags: -Ofast -march=bdver1
• PGI
  – Compile/link flags: -fast -tp Interlagos-64

numactl --hardware and numastat
Detecting wrong BIOS settings in the system configuration: if NODE INTERLEAVED were ENABLED, there would be only 1 physical numa node, with core ids 0,1,2,...,30,31 and 64 GB of memory.
Annotations on the command output:
• Memory on each numa node and how much of it is available (free)
• Core ids for numa node 3
• Good, no misses
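
The check described above can be scripted. The sketch below parses text in the format printed by `numactl --hardware` and flags the single-numanode case. The sample output is hypothetical (a made-up 4-numanode, 32-core, 64 GB system consistent with the numbers on this slide), and the function names and the `expected_nodes` parameter are illustrative, not part of any tool.

```python
import re

# Hypothetical sample in the format printed by `numactl --hardware`.
SAMPLE = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16384 MB
node 0 free: 15820 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node 1 free: 16010 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16384 MB
node 2 free: 15990 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 16384 MB
node 3 free: 16002 MB
"""


def parse_numactl_hardware(text):
    """Return {node_id: {"cpus": [...], "size_mb": int, "free_mb": int}}."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"node (\d+) cpus:\s*(.*)", line)
        if m:
            cpus = [int(c) for c in m.group(2).split()]
            nodes.setdefault(int(m.group(1)), {})["cpus"] = cpus
            continue
        m = re.match(r"node (\d+) (size|free):\s*(\d+) MB", line)
        if m:
            key = m.group(2) + "_mb"
            nodes.setdefault(int(m.group(1)), {})[key] = int(m.group(3))
    return nodes


def looks_node_interleaved(nodes, expected_nodes):
    """True when only one numa node is visible but more were expected."""
    return len(nodes) == 1 and expected_nodes > 1


if __name__ == "__main__":
    nodes = parse_numactl_hardware(SAMPLE)
    print("core ids for numa node 3:", nodes[3]["cpus"])
    print("node interleaving suspected:", looks_node_interleaved(nodes, 4))
```

On a real system the text would come from running the command itself, for example with `subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout`.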

EXAMPLE using likwid: Hybrid MPI+OpenMP
• Build the application file and launch the MPI job in hybrid OpenMP mode with 1 thread per compute unit on 2 . Using 4 compute nodes.
• export OMP_NUM_THREADS=4
• mpirun -app ./appfile
• Where appfile is as follows (the repeated core id is for the binding of the MPI process plus the 4 worker threads):
    -h node1 -np 1 likwid-pin -q -c 0,0,2,4,6 ./application
    -h node1 -np 1 likwid-pin -q -c 8,8,10,12,14 ./application
    -h node1 -np 1 likwid-pin -q -c 16,16,18,20,22 ./application
    -h node1 -np 1 likwid-pin -q -c 24,24,26,28,30 ./application
    …
    -h node4 -np 1 likwid-pin -q -c 0,0,2,4,6 ./application
    -h node4 -np 1 likwid-pin -q -c 8,8,10,12,14 ./application
    -h node4 -np 1 likwid-pin -q -c 16,16,18,20,22 ./application
    -h node4 -np 1 likwid-pin -q -c 24,24,26,28,30 ./application
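
Writing the appfile by hand gets error-prone as the node count grows. The sketch below generates the same kind of lines in Python. The host names (`node1` to `node4`), the function name, and the default layout values (32 core ids per node, 4 MPI ranks per node) are assumptions for illustration; the pinning pattern (first core id repeated, then every second core id so that each compute unit gets one thread) is the one shown on the slide.

```python
def appfile_lines(hosts, cores_per_node=32, ranks_per_node=4,
                  app="./application"):
    """Build one appfile line per MPI rank, pinned with likwid-pin."""
    cores_per_rank = cores_per_node // ranks_per_node
    lines = []
    for host in hosts:
        for rank in range(ranks_per_node):
            first = rank * cores_per_rank
            # One worker thread per compute unit: every second core id.
            workers = list(range(first, first + cores_per_rank, 2))
            # The first core id is repeated, as in the slide's binding lists.
            core_list = ",".join(str(c) for c in [first] + workers)
            lines.append(
                f"-h {host} -np 1 likwid-pin -q -c {core_list} {app}"
            )
    return lines


if __name__ == "__main__":
    hosts = ["node1", "node2", "node3", "node4"]  # hypothetical host names
    with open("appfile", "w") as f:
        f.write("\n".join(appfile_lines(hosts)) + "\n")
```

With the defaults, each rank gets 4 worker core ids, which matches `OMP_NUM_THREADS=4`; if the layout values change, the thread count has to be changed with them.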

Putting it all together
Pre-exascale (high computing density) system
  – Multicore
  – Multisocket
  – Multichipset
  – Multirail
  – MultiGPU
  – Dynamically reconfigurable multi-root PCI devices through workload analysis

More @ http://developer.amd.com
• X86 Open64 Compilers Suite (http://developer.amd.com/tools/open64/)
• AMD Developer Tools (http://developer.amd.com/tools/)
• AMD Libraries (ACML, LibM, etc.) (http://developer.amd.com/libraries/)
• AMD Opteron™ 4200/6200 Series processors Compiler Options Quick Guide (http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf)
• AMD OpenCL™ Zone (http://developer.amd.com/zones/OpenCLZone/)
• AMD HPC (www.amd.com/hpc)
• AMD APP SDK Documentation (http://developer.amd.com/sdks/AMDAPPSDK/documentation/Pages/default.aspx)
• Using the x86 Open64 Compiler Suite (http://developer.amd.com/tools/open64/Documents/open64.html)
• x86 Open64 Release Notes (http://developer.amd.com/tools/open64/assets/ReleaseNotes.txt)
• ACML 5.0 Information (http://developer.amd.com/libraries/acml/features/pages/default.aspx)
• Software Optimization Guide for “Bulldozer” processors (http://support.amd.com/us/Processor_TechDocs/47414.pdf)
• AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions (http://support.amd.com/us/Embedded_TechDocs/43479.pdf)
• 2- and 4-socket results for the AMD Opteron™ 6276 Series processors (16 core, 2.3 GHz). The SPEC runs used the X86 Open64 Compiler Suite.
  – http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111025-18742.pdf
  – http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111025-18748.pdf