AMD technologies for HPC

  1. ISUM 2012, Guanajuato, Mexico. Hands-on work on AMD technologies for HPC solutions. Joshua.Mora@amd.com
     ABSTRACT: The goal of this talk is to present in a practical way (through a hands-on session) how the latest AMD technology works and meets current high performance computing requirements. The session reviews the performance metrics of GFLOP/s and GB/s, the performance efficiency of the FPU and of the memory controllers/channels, the scalability of multi-socket platforms, tuning tips such as process/thread affinity, multi-InfiniBand/GPU configurations and their I/O affinity, the impact of appropriate math libraries and compilers, and the power consumption characteristics of a system when heavily stressed with different HPC workloads. By the end of the talk/session you should walk away with a good foundation on which building-block technologies matter for you and how to design and exploit your own HPC solutions.
  2. Performance metrics
     – GFLOP/s (SP, DP) (SSE, FMA)
     – GB/s (SP, DP) (streaming stores)
     – Memory latency (local/remote)
     – Memory bandwidth (local/remote)
     – Network latency
     – Network bandwidth
     – Message rate (network)
     – IOPS, sustained reads/writes (storage)
     – Roofline model (performance modeling)
  3. Roofline model:
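As a rough illustration of the roofline model, attainable performance can be bounded by the smaller of the peak compute rate and the product of arithmetic intensity and memory bandwidth. The sketch below plugs in the per-numanode Interlagos figures quoted in this deck (58.88 GF/s at 80% HPL efficiency, 18.5 GB/s). The function name and the sample intensities are illustrative assumptions, not part of the original slides.

```python
def roofline_gflops(intensity_flop_per_byte, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s under the roofline bound: min(compute roof, memory roof)."""
    return min(peak_gflops, intensity_flop_per_byte * bandwidth_gbs)


# Per-numanode Interlagos figures quoted in the deck.
PEAK_GFLOPS = 58.88    # GF/s at 2.3 GHz, 80% HPL efficiency
BANDWIDTH_GBS = 18.5   # GB/s

# Arithmetic intensity (FLOP/byte) at which a kernel stops being memory bound.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GBS

if __name__ == "__main__":
    # Sample intensities are hypothetical, chosen to land on both sides of the ridge.
    for intensity in (0.125, 1.0, ridge_point, 8.0):
        bound = roofline_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
        print(f"{intensity:6.3f} FLOP/byte -> at most {bound:6.2f} GF/s")
```

Kernels to the left of the ridge point are limited by memory bandwidth, which is why the memory-related tuning discussed in the later slides matters for them.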
  4. Scalability
     • Hardware based:
       – Multicore
       – Numanodes in socket package
       – Multisocket
       – Probe filter (HT Assist)
       – Multichipset
     • Software based:
       – Compiler, math libraries, MPI, OpenMP, affinity.
       – Algorithm, computation/communication overlap, non-blocking collectives.
  5. Probe filter
     Necessary for scaling memory-bound applications, since it keeps track (through a cache directory in L3) of which memory bank holds the data when cores request it again.

     Aggregated memory bandwidth (GB/s)
     Processor      SHANGHAI   ISTANBUL   MAGNYCOURS   INTERLAGOS
     Probe filter   No         Yes        Yes          Yes
     1 numanode     8          10         13           18.5
     2 numanodes    16         20         26           37
     4 numanodes    21         40         52           74
     8 numanodes    22         80         104          148

     Aggregated FLOPs (GF/s), assuming a 2.3 GHz core frequency and 80% HPL efficiency
     Processor      SHANGHAI   ISTANBUL   MAGNYCOURS   INTERLAGOS
     Probe filter   No         Yes        Yes          Yes
     1 numanode     29.44      44.16      44.16        58.88
     2 numanodes    58.88      88.32      88.32        117.76
     4 numanodes    117.76     176.64     176.64       235.52
     8 numanodes    235.52     353.28     353.28       471.04
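The FLOPs figures on this slide can be reproduced with simple arithmetic. The sketch below assumes 4 DP FLOPs per clock per core and 4, 6, 6 and 8 cores per numanode for Shanghai, Istanbul, Magny-Cours and Interlagos respectively. Those per-core and per-numanode values are not stated on the slide; they are assumptions chosen because they match the tabulated numbers.

```python
FREQ_GHZ = 2.3           # core frequency stated on the slide
HPL_EFFICIENCY = 0.80    # HPL efficiency stated on the slide

# Assumption: DP FLOPs per clock per core (for Interlagos, with both cores
# of a compute unit sharing the FPU).
FLOPS_PER_CLK_PER_CORE = 4

# Assumption: cores per numanode, chosen to match the slide's numbers.
CORES_PER_NUMANODE = {
    "SHANGHAI": 4,
    "ISTANBUL": 6,
    "MAGNYCOURS": 6,
    "INTERLAGOS": 8,
}


def hpl_gflops(processor, numanodes):
    """Aggregated DP GF/s expected from HPL on the given number of numanodes."""
    cores = CORES_PER_NUMANODE[processor] * numanodes
    return cores * FLOPS_PER_CLK_PER_CORE * FREQ_GHZ * HPL_EFFICIENCY


if __name__ == "__main__":
    for numanodes in (1, 2, 4, 8):
        row = [f"{hpl_gflops(p, numanodes):8.2f}" for p in CORES_PER_NUMANODE]
        print(f"{numanodes} numanode(s):", *row)
```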
  6. Bulldozer architecture
     • Bulldozer compute unit
       – Core pair
     • Core shared resources
       – L2 cache
       – Floating Point Unit
       – Instruction scheduler
       – Power management
     • Core independent resources
       – L1 data cache
       – Integer unit
  7. Bulldozer block diagram
     • HPC workloads use all the cores for the same kind of computation, in a synchronized fashion.
     • High workload flexibility, such as in the cloud under a power budget.
     Example: cloud workloads can use one core for integer work while the other uses the whole FPU for number crunching.
  8. Socket block diagram
     16 cores are grouped in core pairs into 8 compute units, which are grouped into 2 numanodes. Each numanode has 2 memory channels. The numanodes are interconnected through cHT. The socket delivers 18.5 GB/s x 2 and 60 DP GF/s x 2 under 130 W.
  9. Bulldozer architecture (cont)
     • Flexible Floating Point Unit
       – Work that 1 core can do: 8 DP FLOPs/clk
       – Work that 2 cores can do: 4 DP FLOPs/clk each
         • Example of DGEMM from ACML.
     • FMA4 and FMA3 instructions
       – FMA4 on Interlagos: d = a (+/-) b*c
       – FMA3 on Abu Dhabi: c = a (+/-) b*c
     • AVX instructions
       – Increase IPC by compacting instructions
  10. Where are FMA instructions used?
  11. Bulldozer architecture (cont)
      • Power management:
        – Max power (e.g. 135 W), TDP (115 W), ACP (85 W)
        – Power capping (to limit power consumption)
      • Boost states
        – P-states (HW and SW views)
      • HPC mode (mostly for the HPL benchmark)
      • Throttling
        – Power (too much power consumption, HPL)
        – Thermal (too hot, not enough cooling, protection)
  12. Power management
      [Figure: P-state mapping. HW view: P0 to P7; SW view: P0 to P5. HW P0 and P1 are the boost P-states; HW P2 corresponds to SW P0, the base P-state.]
      [Figure: Measured dynamic power as a percentage of TDP (axis from 0% to 120%), showing the power headroom available for boost. Workloads: Tolf, Applu, HLT, Wupwise, MaxPower128, Galgel, Lucas, Crafty, Vortex, Sixtrack, Eon, Perlbmk, Gzip, Equake, Bzip2, NOP, Art, Mcf, Sim, Vpr, Parser, Gcc, Mesa, Facerec, Gap, Mgrid, Ammp, Fma3d, Apsi.]
  13. Coherent and non coherent fabric
      • Coherent HyperTransport fabric
        – Connects the numanodes with cache coherence
          • MOESI protocol
        – x8 cHT links, x16 cHT links
        – Scenic routing: reroutes traffic to even out the load on x8 / x16 resources
      • Non-coherent HyperTransport
        – RD890 chipset (PCIe gen2)
        – Connects the numanodes with PCI devices
        – Multichipset
  14. Coherent and non coherent fabric
  15. Software Ecosystem
      • Operating systems
      • Compilers
        – Open64, GCC, PGI
      • Math library
        – ACML, AMD LibM
      • Profilers
        – CodeAnalyst
          • Instruction Based Profiling
  16. Operating systems for Interlagos
      • Basic list of OSes providing proper performance
        – Windows Server 2008 R2
        – RHEL 6.2
        – CentOS 6.2
        – SLES 11 SP2
        – Scientific Linux 6.2
      Older versions need specific patches in order to perform well.
  17. Compiler flags
      • Open64 version >= 4.2.5
      • GCC version >= 4.6
      • PGI version >= 11.9
      • Open64 and GCC
        – Compile/link flags: -Ofast -march=bdver1
      • PGI
        – Compile/link flags: -fast -tp Interlagos-64
  18. AMD Core Math Library, download @ developer.amd.com
  19. AMD CodeAnalyst Profiler, download @ developer.amd.com
  20. NUMA definition
  21. Feeding locally versus remotely
      • Locally
        [Diagram: NUMA node 0 with cores 0 to 3 and memory channels 0 and 1]
        E.g. 12 GB/s
      • Remotely
        [Diagram: NUMA node 0 and NUMA node 1, each with cores 0 to 3 and memory channels 0 and 1, connected through cHT]
        Bandwidth constrained by the cHT link (x8, x16); higher latency (1 hop)
        E.g. 7 GB/s at x16, 5 GB/s at x8
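To get a feel for what these example figures imply, the sketch below estimates the effective bandwidth when only part of the traffic is remote. It assumes a simple model in which local and remote transfers do not overlap, so the times per byte add up. That model and the function name are illustrative assumptions; only the 12, 7 and 5 GB/s figures come from the slide.

```python
def effective_bandwidth_gbs(remote_fraction, local_gbs, remote_gbs):
    """Effective GB/s when `remote_fraction` of the bytes come from a remote numanode.

    Assumed model: transfers are not overlapped, so the time per GB is the
    weighted sum of the local and remote times per GB.
    """
    time_per_gb = (1.0 - remote_fraction) / local_gbs + remote_fraction / remote_gbs
    return 1.0 / time_per_gb


LOCAL_GBS = 12.0        # example local bandwidth from the slide
REMOTE_X16_GBS = 7.0    # example remote bandwidth over an x16 cHT link
REMOTE_X8_GBS = 5.0     # example remote bandwidth over an x8 cHT link

if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 1.0):
        x16 = effective_bandwidth_gbs(fraction, LOCAL_GBS, REMOTE_X16_GBS)
        x8 = effective_bandwidth_gbs(fraction, LOCAL_GBS, REMOTE_X8_GBS)
        print(f"{fraction:4.0%} remote: {x16:5.2f} GB/s (x16), {x8:5.2f} GB/s (x8)")
```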
  22. Affinity
      • numactl / numastat tools (Linux)
      • start tool (Windows)
      • HWLOC toolset (Windows, Linux)
        – www.open-mpi.org/projects/hwloc
      • LIKWID toolset (Windows, Linux)
        – http://code.google.com/p/likwid/
      • OpenMP environment variables
        – E.g. Open64: O64_OMP_AFFINITY_MAP
      • MPI runtime flags
        – E.g. Open MPI: --bind-to-core
  23. numactl --hardware and numastat
      Detecting wrong BIOS settings in the system configuration: if NODE INTERLEAVED were ENABLED, there would be only 1 physical NUMA node, with core ids 0,1,2….30,31 and 64 GB of memory.
      Annotations on the tool output:
      – Core ids for NUMA node 3
      – Memory on each NUMA node and how much of it is available (free)
      – Good, no misses
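The check described on this slide can be sketched as a small parser for `numactl --hardware` output. The sample text below is hypothetical (its sizes and free-memory values are made up for illustration), and the helper names are assumptions. The parser relies only on the `node N cpus:` and `node N size:` lines that the tool prints.

```python
import re


def parse_numactl_hardware(text):
    """Return {node_id: {"cpus": [...], "size_mb": int}} from `numactl --hardware` text."""
    nodes = {}
    for line in text.splitlines():
        cpus = re.match(r"node (\d+) cpus:(.*)", line)
        if cpus:
            node = nodes.setdefault(int(cpus.group(1)), {})
            node["cpus"] = [int(c) for c in cpus.group(2).split()]
            continue
        size = re.match(r"node (\d+) size: (\d+) MB", line)
        if size:
            nodes.setdefault(int(size.group(1)), {})["size_mb"] = int(size.group(2))
    return nodes


def looks_node_interleaved(nodes):
    """A single NUMA node on a multi-numanode machine hints at node interleaving in the BIOS."""
    return len(nodes) == 1


def make_sample(num_nodes=4, cores_per_node=8):
    """Build hypothetical `numactl --hardware` output (values are made up)."""
    lines = [f"available: {num_nodes} nodes (0-{num_nodes - 1})"]
    for n in range(num_nodes):
        ids = " ".join(str(c) for c in range(n * cores_per_node, (n + 1) * cores_per_node))
        lines += [f"node {n} cpus: {ids}", f"node {n} size: 16384 MB", f"node {n} free: 15000 MB"]
    return "\n".join(lines)


if __name__ == "__main__":
    nodes = parse_numactl_hardware(make_sample())
    print("core ids for NUMA node 3:", nodes[3]["cpus"])
    print("node interleaving suspected:", looks_node_interleaved(nodes))
```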
  24. EXAMPLE using likwid: hybrid MPI+OpenMP
      • Build the application file and launch an MPI job with hybrid OpenMP, with 1 thread per compute unit on 2 …, using 4 compute nodes.
      • export OMP_NUM_THREADS=4
      • mpirun -app ./appfile
      • Where appfile is (the first core id is repeated to bind the MPI process plus the 4 worker threads):
          -h node 1 -np 1 likwid-pin -q -c 0,0,2,4,6 ./application
          -h node 1 -np 1 likwid-pin -q -c 8,8,10,12,14 ./application
          -h node 1 -np 1 likwid-pin -q -c 16,16,18,20,22 ./application
          -h node 1 -np 1 likwid-pin -q -c 24,24,26,28,30 ./application
          …
          -h node 4 -np 1 likwid-pin -q -c 0,0,2,4,6 ./application
          -h node 4 -np 1 likwid-pin -q -c 8,8,10,12,14 ./application
          -h node 4 -np 1 likwid-pin -q -c 16,16,18,20,22 ./application
          -h node 4 -np 1 likwid-pin -q -c 24,24,26,28,30 ./application
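The appfile follows a regular pattern, so it could be generated instead of typed by hand. The sketch below is one way to do that, assuming 4 numanodes of 8 cores per compute node and that even core ids select one core per compute unit, as the listing on this slide suggests. The function name and its parameters are hypothetical.

```python
def appfile_lines(hosts, numanodes_per_host=4, cores_per_numanode=8, app="./application"):
    """Build one appfile line per numanode: 1 MPI process plus 4 pinned worker threads."""
    lines = []
    for host in hosts:
        for numanode in range(numanodes_per_host):
            base = numanode * cores_per_numanode
            # Assumption: stepping by 2 picks one core per compute unit.
            workers = range(base, base + cores_per_numanode, 2)
            # The first core id is repeated, as on the slide, to bind the
            # MPI process in addition to the worker threads.
            core_list = ",".join(str(core) for core in [base, *workers])
            lines.append(f"-h {host} -np 1 likwid-pin -q -c {core_list} {app}")
    return lines


if __name__ == "__main__":
    # Host names follow the slide's "node 1" to "node 4" naming.
    hosts = [f"node {i}" for i in range(1, 5)]
    print("\n".join(appfile_lines(hosts)))
```

The printed lines can be redirected into `./appfile` and then used with the launch command shown on the slide.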
  25. Putting it all together
      Pre-exascale (high computing density) system
        – Multicore
        – Multisocket
        – Multichipset
        – Multirail
        – MultiGPU
        – Dynamically reconfigurable multi-root PCI devices through workload analysis
  27. More @ http://developer.amd.com
      • X86 Open64 Compilers Suite (http://developer.amd.com/tools/open64/)
      • AMD Developer Tools (http://developer.amd.com/tools/)
      • AMD Libraries (ACML, LibM, etc.) (http://developer.amd.com/libraries/)
      • AMD Opteron™ 4200/6200 Series processors Compiler Options Quick Guide (http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf)
      • AMD OpenCL™ Zone (http://developer.amd.com/zones/OpenCLZone/)
      • AMD HPC (www.amd.com/hpc)
      • AMD APP SDK Documentation (http://developer.amd.com/sdks/AMDAPPSDK/documentation/Pages/default.aspx)
      • Using the x86 Open64 Compiler Suite (http://developer.amd.com/tools/open64/Documents/open64.html)
      • x86 Open64 4.2.5.2 Release Notes (http://developer.amd.com/tools/open64/assets/ReleaseNotes.txt)
      • ACML 5.0 Information (http://developer.amd.com/libraries/acml/features/pages/default.aspx)
      • Software Optimization Guide for “Bulldozer” processors (http://support.amd.com/us/Processor_TechDocs/47414.pdf)
      • AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions (http://support.amd.com/us/Embedded_TechDocs/43479.pdf)
      • Links to the 2- and 4-socket results for the AMD Opteron™ 6276 Series processors (16 cores, 2.3 GHz). The SPEC runs used the X86 Open64 Compiler Suite.
        http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111025-18742.pdf
        http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111025-18748.pdf
