The document discusses strategies for improving application performance on POWER9 processors using IBM XL and open source compilers. It reviews key POWER9 features and outlines common bottlenecks like branches, register spills, and memory issues. It provides guidelines on using compiler options and coding practices to address these bottlenecks, such as unrolling loops, inlining functions, and prefetching data. Tools like perf are also described for analyzing performance bottlenecks.
1. POWER9 Features and Strategies for
improving Application Performance
on POWER9 with IBM XL and Open
Source compilers
Archana Ravindar
LLVM Compiler Performance(POWER Systems Performance), ISDL
aravind5@in.ibm.com
https://in.linkedin.com/in/archana-ravindar-0259625b
2. Scope of the Presentation
• Review POWER9 processor features
• Outline common bottlenecks caused by certain program characteristics
• How to identify these issues using tools on POWER9 Linux
• What compiler options can be used to reduce the impact of these characteristics
• How to code programs so that such situations do not arise
• POWER Linux platform
• Compilers: XL, GCC wherever applicable
• Performance tools: perf
3. POWER Processor Technology
Roadmap
1H10 – POWER7 (45 nm), Enterprise
- 8 cores
- SMT4
- eDRAM L3 cache

2H12 – POWER7+ (32 nm), Enterprise
- 2.5x larger L3 cache
- On-die acceleration
- Zero-power core idle state

1H14 – 2H16 – POWER8 Family (22 nm), Enterprise & Big Data Optimized
- Up to 12 cores
- SMT8
- CAPI acceleration
- High-bandwidth GPU attach

2H17 – 2H18+ – POWER9 Family (14 nm), Built for the Cognitive Era
- Enhanced core and chip architecture optimized for emerging workloads
- Processor family with scale-up and scale-out optimized silicon
- Premier platform for accelerated computing
4. POWER9 Family – Deep Workload
Optimizations
Emerging Analytics, AI, Cognitive
- New core for stronger thread performance
- Delivers 2x compute resource per socket
- Built for acceleration – OpenPOWER solution enablement
Technical / HPC
- Highest bandwidth GPU attach
- Advanced GPU/CPU interaction and memory sharing
- High bandwidth direct attach memory
Cloud / HSDC
- Power / Packaging / Cost optimizations for a range of platforms
- Superior virtualization features: security, power management, QoS, interrupt
- State of the art IO technology for network and storage performance
Enterprise
- Large, flat, Scale-Up Systems
- Buffered memory for maximum capacity
- Leading RAS
- Improved caching
6. Shorter Pipelines with Reduced Disruption
Improved application performance for modern codes
• Shorten fetch to compute by 5 cycles
• Advanced branch prediction
Higher performance and pipeline utilization
• Improved instruction management
– Removed instruction grouping and reduced cracking
– Complete up to 128 (64 – SMT4 Core) instructions per cycle
Reduced latency and improved scalability
• Local pipe control of load/store operations
– Improved hazard avoidance
– Local recycles – reduced hazard disruption
– Improved lock management
POWER9 Core Pipeline Efficiency
7. POWER ISA v3.0
Broader data type support
• 128-bit IEEE 754 Quad-Precision Float – Full width quad-precision for financial and security applications
• Expanded BCD and 128b Decimal Integer – For database and native analytics
• Half-Precision Float Conversion – Optimized for accelerator bandwidth and data exchange
Support Emerging Algorithms
• Enhanced Arithmetic and SIMD
• Random Number Generation Instruction
Accelerate Emerging Workloads
• Memory Atomics – For high scale data-centric applications
• Hardware Assisted Garbage Collection – Optimize response time of interpretive languages
Cloud Optimization
• Enhanced Translation Architecture – Optimized for Linux
• New Interrupt Architecture – Automated partition routing for extreme virtualization
• Enhanced Accelerator Virtualization
• Hardware Enforced Trusted Execution
Energy & Frequency Management
• POWER9 Workload Optimized Frequency – Manage energy between threads and cores with reduced wakeup latency
New Instruction Set Architecture Implemented on POWER9
8. Acceleration Super Highway
• 5.6x more data throughput vs. PCIe Gen3, with NVIDIA NVLink optimization to the core
• 2x bandwidth with PCIe Gen4 vs. PCIe Gen3
• Access up to 2TB of system memory, delivered with coherence … only on POWER!
• Superior data transfer to multiple devices
• 25G links to OpenCAPI and GPU devices
• GPU-CPU and GPU-GPU speed-up
9. Scope of the Compiler
The compiler is an important layer in the system stack and is crucial for application performance.
The compiler is intimately aware of the processor design; its functionality is implemented with the latencies of the hardware units and the movement of instructions within the pipe in mind.
The compiler is designed to emit the appropriate ISA depending on which architecture a program is compiled for.
Based on the target architecture, instruction scheduling is performed to ensure a smooth flow of instructions through the pipe.
IBM XL is a proprietary compiler that has pioneered several optimization innovations over the past three decades.
Increasingly, IBM has embraced open source compilers such as GCC and LLVM to leverage community participation and innovation.
This presentation focuses on how to leverage IBM XL and the open source compilers to obtain optimum performance on POWER9.
10. Tools that we use in the Discussion
• Compilers
– IBM proprietary compilers: xlC/xlc/xlf
– xlc -O[n] program.c -o program : n ranges from 0 to 5
– Some common options: -qhot (array-intensive programs), -qtune=pwr9, -qsimd (enable SIMD), etc.
– Profile-directed feedback (-qpdf1, -qpdf2)
– Open source compilers: GCC, LLVM
– -O[n]: n ranges from 0 to 3, plus -Ofast
– Common options: -mcpu=power9
– Profile-directed feedback (-fprofile-generate, -fprofile-use)
• perf tool
– To record hotspots / profile the application
• perf record -e r<code> ./binary args > out (produces perf.data)
• perf report (opens the profile report stored in perf.data)
– To measure hardware events
• perf stat -e r<code> ./binary args > out
– For more details, refer to the perf manpage
11. The Processor Can Be Thought of as Containing Two Components: the Front End and the Back End
• The front end ensures a smooth supply of instructions to the back end for execution
• The back end is concerned only with executing instructions
• Code that has *too many* branches can cause the processor to fetch more instructions than required, which hurts performance
12. Branches
• Branches are predicted well in advance, because resolving a branch condition takes time; waiting for resolution would introduce a bubble in the pipeline and slow down execution
• POWER9 has an advanced branch predictor that uses complex structures to track context-based branch histories, and it does a very good job of predicting branches accurately. However, certain applications coded in a complex way can still cause high misprediction rates
• A wrong prediction is a misprediction
– Counters to detect this: PM_BR_MPRED*, PM_FLUSH_BR_MPRED
– Use perf stat -e r<code> ./program arguments > out to collect the various counters
• Function calls also generate branches; such branches affect instruction cache locality and increase instruction cache misses
– Counter to detect this: PM_L1_ICACHE_MISS
• Branches within loops hinder vectorization/SIMD opportunities
13. Guidelines to reduce branches
• Options to reduce loop /call branches
– #pragma unroll(N) or (XL) -qunroll : Unrolling loops (GCC/LLVM: -funroll-loops compiler flag) (reduces loop branches)
– (XL) -qinline=auto:level=<N> (N=1, .. 10) Inlining routines (will reduce function call jump/return)
– Corresponding GCC/LLVM compiler option: -finline-functions
• Loop Versioning: Slow version (that contains branches) + Fast version of loops (that
does not contain branches) (Usually done automatically by compilers at higher levels
of optimization)
• Provide hints in source code to indicate the expected values of expressions appearing
in branch conditions (long __builtin_expect(long expression, long value);) (hint
whether branch is more likely to be taken/not)
• If-conversion: remove simple branches wherever possible by rewriting patterns such as
– if (val != 0) a = a + val;  can be rewritten as  a += val;
– if (val == 0) a = a + 1;  can be rewritten as  a += !val;
14. Register Spills
• In a RISC architecture, predominantly, instructions operate on
registers
– Load and store instructions are used to transfer data between memory and registers
• When #live variables > #available registers, a spill is performed
• 1 spill = 1 store + 1 load
• *Spilling hot variables can hit performance*
– Spills can cause Load Hit Stores (stores followed by load to the same
address which may cause a delay in the pipe depending on the
separating distance)
– Spills increase path length and add address-arithmetic instructions
– Unnecessary reads/writes to memory
• Issues due to spills show up in the following counters: PM_LSU_FIN, PM_LSU_FLUSH, PM_LSU_REJECT_LHS, PM_INST_CMPL, PM_FXU_FIN
15. Guidelines to reduce spills
• Limit extensive unrolling/inlining, which can create long live ranges for variables
– Best to leave the compiler to do the inlining using its own heuristics
• XL compiler option: -qcompact can help
• In programs that use mixed-mode operands (signed, unsigned, etc.) extensively, the conversions use up extra registers
• Use other register resources such as SIMD registers if applicable; use vectorization wherever applicable, or code such that the compiler vectorizes automatically
• Use special POWER ISA instructions such as andc (AND with complement) and orc (OR with complement), which combine multiple logical operations into a single instruction, saving a register; compilers usually generate these instructions when -mcpu=power9 / -qarch=pwr9 is used
• Example: computing R3 = R1 AND (NOT R2)
– Without andc (two instructions, one extra register): R4 = not(R2); R3 = R1 and R4
– With andc (one instruction): R3 = R1 andc R2
16. Memory Unit
• Memory is organized in a hierarchy
• L1 cache: closest memory to the processor and the fastest, followed by L2 and L3, up to main memory
• Main memory is the most distant from the processor and the slowest
• Data cache: stores data; instruction cache: stores instructions
• Data cache misses can stall load instructions in the pipeline, causing a cascading effect on all the instructions dependent on them
• Counters: PM_LD_MISS_L1, PM_CMPLU_STALL_DCACHE_MISS, PM_ST_MISS_L1, PM_CMPLU_STALL_DMISS_L2L3, PM_CMPLU_STALL_DMISS_LMEM, etc.
Approximate load-to-use latencies: L1 cache ~3 cycles, L2 cache ~15.5 cycles, L3 cache ~35.5 cycles, main memory ~74.5 ns
17. Techniques to optimize memory performance
• Memory footprint reduction wherever possible
– If you have enums declared in your program, -qenum=small allocates just one byte per enum vs. the 4 bytes allocated by default
– Replace bytemaps (1 byte to store a '0' or a '1') with bitmaps wherever possible
• Hardware prefetching
– Controlled by DSCR settings: ppc64_cpu --dscr=<n>
– Common DSCR configurations
• 0 (all default values)
• 0x1D7 (achieve the most aggressive depth, most quickly, enable stride-N prefetch)
• 1 (no prefetch)
– The POWER8 tuning guide has a detailed description of DSCR settings
• Software prefetching
– Programmer-inserted prefetch instructions __dcbt, __dcbtst
– Prefetch parameters can be tuned: -qprefetch=aggressive:dscr=<value>
– Available GCC prefetch options: -fprefetch-loop-arrays / -fno-prefetch-loop-arrays
– To explicitly control prefetching via software, turn off hardware prefetching using the ppc64_cpu --dscr command (under root privileges)
18. Compiler Flags at a Glance (flag kind; XL; GCC/LLVM; source-level equivalent; benefit; drawbacks)

• Unrolling
– XL: -qunroll; GCC/LLVM: -funroll-loops; in source: #pragma unroll(N)
– Benefit: unrolls loops; increases scheduling opportunities for the compiler
– Drawback: increases register pressure
• Inlining
– XL: -qinline=auto:level=N; GCC/LLVM: -finline-functions; in source: always-inline attribute or manual inlining
– Benefit: increases scheduling opportunities; reduces branches and loads/stores
– Drawback: increases register pressure; increases code size
• Enum small
– XL: -qenum=small; GCC/LLVM: -fshort-enums; in source: manual typedef
– Benefit: reduces memory footprint
– Drawback: can cause alignment issues
• isel instructions
– GCC/LLVM: -misel; in source: the ?: operator
– Benefit: generates an isel instruction instead of a branch; reduces pressure on the branch predictor unit
– Drawback: latency of isel is a bit higher; use when branches are not easily predictable
• General tuning
– XL: -qarch=pwr9, -qtune=pwr9; GCC/LLVM: -mcpu=power9, -mtune=power9
– Benefit: turns on platform-specific tuning such as ISA selection and scheduling
• 64-bit compilation
– XL: -q64; GCC/LLVM: -m64
• Prefetching
– XL: -qprefetch[=aggressive]; GCC/LLVM: -fprefetch-loop-arrays; in source: __dcbt/__dcbtst, __builtin_prefetch
– Benefit: reduces cache misses
– Drawback: can increase memory traffic, particularly if prefetched values are not used
• Link-time optimization
– XL: -qipo; GCC/LLVM: -flto, -flto=thin
– Benefit: enables interprocedural optimizations
– Drawback: can increase overall compilation time
• Profile-directed feedback
– XL: -qpdf1, -qpdf2; GCC/LLVM: -fprofile-generate and -fprofile-use (LLVM has an intermediate llvm-profdata step)
– Benefit: enables hot-path optimizations
– Drawback: requires a training run
20. Summary
• Today we talked about
– Various performance issues that can occur in an application on POWER9 Linux
– How to identify them
– What we can do to improve performance during compilation
– What we can do to improve performance while coding the application itself
• We saw that POWER9 has a comprehensive set of hardware counters that enables analysts to understand application performance and get to the bottlenecks quickly
• We saw that the IBM XL compilers, and equivalently open source compilers such as GCC and LLVM, have a diverse set of options tailored to different needs to obtain the required performance
Memory enhancements, advances in graphics processing units (GPUs), interconnects, and bandwidth all provide building blocks for a better-performing AI architecture. In fact, the POWER9 AC922 marks what will become an industry requirement: welcome to the “off-chip” era (where advanced accelerators like GPUs and FPGAs are engineered to drive modern workloads) and the sunset of the “totally on-chip” era, where processing is integrated on a single chip.
POWER9 is the first commercial architecture loaded with NVIDIA’s next-generation NVLink (the AC922’s optimization isn’t just GPU-to-GPU like other commercial platforms; it also includes GPU-to-CPU, where it’s needed the most), OpenCAPI, and PCI-Express 4.0. Think of these technologies as a giant hose for transferring data.
This slide shows a bit of a deeper look into what we are talking about when we say “Cutting Edge” and built for Enterprise AI.
The AC922 combined with NVIDIA Next Generation NVLink technology provides 5.6x more data throughput when compared to PCIe Gen3. And since this server comes with PCIe Gen4, it should be noted that Gen4 delivers 2x the throughput when compared to PCIe Gen3’s bandwidth.
Finally, the server delivers simplified execution for Enterprise AI with up to 2 TB of coherent memory for use in complex model building.