The document discusses strategies for improving application performance on POWER9 processors using IBM XL and open source compilers. It reviews key POWER9 features and outlines common bottlenecks like branches, register spills, and memory issues. It provides guidelines on using compiler options and coding practices to address these bottlenecks, such as unrolling loops, inlining functions, and prefetching data. Tools like perf are also described for analyzing performance bottlenecks.
1. POWER9 Features and Strategies for
improving Application Performance
on POWER9 with IBM XL and Open
Source compilers
Archana Ravindar
LLVM Compiler Performance(POWER Systems Performance), ISDL
aravind5@in.ibm.com
https://in.linkedin.com/in/archana-ravindar-0259625b
2. Scope of the Presentation
• Review POWER9 processor features
• Outline common bottlenecks caused by certain program characteristics
• How to identify these issues using tools on POWER9 Linux
• What compiler options can be used to reduce the impact of these characteristics
• How to code programs so that such situations do not arise
• POWER Linux platform
• Compilers: XL, GCC wherever applicable
• Performance tools: perf
3. POWER Processor Technology
Roadmap
1H10 – POWER7 (45 nm), Enterprise
- 8 cores
- SMT4
- eDRAM L3 cache

2H12 – POWER7+ (32 nm), Enterprise
- 2.5x larger L3 cache
- On-die acceleration
- Zero-power core idle state

1H14 – 2H16 – POWER8 Family (22 nm), Enterprise & Big Data Optimized
- Up to 12 cores
- SMT8
- CAPI acceleration
- High-bandwidth GPU attach

2H17 – 2H18+ – POWER9 Family (14 nm), Built for the Cognitive Era
- Enhanced core and chip architecture optimized for emerging workloads
- Processor family with scale-up and scale-out optimized silicon
- Premier platform for accelerated computing
4. POWER9 Family – Deep Workload
Optimizations
Emerging Analytics, AI, Cognitive
- New core for stronger thread performance
- Delivers 2x compute resource per socket
- Built for acceleration – OpenPOWER solution enablement
Technical / HPC
- Highest bandwidth GPU attach
- Advanced GPU/CPU interaction and memory sharing
- High bandwidth direct attach memory
Cloud / HSDC
- Power / Packaging / Cost optimizations for a range of platforms
- Superior virtualization features: security, power management, QoS, interrupt
- State of the art IO technology for network and storage performance
Enterprise
- Large, flat, Scale-Up Systems
- Buffered memory for maximum capacity
- Leading RAS
- Improved caching
6. Shorter Pipelines with Reduced Disruption
Improved application performance for modern codes
• Shorten fetch to compute by 5 cycles
• Advanced branch prediction
Higher performance and pipeline utilization
• Improved instruction management
– Removed instruction grouping and reduced cracking
– Complete up to 128 (64 – SMT4 Core) instructions per cycle
Reduced latency and improved scalability
• Local pipe control of load/store operations
– Improved hazard avoidance
– Local recycles – reduced hazard disruption
– Improved lock management
POWER9 Core Pipeline Efficiency
7. POWER ISA v3.0
Broader data type support
• 128-bit IEEE 754 Quad-Precision Float – Full width quad-precision for financial and security applications
• Expanded BCD and 128b Decimal Integer – For database and native analytics
• Half-Precision Float Conversion – Optimized for accelerator bandwidth and data exchange
Support Emerging Algorithms
• Enhanced Arithmetic and SIMD
• Random Number Generation Instruction
Accelerate Emerging Workloads
• Memory Atomics – For high scale data-centric applications
• Hardware Assisted Garbage Collection – Optimize response time of interpretive languages
Cloud Optimization
• Enhanced Translation Architecture – Optimized for Linux
• New Interrupt Architecture – Automated partition routing for extreme virtualization
• Enhanced Accelerator Virtualization
• Hardware Enforced Trusted Execution
Energy & Frequency Management
• POWER9 Workload Optimized Frequency – Manage energy between threads and cores with reduced wakeup latency
New Instruction Set Architecture Implemented on POWER9
8. Acceleration Super Highway
• 5.6x more data throughput vs. PCIe Gen3, with NVIDIA NVLink optimization to the core
• 2x bandwidth with PCIe Gen4 vs. PCIe Gen3
• Access up to 2TB of system memory, delivered with coherence … only on POWER!
• Superior data transfer to multiple devices
• 25G links to OpenCAPI and GPU devices
• GPU-CPU and GPU-GPU speed-up
9. Scope of the Compiler
The compiler is an important layer in the system stack and is crucial for application performance.
The compiler is intimately aware of the processor design; its functionality is implemented with the latencies of the hardware units and the movement of instructions within the pipe in mind.
The compiler is designed to emit the appropriate ISA depending on which architecture a program is compiled for.
Based on the target architecture, instruction scheduling is performed to ensure a smooth flow of instructions through the pipe.
IBM XL is a proprietary compiler that has pioneered several optimization innovations over the past three decades.
Increasingly, IBM has embraced open source compilers such as GCC and LLVM to leverage community participation and innovation.
This presentation focuses on how to leverage IBM XL and the open source compilers to obtain optimum performance on POWER9.
10. Tools that we use in the Discussion
• Compilers
– IBM proprietary compilers: xlC/xlc/xlf
– xlc -O[n] program.c -o program : n ranges from 0 to 5
– Some common options: -qhot (array-intensive programs), -qtune=pwr9, -qsimd (enable SIMD), etc.
– Profile-directed feedback (-qpdf1, -qpdf2)
– Open source compilers: GCC, LLVM
– -O[n]: n ranges from 0 to 3, plus -Ofast
– Common options: -mcpu=power9
– Profile-directed feedback (-fprofile-generate, -fprofile-use)
• perf tool
– To record hotspots / profile the application
• perf record -e r<code> ./binary args > out (produces perf.data)
• perf report (opens the profile report stored in perf.data)
– To measure hardware events
• perf stat -e r<code> ./binary args > out
– For more details, refer to the perf manpage
11. The Processor Can Be Thought of as Containing Two Components: the Front End and the Back End
• The front end ensures a smooth supply of instructions to the back end for execution
• The back end is concerned only with executing instructions
• Code that has *too many* branches can cause the processor to fetch more instructions than required, which hurts performance
12. Branches
• Branches are predicted well in advance, because resolving a branch condition takes time; waiting for resolution would introduce a bubble in the pipeline and slow down execution
• POWER9 has an advanced branch predictor that uses complex structures to track context-based branch histories, and it does a very good job of predicting branches accurately. However, certain applications coded in a complex way can still cause high misprediction rates
• A wrong prediction is a misprediction
– Counters to detect this: PM_BR_MPRED*, PM_FLUSH_BR_MPRED
– Use perf stat -e r<code> ./program arguments > out to collect the various counters
• Function calls also generate branches; such branches affect instruction cache locality and increase instruction cache misses
– Counter to detect this: PM_L1_ICACHE_MISS
• Branches within loops hinder vectorization/SIMD opportunities
13. Guidelines to reduce branches
• Options to reduce loop /call branches
– #pragma unroll(N) or (XL) -qunroll : Unrolling loops (GCC/LLVM: -funroll-loops compiler flag) (reduces loop branches)
– (XL) -qinline=auto:level=<N> (N=1, .. 10) Inlining routines (will reduce function call jump/return)
– Corresponding GCC/LLVM compiler option: -finline-functions
• Loop Versioning: Slow version (that contains branches) + Fast version of loops (that
does not contain branches) (Usually done automatically by compilers at higher levels
of optimization)
• Provide hints in source code to indicate the expected values of expressions appearing
in branch conditions (long __builtin_expect(long expression, long value);) (hint
whether branch is more likely to be taken/not)
• If-conversion: remove simple branches wherever possible by rewriting patterns such as
– if (val != 0) a = a + val;  can be rewritten as  a += val;
– if (val == 0) a = a + 1;  can be rewritten as  a += !val;
14. Register Spills
• In a RISC architecture, predominantly, instructions operate on
registers
– Load and store instructions are used to transfer data between memory and registers
• When #live variables > #available registers, a spill is performed
• 1 spill = 1 store + 1 load
• *Spilling hot variables can hit performance*
– Spills can cause Load Hit Stores (stores followed by load to the same
address which may cause a delay in the pipe depending on the
separating distance)
– Spills increase path length and add address-arithmetic instructions
– Unnecessary reads/writes to memory
• Issues due to spills show up in the following counters: PM_LSU_FIN, PM_LSU_FLUSH, PM_LSU_REJECT_LHS, PM_INST_CMPL, PM_FXU_FIN
15. Guidelines to reduce spills
• Limit extensive unrolling/inlining, which can create long live ranges for variables
– Best to leave the compiler to do the inlining using its own heuristics
• XL compiler option: -qcompact can help
• In programs that use mixed-mode operands (signed, unsigned, etc.) extensively, the conversions use up extra registers
• Use other register resources such as SIMD registers if applicable; use vectorization wherever applicable, or code such that the compiler vectorizes automatically
• Use special POWER ISA instructions such as andc (AND with complement) and orc (OR with complement), which combine multiple logical operations into a single instruction, saving a register; compilers usually generate these instructions when -mcpu=power9 / -qarch=pwr9 is used
• Example: computing R3 = R1 AND (NOT R2)
– Without andc (two instructions, one extra register): R4 = not(R2); R3 = R1 and R4
– With andc (one instruction): R3 = R1 andc R2
16. Memory Unit
• Memory is organized in a hierarchy
• L1 cache: closest memory to the processor and the fastest, followed by L2 and L3, up to main memory
• Main memory is the most distant from the processor and the slowest
• Data cache: stores data; instruction cache: stores instructions
• Data cache misses can stall load instructions in the pipeline, causing a cascading effect on all the instructions dependent on them
• Counters: PM_LD_MISS_L1, PM_CMPLU_STALL_DCACHE_MISS, PM_ST_MISS_L1, PM_CMPLU_STALL_DMISS_L2L3, PM_CMPLU_STALL_DMISS_LMEM, etc.
Approximate load-to-use latencies: L1 cache ~3 cycles, L2 cache ~15.5 cycles, L3 cache ~35.5 cycles, main memory ~74.5 ns
17. Techniques to optimize memory performance
• Memory footprint reduction wherever possible
– If you have enums declared in your program, -qenum=small allocates just one byte per enum vs. the 4 bytes allocated by default
– Replace bytemaps (1 byte to store a '0' or a '1') with bitmaps wherever possible
• Hardware prefetching
– Controlled by DSCR settings: ppc64_cpu --dscr=<n>
– Common DSCR configurations
• 0 (all default values)
• 0x1D7 (achieve the most aggressive depth, most quickly, enable stride-N prefetch)
• 1 (no prefetch)
– The POWER8 tuning guide has a detailed description of DSCR settings
• Software prefetching
– Programmer-inserted prefetch instructions __dcbt, __dcbtst
– Prefetch parameters can be tuned: -qprefetch=aggressive:dscr=<value>
– Available GCC prefetch options: -fprefetch-loop-arrays / -fno-prefetch-loop-arrays
– To explicitly control prefetching via software, turn off hardware prefetching using the ppc64_cpu --dscr command (under root privileges)
18. Compiler Flags at a Glance (flag kind; XL; GCC/LLVM; source-level equivalent; benefit; drawbacks)

• Unrolling
– XL: -qunroll; GCC/LLVM: -funroll-loops; in source: #pragma unroll(N)
– Benefit: unrolls loops; increases scheduling opportunities for the compiler
– Drawback: increases register pressure
• Inlining
– XL: -qinline=auto:level=N; GCC/LLVM: -finline-functions; in source: always-inline attribute or manual inlining
– Benefit: increases scheduling opportunities; reduces branches and loads/stores
– Drawback: increases register pressure; increases code size
• Enum small
– XL: -qenum=small; GCC/LLVM: -fshort-enums; in source: manual typedef
– Benefit: reduces memory footprint
– Drawback: can cause alignment issues
• isel instructions
– GCC/LLVM: -misel; in source: the ?: operator
– Benefit: generates an isel instruction instead of a branch; reduces pressure on the branch predictor unit
– Drawback: latency of isel is a bit higher; use when branches are not easily predictable
• General tuning
– XL: -qarch=pwr9, -qtune=pwr9; GCC/LLVM: -mcpu=power9, -mtune=power9
– Benefit: turns on platform-specific tuning such as ISA selection and scheduling
• 64-bit compilation
– XL: -q64; GCC/LLVM: -m64
• Prefetching
– XL: -qprefetch[=aggressive]; GCC/LLVM: -fprefetch-loop-arrays; in source: __dcbt/__dcbtst, __builtin_prefetch
– Benefit: reduces cache misses
– Drawback: can increase memory traffic, particularly if prefetched values are not used
• Link-time optimization
– XL: -qipo; GCC/LLVM: -flto, -flto=thin
– Benefit: enables interprocedural optimizations
– Drawback: can increase overall compilation time
• Profile-directed feedback
– XL: -qpdf1, -qpdf2; GCC/LLVM: -fprofile-generate and -fprofile-use (LLVM has an intermediate llvm-profdata step)
– Benefit: enables hot-path optimizations
– Drawback: requires a training run
20. Summary
• Today we talked about
– Various performance issues that can occur in an application on POWER9 Linux
– How to identify them
– What we can do to improve performance during compilation
– What we can do to improve performance while coding the application itself
• We saw that POWER9 has a comprehensive set of hardware counters that enables analysts to understand application performance and get to the bottlenecks quickly
• We saw that the IBM XL compilers, and equivalently open source compilers such as GCC and LLVM, have a diverse set of options tailored to different needs to obtain the required performance
Memory enhancements, advances in graphics processing units (GPUs), interconnects, and bandwidth all provide building blocks for a better-performing AI architecture. In fact, the POWER9 AC922 marks what will become an industry requirement: welcome to the “off-chip” era (where advanced accelerators like GPUs and FPGAs are engineered to drive modern workloads) and the sunset of the “totally on-chip” era, where processing is integrated on a single chip.
POWER9 is the first commercial architecture loaded with NVIDIA’s next-generation NVLink (the AC922’s optimization isn’t just GPU-to-GPU like other commercial platforms; it also includes GPU-to-CPU, where it’s needed the most), OpenCAPI, and PCI-Express 4.0. Think of these technologies as a giant hose for transferring data.
This slide shows a bit of a deeper look into what we are talking about when we say “Cutting Edge” and built for Enterprise AI.
The AC922 combined with NVIDIA Next Generation NVLink technology provides 5.6x more data throughput when compared to PCIe Gen3. And since this server comes with PCIe Gen4, it should be noted that Gen4 delivers 2x the throughput when compared to PCIe Gen3’s bandwidth.
Finally, the server delivers simplified execution for Enterprise AI with up to 2 TB of coherent memory for use in complex model building.