POWER9 Features and Strategies for
improving Application Performance
on POWER9 with IBM XL and Open
Source compilers
Archana Ravindar
LLVM Compiler Performance(POWER Systems Performance), ISDL
aravind5@in.ibm.com
https://in.linkedin.com/in/archana-ravindar-0259625b
Scope of the Presentation
• Review POWER9 processor features
• Outline common bottlenecks encountered due to
certain program characteristics
• How to identify these issues using tools on POWER9
Linux
• What compiler options can be used to reduce the
impact of these characteristics
• How we can code programs so that such
situations do not arise
• POWER Linux platform
• Compilers- XL, gcc wherever applicable
• Performance Tools- perf
POWER Processor Technology Roadmap
• POWER7 (1H10, 45 nm), Enterprise
– 8 Cores
– SMT4
– eDRAM L3 Cache
• POWER7+ (2H12, 32 nm), Enterprise
– 2.5x Larger L3 cache
– On-die acceleration
– Zero-power core idle state
• POWER8 Family (1H14 – 2H16, 22 nm), Enterprise & Big Data Optimized
– Up to 12 Cores
– SMT8
– CAPI Acceleration
– High Bandwidth GPU Attach
• POWER9 Family (2H17 – 2H18+, 14 nm), Built for the Cognitive Era
– Enhanced Core and Chip Architecture Optimized for Emerging Workloads
– Processor Family with Scale-Up and Scale-Out Optimized Silicon
– Premier Platform for Accelerated Computing
POWER9 Family – Deep Workload
Optimizations
Emerging Analytics, AI, Cognitive
- New core for stronger thread performance
- Delivers 2x compute resource per socket
- Built for acceleration – OpenPOWER solution enablement
Technical / HPC
- Highest bandwidth GPU attach
- Advanced GPU/CPU interaction and memory sharing
- High bandwidth direct attach memory
Cloud / HSDC
- Power / Packaging / Cost optimizations for a range of platforms
- Superior virtualization features: security, power management, QoS, interrupt
- State of the art IO technology for network and storage performance
Enterprise
- Large, flat, Scale-Up Systems
- Buffered memory for maximum capacity
- Leading RAS
- Improved caching
POWER9 Core Execution Slice Microarchitecture
Modular Execution Slices
Re-factored Core Provides Improved Efficiency & Workload Alignment
• Enhanced pipeline efficiency with modular execution and intelligent pipeline control
• Increased pipeline utilization with symmetric data-type engines: Fixed, Float, 128b, SIMD
• Shared compute resource optimizes data-type interchange
[Figure: POWER8 SMT8 core alongside the POWER9 SMT8 and SMT4 cores, built from 64b slices and 128b super-slices]
POWER9 Core Pipeline Efficiency
Shorter Pipelines with Reduced Disruption
Improved application performance for modern codes
• Shorten fetch to compute by 5 cycles
• Advanced branch prediction
Higher performance and pipeline utilization
• Improved instruction management
– Removed instruction grouping and reduced cracking
– Complete up to 128 (64 – SMT4 Core) instructions per cycle
Reduced latency and improved scalability
• Local pipe control of load/store operations
– Improved hazard avoidance
– Local recycles – reduced hazard disruption
– Improved lock management
POWER ISA v3.0
Broader data type support
• 128-bit IEEE 754 Quad-Precision Float – Full width quad-precision for financial and security applications
• Expanded BCD and 128b Decimal Integer – For database and native analytics
• Half-Precision Float Conversion – Optimized for accelerator bandwidth and data exchange
Support Emerging Algorithms
• Enhanced Arithmetic and SIMD
• Random Number Generation Instruction
Accelerate Emerging Workloads
• Memory Atomics – For high scale data-centric applications
• Hardware Assisted Garbage Collection – Optimize response time of interpretive languages
Cloud Optimization
• Enhanced Translation Architecture – Optimized for Linux
• New Interrupt Architecture – Automated partition routing for extreme virtualization
• Enhanced Accelerator Virtualization
• Hardware Enforced Trusted Execution
Energy & Frequency Management
• POWER9 Workload Optimized Frequency – Manage energy between threads and cores with reduced wakeup latency
New Instruction Set Architecture Implemented on POWER9
Acceleration Super Highway
• 5.6x more data throughput vs. PCIe Gen3, with NVIDIA NVLink optimization to the core
• 2x bandwidth with PCIe Gen4 vs. PCIe Gen3
• Access up to 2TB of system memory, delivered with coherence … only on POWER!
• Superior data transfer to multiple devices: 25G links to OpenCAPI and GPU devices
• GPU-CPU and GPU-GPU speed-up
Scope of the Compiler
• The compiler is an important layer in the system stack and is crucial for application performance
• The compiler is intimately aware of the processor design; its functionality is implemented with the latencies of the hardware units and the movement of instructions within the pipe in mind
• The compiler emits the appropriate ISA for the architecture a program is compiled for
• Scheduling is done per architecture to ensure a smooth flow of instructions through the pipe
• IBM XL is a proprietary compiler that has pioneered several optimization innovations over the past three decades
• IBM has increasingly embraced open source compilers such as GCC and LLVM to leverage community participation and innovation
• This presentation focuses on how we can leverage IBM XL and open source compilers to obtain optimum performance on POWER9
Tools that we use in the Discussion
• Compilers
– IBM proprietary compilers: xlC/xlc/xlf
– xlc -O[n] program.c -o program : n ranges from 0 to 5
– Some common options: -qhot (array-intensive programs), -qtune=pwr9, -qsimd (enable SIMD), etc.
– Profile directed feedback (-qpdf1, -qpdf2)
– Open source compilers: GCC, LLVM
– -O[n]: n ranges from 0 to 3, plus -Ofast
– Common options: -mcpu=power9 (on POWER, GCC and LLVM select the target with -mcpu rather than -march)
– Profile directed feedback (-fprofile-generate, -fprofile-use)
• Perf tool
– To record hotspots/profile an application
• perf record -e r<code> ./binary args > out (produces perf.data)
• perf report (opens the profile report stored in perf.data)
– To measure hardware events
• perf stat -e r<code> ./binary args > out
– For more details, refer to the perf man page
A processor can be thought of as containing two
components
•The front end ensures a smooth supply of instructions to be
executed to the back end
•The back end is concerned only with the execution of
instructions
•Code that has *too many* branches can cause the processor to
fetch more instructions than required and affect performance
Branches
• Branches are predicted well in advance because resolving the branch condition takes time; waiting for it would
introduce a bubble in the pipeline and slow down execution
• POWER9 has an advanced branch predictor that uses complex structures to track context-based
branch histories and does a very good job of predicting them accurately. However, applications
with complex control flow can still cause high misprediction rates
• A wrong prediction is called a misprediction
– Counters to detect this: PM_BR_MPRED*, PM_FLUSH_BR_MPRED
– Use perf stat -e r<code> ./program arguments > out to collect various counters
• Function calls also introduce branches; such branches affect instruction cache locality and
increase instruction cache misses
– Counters to detect this: PM_L1_ICACHE_MISS
• Branches within loops hinder vectorization/SIMD opportunities
Guidelines to reduce branches
• Options to reduce loop/call branches
– Unrolling loops (reduces loop branches): #pragma unroll(N) or -qunroll (XL); -funroll-loops (GCC/LLVM)
– Inlining routines (reduces function call jump/return): -qinline=auto:level=<N>, N = 1..10 (XL)
– Corresponding GCC/LLVM compiler option: -finline-functions
• Loop versioning: a slow version of the loop (that contains branches) plus a fast version (that
does not contain branches); usually done automatically by compilers at higher levels
of optimization
• Provide hints in source code to indicate the expected values of expressions appearing
in branch conditions, i.e. whether a branch is more likely to be taken or not:
long __builtin_expect(long expression, long value);
• If-conversion: remove simple branches wherever possible with coding patterns such as
– if(val!=0) a=a+val; becomes a+=val;
– if(val==0) a=a+1; becomes a+=(!val);
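The hinting and if-conversion guidelines above can be sketched in C as follows. This is a minimal illustration, assuming GCC or Clang (both provide __builtin_expect); the function names and the UNLIKELY macro are hypothetical, and whether the compiler actually emits branch-free code depends on the optimization level and target.

```c
#include <stddef.h>

/* Hypothetical convenience macro around the GCC/Clang builtin. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Branchy form: one data-dependent conditional per element. */
long count_zeros_branchy(const int *v, size_t n)
{
    long zeros = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] == 0)
            zeros = zeros + 1;
    }
    return zeros;
}

/* If-converted form: (!v[i]) is 1 when v[i] == 0 and 0 otherwise,
 * so the conditional increment becomes plain arithmetic. */
long count_zeros_branchless(const int *v, size_t n)
{
    long zeros = 0;
    for (size_t i = 0; i < n; i++)
        zeros += !v[i];
    return zeros;
}

/* Hinted form: tell the compiler that negative inputs are rare,
 * so the common path can be laid out as the fall-through. */
long sum_checked(const int *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(v[i] < 0))
            return -1;          /* rare error path */
        sum += v[i];
    }
    return sum;
}
```

Comparing PM_BR_MPRED* counts for the branchy and branchless versions on unpredictable input is one way to check whether the rewrite pays off for a given workload.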
Register Spills
• In a RISC architecture, instructions predominantly operate on
registers
– Load and store instructions are used to transfer data between memory and registers
• When #live variables > #available registers, a spill is performed
• 1 spill = 1 store + 1 load
• *Spilling hot variables can hurt performance*
– Spills can cause Load-Hit-Stores (a store followed by a load from the same
address, which may cause a delay in the pipe depending on the
distance separating them)
– Spills increase path length and add address arithmetic instructions
– Unnecessary reads/writes to memory
• Issues due to spills can be detected with the following counters: PM_LSU_FIN,
PM_LSU_FLUSH, PM_LSU_REJECT_LHS, PM_INST_CMPL,
PM_FXU_FIN
Guidelines to reduce spills
• Limit extensive unrolling/inlining, which can lengthen the live ranges of variables
– Best to leave inlining to the compiler's own heuristics
• XL compiler option: -qcompact can help
• Programs that mix operand types extensively (signed, unsigned, etc.) use up
extra registers for conversions
• Use other register resources such as SIMD registers if applicable; use vectorization wherever
applicable, or code such that the compiler vectorizes automatically
• Use special POWER ISA instructions such as andc (logical AND with complement) and orc (logical OR
with complement), which combine multiple logical operations in a single instruction and save a register;
compilers usually generate these when -mcpu=power9 (GCC/LLVM) or -qarch=pwr9 (XL) is used
• Example: R3 = R1 & ~R2
– Without andc: R4 = not(R2); R3 = R1 and R4
– With andc: R3 = R1 andc R2
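The AND-with-complement pattern above can be written in C as shown below. This is a small sketch with hypothetical function names; whether the compiler actually selects andc or orc for these expressions depends on the compiler, optimization level and target flags, so the generated assembly should be inspected to confirm.

```c
#include <stdint.h>

/* Clears in 'value' every bit that is set in 'mask'.
 * The expression value & ~mask is a candidate for a single andc. */
uint64_t and_complement(uint64_t value, uint64_t mask)
{
    return value & ~mask;
}

/* Sets in 'value' every bit that is clear in 'mask'.
 * The expression value | ~mask is a candidate for a single orc. */
uint64_t or_complement(uint64_t value, uint64_t mask)
{
    return value | ~mask;
}
```

Writing the complement and the AND/OR in one expression, rather than storing the complement in a named temporary that is reused elsewhere, keeps the pattern easy for the compiler to recognize.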
Memory Unit
• Memory is organized in a hierarchy
• L1 cache: closest memory to the processor and the fastest, followed by L2 and L3, up to
main memory
• Main memory is the most distant from the processor and the slowest
• Data cache : stores data, instruction cache: stores instructions
• Data cache misses can stall load instructions in the pipeline causing a cascading
effect on all those instructions dependent on it
• Counters- PM_LD_MISS_L1, PM_CMPLU_STALL_DCACHE_MISS, PM_ST_MISS_L1,
PM_CMPLU_STALL_DMISS_L2L3, PM_CMPLU_STALL_DMISS_LMEM etc
• Access latency: L1 $ (3 cyc), L2 $ (15.5 cyc), L3 $ (35.5 cyc), Memory (74.5 ns)
Techniques to optimize memory performance
• Memory footprint reduction wherever possible
– If you have enums declared in your program, using -qenum=small can allocate as little as
one byte per enum instead of the 4 bytes allocated by default
– Replace bytemaps (1 byte to store a '0' or a '1') with bitmaps wherever possible
• Hardware prefetching
– Controlled by DSCR settings
– ppc64_cpu --dscr=<n>
– Common DSCR configurations
• 0 (all default values)
• 0x1D7 (Achieve most aggressive depth, most quickly, enable stride N prefetch)
• 1 (no prefetch)
• POWER8 tuning guide has a detailed description of DSCR settings
• Software prefetching
– Programmer-inserted prefetch instructions: __dcbt, __dcbtst
– Prefetch parameters can be tuned with -qprefetch=aggressive:dscr=<value>
– Available GCC prefetch options: -fprefetch-loop-arrays/-fno-prefetch-loop-arrays
– If you want to explicitly control prefetching via software, you can turn off
hardware prefetching using the ppc64_cpu --dscr command (requires root privileges)
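The bytemap-to-bitmap replacement mentioned under memory footprint reduction can be sketched as follows. The type names, function names and NUM_FLAGS are hypothetical and chosen only for illustration; the point is that the bitmap stores the same information in one eighth of the space, so more of it fits in cache.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_FLAGS 1024u   /* hypothetical number of flags */

/* Bytemap: one byte per flag. */
typedef struct { uint8_t flag[NUM_FLAGS]; } bytemap_t;

/* Bitmap: one bit per flag, packed into 64-bit words. */
typedef struct { uint64_t word[NUM_FLAGS / 64]; } bitmap_t;

/* Clears every flag in the bitmap. */
void bitmap_clear_all(bitmap_t *b)
{
    memset(b, 0, sizeof *b);
}

/* Sets flag i (i must be less than NUM_FLAGS). */
void bitmap_set(bitmap_t *b, size_t i)
{
    b->word[i / 64] |= (uint64_t)1 << (i % 64);
}

/* Returns 1 if flag i is set, 0 otherwise. */
int bitmap_test(const bitmap_t *b, size_t i)
{
    return (int)((b->word[i / 64] >> (i % 64)) & 1u);
}
```

With these definitions, sizeof(bytemap_t) is 1024 bytes while sizeof(bitmap_t) is 128 bytes. The trade-off is a shift and a mask on every access, so the change is worth measuring (for example with PM_LD_MISS_L1) rather than assuming.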
Compiler flags by kind (XL option, GCC/LLVM option, source-level equivalent, benefit, drawbacks)
• Unrolling
– XL: -qunroll
– GCC/LLVM: -funroll-loops
– Can be simulated in source: #pragma unroll(N)
– Benefit: unrolls loops; increases scheduling opportunities for the compiler
– Drawbacks: increases register pressure
• Inlining
– XL: -qinline=auto:level=N
– GCC/LLVM: -finline-functions
– Can be simulated in source: always_inline attribute or manual inlining
– Benefit: increases opportunities for scheduling; reduces branches and loads/stores
– Drawbacks: increases register pressure; increases code size
• Enum small
– XL: -qenum=small
– GCC/LLVM: -fshort-enums
– Can be simulated in source: manual typedef
– Benefit: reduces memory footprint
– Drawbacks: can cause alignment issues
• isel instructions
– GCC/LLVM: -misel
– Can be simulated in source: using the ?: operator
– Benefit: generates an isel instruction instead of a branch; reduces pressure on the branch predictor unit
– Drawbacks: latency of isel is a bit higher; use if branches are not easily predictable
• General tuning
– XL: -qarch=pwr9, -qtune=pwr9
– GCC/LLVM: -mcpu=power9, -mtune=power9
– Benefit: turns on platform-specific tuning such as ISA and scheduling
• 64-bit compilation
– XL: -q64
– GCC/LLVM: -m64
• Prefetching
– XL: -qprefetch[=aggressive]
– GCC/LLVM: -fprefetch-loop-arrays
– Can be simulated in source: __dcbt/__dcbtst, __builtin_prefetch
– Benefit: reduces cache misses
– Drawbacks: can increase memory traffic, particularly if prefetched values are not used
• Link time optimization
– XL: -qipo
– GCC/LLVM: -flto, -flto=thin
– Benefit: enables interprocedural optimizations
– Drawbacks: can increase overall compilation time
• Profile directed feedback
– XL: -qpdf1, -qpdf2
– GCC/LLVM: -fprofile-generate and -fprofile-use (LLVM has an intermediate step, llvm-profdata)
– Benefit: enables hot path optimizations
– Drawbacks: requires a training run
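As an illustration of the source-level prefetching entry, the sketch below uses __builtin_prefetch (available in GCC and Clang) on an indirect access pattern, where the next address is not a simple stride. The function name and PREFETCH_DISTANCE are hypothetical, and the distance is an assumption that would need tuning by measurement on the target system.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* hypothetical; tune by measurement */

/* Sums data[index[i]] for i in [0, n), requesting the element that
 * will be needed PREFETCH_DISTANCE iterations later. */
long sum_indirect(const long *data, const int *index, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Second argument 0 = prefetch for read; third argument 1 =
         * low temporal locality. The bounds check keeps index[] reads
         * inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[index[i + PREFETCH_DISTANCE]], 0, 1);
        sum += data[index[i]];
    }
    return sum;
}
```

As the drawbacks entry notes, prefetches that are never used only add memory traffic, so the effect is best confirmed with counters such as PM_LD_MISS_L1 before and after the change.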
Hands-On Reference
Summary
• Today we talked about
– Various performance issues that can occur in an application on POWER9 Linux
– How to identify them
– What we can do to improve performance during compilation
– What we can do to improve performance while coding the application itself
• We saw that POWER9 has a comprehensive set of hardware counters that enable
analysts to understand the performance of applications and get to the bottlenecks quickly
• We saw that IBM XL compilers and open source compilers such as GCC and LLVM
offer a diverse set of options tailored to different needs to get the required performance
References
• POWER9 User Manual: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual
• IBM XL Compiler reference: http://www-01.ibm.com/support/docview.wss?uid=swg27036675
• POWER9 raw event codes (install libpfm): https://github.com/torvalds/linux/blob/master/arch/powerpc/perf/power9-events-list.h
• GCC 9.2 manual: https://devdocs.io/gcc~9/
• LLVM manual: https://llvm.org/docs/CommandGuide/
More Related Content

What's hot

IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
Ganesan Narayanasamy
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
Ganesan Narayanasamy
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Ganesan Narayanasamy
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
Ganesan Narayanasamy
 
WML OpenPOWER presentation
WML OpenPOWER presentationWML OpenPOWER presentation
WML OpenPOWER presentation
Ganesan Narayanasamy
 
Ac922 cdac webinar
Ac922 cdac webinarAc922 cdac webinar
Ac922 cdac webinar
Ganesan Narayanasamy
 
2018 bsc power9 and power ai
2018   bsc power9 and power ai 2018   bsc power9 and power ai
2018 bsc power9 and power ai
Ganesan Narayanasamy
 
SNAP MACHINE LEARNING
SNAP MACHINE LEARNINGSNAP MACHINE LEARNING
SNAP MACHINE LEARNING
Ganesan Narayanasamy
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
Ganesan Narayanasamy
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
Ganesan Narayanasamy
 
BSC LMS DDL
BSC LMS DDL BSC LMS DDL
BSC LMS DDL
Ganesan Narayanasamy
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
Anand Haridass
 
MIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platformMIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platform
Ganesan Narayanasamy
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
inside-BigData.com
 
EXTENT-2017: Heterogeneous Computing Trends and Business Value Creation
EXTENT-2017: Heterogeneous Computing Trends and Business Value CreationEXTENT-2017: Heterogeneous Computing Trends and Business Value Creation
EXTENT-2017: Heterogeneous Computing Trends and Business Value Creation
Iosif Itkin
 
AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group
Ganesan Narayanasamy
 
Ac922 watson 180208 v1
Ac922 watson 180208 v1Ac922 watson 180208 v1
Ac922 watson 180208 v1
IBM Sverige
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
Ganesan Narayanasamy
 
Programming Models for Exascale Systems
Programming Models for Exascale SystemsProgramming Models for Exascale Systems
Programming Models for Exascale Systems
inside-BigData.com
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
Ganesan Narayanasamy
 

What's hot (20)

IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systems
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
 
WML OpenPOWER presentation
WML OpenPOWER presentationWML OpenPOWER presentation
WML OpenPOWER presentation
 
Ac922 cdac webinar
Ac922 cdac webinarAc922 cdac webinar
Ac922 cdac webinar
 
2018 bsc power9 and power ai
2018   bsc power9 and power ai 2018   bsc power9 and power ai
2018 bsc power9 and power ai
 
SNAP MACHINE LEARNING
SNAP MACHINE LEARNINGSNAP MACHINE LEARNING
SNAP MACHINE LEARNING
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
 
BSC LMS DDL
BSC LMS DDL BSC LMS DDL
BSC LMS DDL
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 
MIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platformMIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platform
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
 
EXTENT-2017: Heterogeneous Computing Trends and Business Value Creation
EXTENT-2017: Heterogeneous Computing Trends and Business Value CreationEXTENT-2017: Heterogeneous Computing Trends and Business Value Creation
EXTENT-2017: Heterogeneous Computing Trends and Business Value Creation
 
AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group
 
Ac922 watson 180208 v1
Ac922 watson 180208 v1Ac922 watson 180208 v1
Ac922 watson 180208 v1
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
Programming Models for Exascale Systems
Programming Models for Exascale SystemsProgramming Models for Exascale Systems
Programming Models for Exascale Systems
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 

Similar to OpenPOWER Webinar

OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization
Ganesan Narayanasamy
 
13 risc
13 risc13 risc
13 risc
dilip kumar
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
HPCC Systems
 
RISC.ppt
RISC.pptRISC.ppt
RISC.ppt
AmarDura2
 
13 risc
13 risc13 risc
13 risc
Anwal Mirza
 
Embedded systems-unit-1
Embedded systems-unit-1Embedded systems-unit-1
Embedded systems-unit-1
Prabhu Mali
 
Top schools in gudgao
Top schools in gudgaoTop schools in gudgao
Top schools in gudgao
Edhole.com
 
HPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance OptimizationHPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance Optimization
inside-BigData.com
 
Top schools in gudgao
Top schools in gudgaoTop schools in gudgao
Top schools in gudgao
Edhole.com
 
13 risc
13 risc13 risc
OpenCAPI next generation accelerator
OpenCAPI next generation accelerator OpenCAPI next generation accelerator
OpenCAPI next generation accelerator
Ganesan Narayanasamy
 
How to Measure RTOS Performance
How to Measure RTOS Performance How to Measure RTOS Performance
How to Measure RTOS Performance
mentoresd
 
Top schools in noida
Top schools in noidaTop schools in noida
Top schools in noida
Edhole.com
 
Reduced instruction set computers
Reduced instruction set computersReduced instruction set computers
Reduced instruction set computers
Syed Zaid Irshad
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Center
inside-BigData.com
 
Basics of micro controllers for biginners
Basics of  micro controllers for biginnersBasics of  micro controllers for biginners
Basics of micro controllers for biginners
Gerwin Makanyanga
 
Motivation for multithreaded architectures
Motivation for multithreaded architecturesMotivation for multithreaded architectures
Motivation for multithreaded architectures
Young Alista
 
High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer Architecture
Subhasis Dash
 
The sunsparc architecture
The sunsparc architectureThe sunsparc architecture
The sunsparc architecture
Taha Malampatti
 
Processors selection
Processors selectionProcessors selection
Processors selection
Pradeep Shankhwar
 

Similar to OpenPOWER Webinar (20)

OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization
 
13 risc
13 risc13 risc
13 risc
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
RISC.ppt
RISC.pptRISC.ppt
RISC.ppt
 
13 risc
13 risc13 risc
13 risc
 
Embedded systems-unit-1
Embedded systems-unit-1Embedded systems-unit-1
Embedded systems-unit-1
 
Top schools in gudgao
Top schools in gudgaoTop schools in gudgao
Top schools in gudgao
 
HPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance OptimizationHPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance Optimization
 
Top schools in gudgao
Top schools in gudgaoTop schools in gudgao
Top schools in gudgao
 
13 risc
13 risc13 risc
13 risc
 
OpenCAPI next generation accelerator
OpenCAPI next generation accelerator OpenCAPI next generation accelerator
OpenCAPI next generation accelerator
 
How to Measure RTOS Performance
How to Measure RTOS Performance How to Measure RTOS Performance
How to Measure RTOS Performance
 
Top schools in noida
Top schools in noidaTop schools in noida
Top schools in noida
 
Reduced instruction set computers
Reduced instruction set computersReduced instruction set computers
Reduced instruction set computers
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Center
 
Basics of micro controllers for biginners
Basics of  micro controllers for biginnersBasics of  micro controllers for biginners
Basics of micro controllers for biginners
 
Motivation for multithreaded architectures
Motivation for multithreaded architecturesMotivation for multithreaded architectures
Motivation for multithreaded architectures
 
High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer Architecture
 
The sunsparc architecture
The sunsparc architectureThe sunsparc architecture
The sunsparc architecture
 
Processors selection
Processors selectionProcessors selection
Processors selection
 

More from Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
Ganesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
Ganesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
Ganesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
Ganesan Narayanasamy
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
Ganesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
Ganesan Narayanasamy
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
Ganesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
Ganesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
Ganesan Narayanasamy
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
Ganesan Narayanasamy
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
Ganesan Narayanasamy
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
Ganesan Narayanasamy
 
Robustness in deep learning
Robustness in deep learningRobustness in deep learning
Robustness in deep learning
Ganesan Narayanasamy
 
Perspectives of Frond end Design
Perspectives of Frond end DesignPerspectives of Frond end Design
Perspectives of Frond end Design
Ganesan Narayanasamy
 
A2O Core implementation on FPGA
A2O Core implementation on FPGAA2O Core implementation on FPGA
A2O Core implementation on FPGA
Ganesan Narayanasamy
 
OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction
Ganesan Narayanasamy
 
Open Hardware and Future Computing
Open Hardware and Future ComputingOpen Hardware and Future Computing
Open Hardware and Future Computing
Ganesan Narayanasamy
 
AI/Cloud Technology access
AI/Cloud Technology access AI/Cloud Technology access
AI/Cloud Technology access
Ganesan Narayanasamy
 

More from Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 
Robustness in deep learning
Robustness in deep learningRobustness in deep learning
Robustness in deep learning
 
Perspectives of Frond end Design
Perspectives of Frond end DesignPerspectives of Frond end Design
Perspectives of Frond end Design
 
A2O Core implementation on FPGA
A2O Core implementation on FPGAA2O Core implementation on FPGA
A2O Core implementation on FPGA
 
OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction
 
Open Hardware and Future Computing
Open Hardware and Future ComputingOpen Hardware and Future Computing
Open Hardware and Future Computing
 
AI/Cloud Technology access
AI/Cloud Technology access AI/Cloud Technology access
AI/Cloud Technology access
 



OpenPOWER Webinar

  • 1. POWER9 Features and Strategies for improving Application Performance on POWER9 with IBM XL and Open Source compilers
      Archana Ravindar
      LLVM Compiler Performance (POWER Systems Performance), ISDL
      aravind5@in.ibm.com
      https://in.linkedin.com/in/archana-ravindar-0259625b
  • 2. Scope of the Presentation
      • Review POWER9 processor features
      • Outline common bottlenecks caused by certain program characteristics
      • How to identify these issues using tools on POWER9 Linux
      • Which compiler options can be used to reduce the impact of these characteristics
      • How to code programs so that such situations do not arise
      • Platform: POWER Linux
      • Compilers: XL, and GCC wherever applicable
      • Performance tools: perf
  • 3. POWER Processor Technology Roadmap
      • POWER7 (1H10), 45 nm, Enterprise
          – 8 Cores
          – SMT4
          – eDRAM L3 Cache
      • POWER7+ (2H12), 32 nm, Enterprise
          – 2.5x Larger L3 cache
          – On-die acceleration
          – Zero-power core idle state
      • POWER8 Family (1H14 – 2H16), 22 nm, Enterprise & Big Data Optimized
          – Up to 12 Cores
          – SMT8
          – CAPI Acceleration
          – High Bandwidth GPU Attach
      • POWER9 Family (2H17 – 2H18+), 14 nm, Built for the Cognitive Era
          – Enhanced Core and Chip Architecture Optimized for Emerging Workloads
          – Processor Family with Scale-Up and Scale-Out Optimized Silicon
          – Premier Platform for Accelerated Computing
  • 4. POWER9 Family – Deep Workload Optimizations
      • Emerging Analytics, AI, Cognitive
          – New core for stronger thread performance
          – Delivers 2x compute resource per socket
          – Built for acceleration: OpenPOWER solution enablement
      • Technical / HPC
          – Highest bandwidth GPU attach
          – Advanced GPU/CPU interaction and memory sharing
          – High bandwidth direct attach memory
      • Cloud / HSDC
          – Power / Packaging / Cost optimizations for a range of platforms
          – Superior virtualization features: security, power management, QoS, interrupt
          – State of the art IO technology for network and storage performance
      • Enterprise
          – Large, flat, Scale-Up Systems
          – Buffered memory for maximum capacity
          – Leading RAS
          – Improved caching
  • 5. POWER9 Core Execution Slice Microarchitecture
      • Modular Execution Slices: Re-factored Core Provides Improved Efficiency & Workload Alignment
          – Enhanced pipeline efficiency with modular execution and intelligent pipeline control
          – Increased pipeline utilization with symmetric data-type engines: Fixed, Float, 128b, SIMD
          – Shared compute resource optimizes data-type interchange
      • Figure labels: 64b Slice, 128b Super-slice, POWER8 SMT8 Core, POWER9 SMT8 Core, POWER9 SMT4 Core
  • 6. POWER9 Core Pipeline Efficiency: Shorter Pipelines with Reduced Disruption
      • Improved application performance for modern codes
          – Shorten fetch to compute by 5 cycles
          – Advanced branch prediction
      • Higher performance and pipeline utilization
          – Improved instruction management
          – Removed instruction grouping and reduced cracking
          – Complete up to 128 (64 – SMT4 Core) instructions per cycle
      • Reduced latency and improved scalability
          – Local pipe control of load/store operations
          – Improved hazard avoidance
          – Local recycles: reduced hazard disruption
          – Improved lock management
  • 7. POWER ISA v3.0: New Instruction Set Architecture Implemented on POWER9
      • Broader data type support
          – 128-bit IEEE 754 Quad-Precision Float: full width quad-precision for financial and security applications
          – Expanded BCD and 128b Decimal Integer: for database and native analytics
          – Half-Precision Float Conversion: optimized for accelerator bandwidth and data exchange
      • Support Emerging Algorithms
          – Enhanced Arithmetic and SIMD
          – Random Number Generation Instruction
      • Accelerate Emerging Workloads
          – Memory Atomics: for high scale data-centric applications
          – Hardware Assisted Garbage Collection: optimize response time of interpretive languages
      • Cloud Optimization
          – Enhanced Translation Architecture: optimized for Linux
          – New Interrupt Architecture: automated partition routing for extreme virtualization
          – Enhanced Accelerator Virtualization
          – Hardware Enforced Trusted Execution
      • Energy & Frequency Management
          – POWER9 Workload Optimized Frequency: manage energy between threads and cores with reduced wakeup latency
  • 8. Acceleration Super Highway
      • 5.6x more data throughput vs. PCIe Gen3 with NVIDIA NVLink optimization to the core
      • 2x bandwidth with PCIe Gen4 vs. PCIe Gen3
      • Access up to 2TB of system memory delivered with coherence … only on POWER!
      • Superior data transfer to multiple devices (figure: 25G links to OpenCAPI and GPU devices)
      • GPU-to-CPU and GPU-to-GPU speed-up
  • 9. Scope of the Compiler
      • The compiler is an important layer in the system stack and is crucial for application performance
      • The compiler is intimately aware of the processor design; its functionality is implemented with the latencies of the hardware units and the movement of instructions within the pipe in mind
      • The compiler is designed to emit the appropriate ISA depending on which architecture a program is compiled for
      • Scheduling is done based on the architecture to ensure a smooth flow of instructions through the pipe
      • IBM XL is a proprietary compiler that pioneered several optimization innovations over the past three decades
      • Increasingly, IBM has embraced open source compilers such as GCC and LLVM to leverage community participation and innovation
      • This presentation focuses on how we can leverage IBM XL and open source compilers to obtain optimum performance on POWER9
  • 10. Tools that we use in the Discussion
      • Compilers
          – IBM proprietary compilers: xlC/xlc/xlf
              – xlc -O[n] program.c -o program (n ranges from 0 to 5)
              – Some common options: -qhot (array-intensive programs), -qtune=pwr9, -qsimd (enable SIMD), etc.
              – Profile directed feedback (-qpdf1, -qpdf2)
          – Open source compilers: GCC, LLVM
              – -O[n]: n ranges from 0 to 3, plus -Ofast
              – Common options: -mcpu=power9
              – Profile directed feedback (-fprofile-generate, -fprofile-use)
      • Perf tool
          – To record hotspots/profile an application
              • perf record -e r<code> ./binary args > out (produces perf.data)
              • perf report (opens the profile report stored in perf.data)
          – To measure hardware events
              • perf stat -e r<code> ./binary args > out
          – For more details, refer to the perf man page
  • 11. A processor can be thought of as having two components: a front end and a back end
      • The front end ensures a smooth supply of instructions to the back end
      • The back end is concerned only with the execution of instructions
      • Code that has *too many* branches can cause the processor to fetch more instructions than required, which hurts performance
  • 12. Branches
      • Branches are predicted well in advance, because resolving the condition takes time and would otherwise introduce a bubble in the pipeline, slowing down execution
      • POWER9 has an advanced branch predictor that uses complex structures to track context-based branch histories and generally predicts branches accurately. However, applications with complex control flow can still cause a high number of mispredictions
      • A wrong prediction is called a misprediction
          – Counters to detect this: PM_BR_MPRED*, PM_FLUSH_BR_MPRED
          – Use perf stat -e r<code> ./program arguments > out to collect various counters
      • Function calls also introduce branches; such branches affect instruction cache locality and increase instruction cache misses
          – Counters to detect this: PM_L1_ICACHE_MISS
      • Branches within loops hinder vectorization/SIMD opportunities
  • 13. Guidelines to reduce branches
      • Options to reduce loop/call branches
          – Unrolling loops (reduces loop branches): #pragma unroll(N) or -qunroll (XL); -funroll-loops (GCC/LLVM)
          – Inlining routines (reduces function call jump/return): -qinline=auto:level=<N>, N = 1..10 (XL)
          – Corresponding GCC/LLVM compiler option: -finline-functions
      • Loop versioning: a slow version of the loop (that contains branches) plus a fast version (that does not contain branches); usually done automatically by compilers at higher levels of optimization
      • Provide hints in source code to indicate the expected values of expressions appearing in branch conditions, that is, whether a branch is more likely to be taken or not: long __builtin_expect(long expression, long value);
      • If-conversion: remove simple branches wherever possible with coding patterns such as
          – if (val != 0) a = a + val;   becomes   a += val;
          – if (val == 0) a = a + 1;     becomes   a += (!val);
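The hint and if-conversion guidelines on this slide can be sketched in C as follows. This is a minimal illustration, not code from the presentation: the function names and the `likely`/`unlikely` macros are hypothetical, and it assumes a compiler that provides `__builtin_expect` (GCC and Clang do). Whether the compiler actually emits branch-free code for the second function depends on the compiler and optimization level.

```c
#include <stddef.h>

/* Hypothetical convenience macros wrapping the builtin named on the slide. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Branchy version: one conditional branch per element. */
long count_zeros_branchy(const int *v, size_t n)
{
    long zeros = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] == 0)
            zeros = zeros + 1;
    }
    return zeros;
}

/* If-converted version: (!v[i]) evaluates to 1 when v[i] == 0 and to 0
 * otherwise, so the conditional increment becomes plain arithmetic. */
long count_zeros_branchless(const int *v, size_t n)
{
    long zeros = 0;
    for (size_t i = 0; i < n; i++)
        zeros += !v[i];
    return zeros;
}

/* Hinted version: tells the compiler that negative inputs are expected
 * to be rare, so the common path can be laid out as the fall-through. */
long sum_nonnegative(const int *v, size_t n, int *saw_negative)
{
    long sum = 0;
    *saw_negative = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(v[i] < 0)) {
            *saw_negative = 1;   /* rare path: skip the element */
            continue;
        }
        sum += v[i];
    }
    return sum;
}
```

One way to check the effect is to compare a misprediction counter such as PM_BR_MPRED* with perf stat for the branchy and branchless builds on representative input.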
  • 14. Register Spills
      • In a RISC architecture, instructions predominantly operate on registers
          – Load and store instructions are used to transfer data between memory and registers
      • When #live variables > #available registers, a spill is performed
      • 1 spill = 1 store + 1 load
      • *Spilling hot variables can hit performance*
          – Spills can cause Load Hit Stores (a store followed by a load to the same address, which may cause a delay in the pipe depending on the distance separating them)
          – Spills increase path length and add address arithmetic instructions
          – Unnecessary reads/writes to memory
      • Issues due to spills can be detected with the following counters: PM_LSU_FIN, PM_LSU_FLUSH, PM_LSU_REJECT_LHS, PM_INST_CMPL, PM_FXU_FIN
  • 15. Guidelines to reduce spills
      • Limit extensive unrolling/inlining that can cause long live ranges of variables
          – Best to leave the inlining to the compiler, which uses its own heuristics
      • XL compiler option: -qcompact can help
      • Programs that make extensive use of mixed-mode operands (signed, unsigned, etc.) need conversions that use up extra registers
      • Use other register resources such as SIMD registers if applicable; use vectorization wherever applicable, or code such that the compiler vectorizes automatically
      • Use special POWER ISA instructions such as andc (logical AND with complement) and orc (logical OR with complement), which combine multiple logical operations in a single instruction and save a register; compilers usually generate these instructions when -mcpu=power9 (GCC/LLVM) or -qarch=pwr9 (XL) is used
      • Example: R3 = R1 & !R2
          – Without andc: R4 = not(R2); R3 = R1 and R4
          – With andc: R3 = R1 andc R2
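The andc/orc pattern from this slide corresponds to the following C expressions. This is a small sketch with hypothetical function names. Note that the bitwise complement in C is `~`, whereas the slide writes it as `!`. Whether a given compiler maps these expressions to single andc/orc instructions is an assumption that should be confirmed by inspecting the generated assembly (for example with the -S option).

```c
#include <stdint.h>

/* Clears in `value` every bit that is set in `mask`.
 * The expression value & ~mask is a candidate for a single andc,
 * avoiding a separate "not" and the temporary register it needs. */
uint64_t clear_bits(uint64_t value, uint64_t mask)
{
    return value & ~mask;
}

/* Sets in `value` every bit that is clear in `mask`.
 * The expression value | ~mask is a candidate for a single orc. */
uint64_t set_complement_bits(uint64_t value, uint64_t mask)
{
    return value | ~mask;
}
```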
  • 16. Memory Unit
      • Memory is organized in a hierarchy
      • L1 cache: the closest memory to the processor and the fastest, followed by L2 and L3, up to main memory
      • Main memory is the most distant from the processor and the slowest
      • Data cache: stores data; instruction cache: stores instructions
      • Data cache misses can stall load instructions in the pipeline, with a cascading effect on all instructions that depend on them
      • Counters: PM_LD_MISS_L1, PM_CMPLU_STALL_DCACHE_MISS, PM_ST_MISS_L1, PM_CMPLU_STALL_DMISS_L2L3, PM_CMPLU_STALL_DMISS_LMEM, etc.
      • Latencies shown in the figure: L1 cache (3 cycles), L2 cache (15.5 cycles), L3 cache (35.5 cycles), memory (74.5 ns)
  • 17. Techniques to optimize memory performance
      • Memory footprint reduction wherever possible
          – If you have enums declared in your program, -qenum=small allocates just one byte to enums versus the 4 bytes allocated by default
          – Replace bytemaps (1 byte to store a '0' or a '1') with bitmaps wherever possible
      • Hardware prefetching
          – Controlled by DSCR settings
          – ppc64_cpu --dscr=<n>
          – Common DSCR configurations
              • 0 (all default values)
              • 0x1D7 (achieve most aggressive depth, most quickly, enable stride-N prefetch)
              • 1 (no prefetch)
          – The POWER8 tuning guide has a detailed description of DSCR settings
      • Software prefetching
          – Programmer-inserted prefetch instructions: __dcbt, __dcbtst
          – Prefetch parameters can be tuned: -qprefetch=aggressive:dscr=<value>
          – Available GCC prefetch options: -fprefetch-loop-arrays / -fno-prefetch-loop-arrays
          – If you want to control prefetching explicitly via software, you can turn off hardware prefetching using the ppc64_cpu --dscr command (requires root privileges)
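Two of the techniques on this slide, replacing a bytemap with a bitmap and inserting software prefetches, can be sketched in C as follows. This is an illustrative sketch under stated assumptions: the function names and the `PREFETCH_DISTANCE` value are hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin listed in the options table of this deck, used here in place of the XL `__dcbt` builtin. A suitable prefetch distance has to be found by measurement.

```c
#include <limits.h>
#include <stddef.h>

/* Bitmap: one bit per flag instead of one byte per flag,
 * which shrinks the footprint of a flag array by a factor of CHAR_BIT. */
void bitmap_set(unsigned char *bits, size_t i)
{
    bits[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
}

int bitmap_test(const unsigned char *bits, size_t i)
{
    return (bits[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1;
}

/* Hypothetical prefetch distance, in loop iterations; tune by measurement. */
#define PREFETCH_DISTANCE 16

/* Indirect accesses such as data[idx[i]] are hard for a hardware stream
 * prefetcher to follow, so the element needed PREFETCH_DISTANCE iterations
 * ahead is requested explicitly. */
long sum_indirect(const long *data, const size_t *idx, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        sum += data[idx[i]];
    }
    return sum;
}
```

Counters such as PM_LD_MISS_L1, collected with perf stat before and after the change, indicate whether the prefetches help or only add memory traffic.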
  • 18. Compiler options at a glance
      Flag kind | XL | GCC/LLVM | Can be simulated in source | Benefit | Drawbacks
      Unrolling | -qunroll | -funroll-loops | #pragma unroll(N) | Unrolls loops; increases scheduling opportunities for the compiler | Increases register pressure
      Inlining | -qinline=auto:level=N | -finline-functions | Always-inline attribute or manual inlining | Increases opportunities for scheduling; reduces branches and loads/stores | Increases register pressure; increases code size
      Enum small | -qenum=small | -fshort-enums | Manual typedef | Reduces memory footprint | Can cause alignment issues
      isel instructions | (none listed) | -misel | Using the ?: operator | Generates an isel instruction instead of a branch; reduces pressure on the branch predictor unit | Latency of isel is a bit higher; use if branches are not easily predictable
      General tuning | -qarch=pwr9, -qtune=pwr9 | -mcpu=power9, -mtune=power9 | (none listed) | Turns on platform-specific tuning such as ISA and scheduling | (none listed)
      64-bit compilation | -q64 | -m64 | (none listed) | (none listed) | (none listed)
      Prefetching | -qprefetch[=aggressive] | -fprefetch-loop-arrays | __dcbt/__dcbtst, __builtin_prefetch | Reduces cache misses | Can increase memory traffic, particularly if prefetched values are not used
      Link time optimization | -qipo | -flto, -flto=thin | (none listed) | Enables interprocedural optimizations | Can increase overall compilation time
      Profile directed feedback | -qpdf1, -qpdf2 | -fprofile-generate and -fprofile-use (LLVM has an intermediate step, llvm-profdata) | (none listed) | Enables hot path optimizations | Requires a training run
  • 20. Summary
      • Today we talked about
          – Various performance issues that can occur in an application on POWER9 Linux
          – How to identify them
          – What we can do to improve performance during compilation
          – What we can do to improve performance while coding the application itself
      • We saw that POWER9 has a comprehensive set of hardware counters that enable analysts to understand the performance of applications and get to the bottlenecks quickly
      • We saw that IBM XL compilers, and equivalently open source compilers such as GCC and LLVM, have a diverse set of options tailored to different needs to get the required performance
  • 21. References
      • POWER9 User Manual: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual
      • IBM XL Compiler reference: http://www-01.ibm.com/support/docview.wss?uid=swg27036675
      • POWER9 raw event codes (install libpfm): https://github.com/torvalds/linux/blob/master/arch/powerpc/perf/power9-events-list.h
      • GCC 9.2 manual: https://devdocs.io/gcc~9/
      • LLVM manual: https://llvm.org/docs/CommandGuide/

Editor's Notes

  1. Memory enhancements, advances in graphics processing units (GPUs), interconnects, and bandwidth all provide building blocks for a better performing AI architecture. In fact, the POWER9 AC922 marks what will become an industry requirement: welcome to the “off-chip” era (where advanced accelerators like GPUs and FPGAs are engineered to drive modern workloads) and the sunset of the “totally on-chip” era where processing is integrated on a single chip. POWER9 is the first commercial architecture loaded with NVIDIA’s next generation NVLink (the AC922’s optimization isn’t just GPU to GPU like other commercial platforms; it also includes GPU to CPU, where it’s needed the most), OpenCAPI, and PCI-Express 4.0. Think of these technologies as a giant hose to transfer data. This slide takes a deeper look at what we mean when we say “Cutting Edge” and built for Enterprise AI. The AC922 combined with NVIDIA Next Generation NVLink technology provides 5.6x more data throughput than PCIe Gen3. And since this server comes with PCIe Gen4, it should be noted that Gen4 delivers 2x the bandwidth of PCIe Gen3. Finally, the server delivers simplified execution for Enterprise AI with up to 2 TB of coherent memory for use in complex model building.