OpenPOWER Application Optimization
SCOPE OF THE PRESENTATION
• Outline tuning strategies to improve the performance of programs on POWER9 processors
• Performance bottlenecks can arise in the processor front end and back end
• Let's discuss some of these bottlenecks and how we can work around them using compiler flags and
source code pragmas/attributes
• This talk refers to compiler options supported by open-source compilers such as GCC. The latest
publicly available version is 9.2.0, which is what we will use for the hands-on. Most of it carries
over to LLVM as-is, and a slight variation works with IBM proprietary compilers such as XL
POWER9 PROCESSOR
• Optimized for Stronger Thread Performance and Efficiency
• Increased Execution Bandwidth efficiency for a range of workloads including commercial,
cognitive and analytics
• Sophisticated instruction scheduling and branch prediction for unoptimized applications and
interpretive languages
• Shorter Pipelines with reduced disruption
• Improved Application Performance for Modern Codes
• Higher Performance and Pipeline Utilization
• Removed instruction grouping
• Enhanced instruction fusion
• Pipeline can complete up to 128 (64-SMT4) instructions/cycle
• Reduced Latency and Improved Scalability
• Improved pipe control of load/store instructions
• Improved hazard avoidance
FORMAT OF TODAY'S DISCUSSION
Brief presentation on optimization strategies
Followed by hands-on exercises
Initial steps -
> ssh -l student<n> orthus.nic.uoregon.edu
> ssh gorgon
Once you have a home directory, make a directory with your name within /home/student<n>
> mkdir /home/student<n>/<yourname>
Copy the following files into it
> cp -rf /home/users/gansys/archana/Handson .
You will see the following directories within Handson/
Task1/
Task2/
Task3/
Task4/
During the course of the presentation we will discuss the exercises inline and you can try them on the machine
PERFORMANCE TUNING IN THE FRONT-END
• The front end fetches and decodes successive instructions and passes them to the back end for
processing
• POWER9 is a superscalar, pipelined processor, so it works with an advanced branch predictor to
predict the sequence and fetch instructions in advance
• We have call branches and loop branches
• Typically we use the following strategies to work around bottlenecks seen around branches
(a minimal C sketch follows this list) -
• Unrolling and inlining using pragmas/attributes, or manually in source if the compiler does not
do it automatically
• Converting control to data dependence using ?: and compiling with -misel for difficult-to-
predict branches
• Dropping hints using __builtin_expect(var, value) to simplify the compiler's scheduling
• Indirect call promotion to enable more inlining
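Below is a minimal, illustrative C sketch of two of these front-end techniques. It is not taken from the Jacobi sources; the function and variable names are hypothetical. It shows a branch hint with __builtin_expect and a ?: rewrite that, when compiled with -misel, allows the compiler to emit an isel rather than a hard-to-predict conditional branch.

#include <stddef.h>

/* Illustrative only: hypothetical helper, not part of the Jacobi application. */
long sum_positive(const long *a, size_t n, int rare_error)
{
    long sum = 0;

    /* Hint that the error path is unlikely, so the compiler keeps the
       hot path on the fall-through and schedules it first. */
    if (__builtin_expect(rare_error != 0, 0))
        return -1;

    for (size_t i = 0; i < n; i++) {
        /* Data dependence instead of control dependence: with -misel the
           compiler may emit an isel here instead of a conditional branch. */
        sum += (a[i] > 0) ? a[i] : 0;
    }
    return sum;
}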
PERFORMANCE TUNING IN THE BACK-END
• The back end is concerned with executing the instructions that were fetched and
dispatched to the appropriate units
• The compiler's scheduling pass automatically tries to keep dependent instructions far
apart from each other
• Tuning back-end performance involves optimal usage of processor resources. We can
tune the performance using the following (a short sketch of the -fshort-enums effect follows this list) -
• Registers - using instructions that reduce register usage; vectorization, to reduce
pressure on GPRs and ensure more throughput; making loops as free of pointers
and branches as possible to enable more vectorization
• Caches - data layout optimizations that reduce footprint, using -fshort-enums,
and prefetching (hardware and software)
• System tuning - parallelization, binding, large pages, optimized libraries
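As a small example of the cache-footprint point above, here is a minimal, self-contained C sketch of what -fshort-enums (or XL's -qenum=small) does; the enum and program are made up for illustration.

#include <stdio.h>

/* Minimal sketch of the data-footprint effect of -fshort-enums: an enum whose
   values fit in one byte can then be stored in 1 byte instead of 4, shrinking
   arrays and structs that contain it. */
enum color { RED, GREEN, BLUE };

int main(void)
{
    /* Prints 4 by default on most ABIs; typically 1 when built with -fshort-enums. */
    printf("sizeof(enum color) = %zu\n", sizeof(enum color));
    return 0;
}

The per-object saving is small, but it compounds across large arrays or structures of enums, which is where the cache-footprint benefit comes from.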
STRUCTURE OF THE HANDS-ON EXERCISES
• All the hands-on exercises work on the Jacobi application
• The application has two versions - poisson2d_reference (referred to as
poisson2d_serial in Task4) and poisson2d
• In order to showcase an optimization's impact, poisson2d is optimized and
poisson2d_reference is minimally optimized to a baseline level, and the performance
of the two routines is compared
• The application internally measures the time and prints the speedup
• The higher the speedup, the greater the impact of the optimization in focus
• For the hands-on we work with the GCC (9.2.0) and PGI (19.10) compilers
• Solutions are provided in the Solutions/ folder within each of the Task directories
TASK1: BASIC COMPILER FLAGS
• Here poisson2d_reference.c is optimized at the O3 level
• The user needs to optimize poisson2d.c at the Ofast level
• Build and run the application poisson2d
• What is the speedup you observe, and why?
• You can generate a perf profile using perf record -e cycles ./poisson2d
• Running perf report will show you the top routines, and you can compare the
performance of poisson2d_reference and poisson2d to get an idea
TASK2: SW PREFETCHING
• Now that we saw Ofast improve performance beyond O3, let's optimize
poisson2d_reference at Ofast and see if we can improve it further
• The user needs to optimize poisson2d with the SW prefetching flag
• Build and run the application
• What is the speedup you observe?
• Verify whether SW prefetching instructions have been added
• Grep for dcbt in the objdump output
TASK3: OPENMP PARALLELIZATION
• The Jacobi application is highly parallel
• We can parallelize it using OpenMP pragmas and measure the speedup
• The source file has OpenMP pragmas in comments
• Uncomment them, build with the OpenMP option -fopenmp, and link with -lgomp
• Run with multiple threads and note the speedup (a representative pragma sketch follows this list)
• OMP_NUM_THREADS=4 ./poisson2d
• OMP_NUM_THREADS=16 ./poisson2d
• OMP_NUM_THREADS=32 ./poisson2d
• OMP_NUM_THREADS=64 ./poisson2d
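The actual pragmas to uncomment are in the task's source file; the fragment below is only a representative sketch, with hypothetical array names and layout, of the kind of loop-level OpenMP parallelization a Jacobi sweep uses.

#include <math.h>

/* Representative sketch of a Jacobi sweep parallelized with OpenMP.
   Array names (A, Anew) and the row-major indexing are hypothetical,
   not the task's actual code. */
double jacobi_sweep(double *restrict Anew, const double *restrict A, int nx, int ny)
{
    double error = 0.0;
    #pragma omp parallel for reduction(max:error)
    for (int j = 1; j < ny - 1; j++) {
        for (int i = 1; i < nx - 1; i++) {
            Anew[j * nx + i] = 0.25 * (A[j * nx + (i + 1)] + A[j * nx + (i - 1)]
                                     + A[(j + 1) * nx + i] + A[(j - 1) * nx + i]);
            error = fmax(error, fabs(Anew[j * nx + i] - A[j * nx + i]));
        }
    }
    return error;
}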
TASK3.1: OPENMP PARALLELIZATION
• Running lscpu you will see Thread(s) per core: 4
• You will see the setting SMT=4 on the system; you can verify by running
ppc64_cpu --smt on the command line
• Run cat /proc/cpuinfo to determine the total number of threads and cores in the system
• Obtain the thread sibling list of CPU0, CPU1, etc. by reading the file
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list (e.g. 0-3)
• Referring to the sibling list, set n1, .., n4 to threads in the same core and run, for example -
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{1},{2},{3}"
OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Set n1, .., n4 to threads in different cores and run, for example -
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{5},{9},{13}"
OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Compare speedups; which one is higher?
TASK3.2: IMPACT OF BINDING
• Running lscpu you will see Thread(s) per core: 4
• You will see the setting SMT=4 on the system; you can verify by running
ppc64_cpu --smt on the command line
• Run cat /proc/cpuinfo to determine the total number of threads and cores in the system
• Obtain the thread sibling list of CPU0, CPU1, etc. by reading the file
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list (e.g. 0-3)
• Referring to the sibling list, set n1, .., n4 to threads in the same core and run, for example -
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{1},{2},{3}"
OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Set n1, .., n4 to threads in different cores and run, for example -
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{5},{9},{13}"
OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Compare speedups; which one is higher?
TASK4: ACCELERATE USING GPUS
• You can attempt this after the lecture on GPUs
• The Jacobi application contains a large set of parallelizable loops
• poisson2d.c contains commented OpenACC pragmas which should be
uncommented, built with the appropriate flags, and run on an accelerated platform
• #pragma acc parallel loop
• In case you want to refer to the solution, see poisson2d.solution.c
• You can compare the speedup by running poisson2d without the pragmas and then
running poisson2d.solution (a representative OpenACC sketch follows this list)
• For more information you can refer to the Makefile
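For orientation, here is a representative sketch of what such an OpenACC-annotated Jacobi sweep looks like; the array names and function are hypothetical, and the real pragmas are the commented ones in poisson2d.c.

/* Representative sketch of an OpenACC-annotated Jacobi sweep; array names
   are hypothetical. Built with something like pgcc -acc -ta=tesla:cc70,managed,
   the loop nest is offloaded to the GPU and managed memory handles data movement. */
void jacobi_sweep_acc(double *restrict Anew, const double *restrict A, int nx, int ny)
{
    #pragma acc parallel loop collapse(2)
    for (int j = 1; j < ny - 1; j++) {
        for (int i = 1; i < nx - 1; i++) {
            Anew[j * nx + i] = 0.25 * (A[j * nx + (i + 1)] + A[j * nx + (i - 1)]
                                     + A[(j + 1) * nx + i] + A[(j - 1) * nx + i]);
        }
    }
}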
TASK1: BASIC COMPILER FLAGS - SOLUTION
– This hands-on exercise illustrates the impact of the Ofast flag
– Ofast enables the -ffast-math option, which implements the same math functions in a way
that does not require the guarantees of the IEEE/ISO rules or specification, and so avoids
the overhead of calling a function from the math library
– If you look at the perf profile, you will observe that poisson2d_reference makes a call to
fmax
– Whereas poisson2d.c::main() of poisson2d generates native instructions such as
xvmax, as it is optimized at Ofast
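As a minimal sketch of the pattern described above (not the actual Jacobi code): at O3 the fmax() call below remains a libm call, while at Ofast the -ffast-math semantics allow GCC to inline it and vectorize the loop with native max instructions such as xvmax.

#include <math.h>

/* Minimal sketch of the pattern behind Task1; hypothetical function, not the
   actual Jacobi code. The fmax/fabs reduction is the part that -ffast-math
   lets the compiler turn into native vector max instructions. */
double max_abs_diff(const double *a, const double *b, int n)
{
    double error = 0.0;
    for (int i = 0; i < n; i++)
        error = fmax(error, fabs(a[i] - b[i]));
    return error;
}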
TASK2: SW PREFETCHING - SOLUTION
– Compiling with a prefetch flag enables the compiler to analyze the code and insert __dcbt and __dcbtst
instructions into the code where it judges them beneficial
– __dcbt and __dcbtst prefetch memory values into L3; __dcbt is for loads and __dcbtst is for stores
– POWER9 has prefetching enabled both at the HW and SW levels
– At the HW level, prefetching is "ON" by default
– At the SW level, you can request the compiler to insert prefetch
instructions; however, the compiler can choose to ignore the
request if it determines that it is not beneficial to do so
– You will find that the compiler generates prefetch instructions when the application is compiled at the Ofast level
but not when it is compiled at the O3 level
– That is because in the O3 binary the time is dominated by the __fmax call, which leads the compiler to conclude
that whatever benefit we obtain by adding SW prefetch will be overshadowed by the penalty of fmax
– GCC may add further loop optimizations such as unrolling upon invocation of -fprefetch-loop-arrays
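If you want to experiment with prefetching from source rather than via the flag, the following is a minimal C sketch using GCC's __builtin_prefetch; the function, arrays, and the prefetch distance of 64 elements are illustrative guesses, not tuned values.

#include <stddef.h>

/* Minimal sketch: manual software prefetch with GCC's __builtin_prefetch.
   The prefetch distance (64 elements ahead) is an illustrative guess;
   -fprefetch-loop-arrays asks the compiler to do this kind of insertion for you. */
void scale(double *restrict dst, const double *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&src[i + 64], 0, 3); /* read access, keep in cache */
        dst[i] = 2.0 * src[i];
    }
}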
TASK3.1: OPENMP PARALLELIZATION - SOLUTION
• Running the OpenMP parallel version, you will see speedups with increasing OMP_NUM_THREADS
• [student02@gorgon Task3]$ OMP_NUM_THREADS=1 ./poisson2d
• 1000x1000: Ref: 2.3467 s, This: 2.5508 s, speedup: 0.92
• [student02@gorgon Task3]$ OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3309 s, This: 0.6394 s, speedup: 3.65
• [student02@gorgon Task3]$ OMP_NUM_THREADS=16 ./poisson2d
• 1000x1000: Ref: 2.3309 s, This: 0.6394 s, speedup: 4.18
• Likewise, if you bind threads across different cores you will see a greater speedup
• [student02@gorgon Task3]$ OMP_PLACES="{0},{1},{2},{3}" OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3490 s, This: 1.9622 s, speedup: 1.20
• [student02@gorgon Task3]$ OMP_PLACES="{0},{5},{10},{15}" OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3694 s, This: 0.6735 s, speedup: 3.52
TASK4: ACCELERATE USING GPUS - SOLUTION
• Building and running poisson2d as it is, you will see no speedup
• [student02@gorgon Task4]$ make poisson2d
• /opt/pgi/linuxpower/19.10/bin/pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d_serial.c -o poisson2d_serial.o
• /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.c poisson2d_serial.o -o poisson2d
• [student02@gorgon Task4]$ ./poisson2d
• ….
• 2048x2048: 1 CPU: 5.0743 s, 1 GPU: 4.9631 s, speedup: 1.02
• If you build poisson2d.solution, which is the same as poisson2d.c but with the OpenACC pragmas enabled, and run it on the platform,
the parallel portions are pushed to the GPU and you will see a massive speedup
• [student02@gorgon Task4]$ make poisson2d.solution
• /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution
• [student02@gorgon Task4]$ ./poisson2d.solution
• 2048x2048: 1 CPU: 5.0941 s, 1 GPU: 0.1811 s, speedup: 28.13
SUMMARY
• Today we talked about
• Tuning strategies pertaining to the various units in the POWER9 HW -
front end and back end
• Some of these strategies were compiler flags and source code pragmas that
one can apply to see improved performance of their programs
• We also saw additional ways of improving performance such as parallelization,
binding, etc.
• Hopefully the associated hands-on exercises gave you a more practical experience
in applying these concepts to optimizing an application
Disclaimer: This presentation is intended to represent the views of the author rather than IBM, and the recommended solutions are not guaranteed
under suboptimal conditions
BACKUP
• Vector register layout: 4 32-bit words, 8 half-words, 16 bytes
Flag kind | XL | GCC/LLVM | Can be simulated in source | Benefit | Drawbacks
Unrolling | -qunroll | -funroll-loops | #pragma unroll(N) | Unrolls loops; increases scheduling opportunities for the compiler | Increases register pressure
Inlining | -qinline=auto:level=N | -finline-functions | always_inline attribute or manual inlining | Increases opportunities for scheduling; reduces branches and loads/stores | Increases register pressure; increases code size
Enum small | -qenum=small | -fshort-enums | manual typedef | Reduces memory footprint | Can cause alignment issues
isel instructions | | -misel | Using the ?: operator | Generates isel instructions instead of branches; reduces pressure on the branch predictor unit | Latency of isel is a bit higher; use if branches are not easily predictable
General tuning | -qarch=pwr9, -qtune=pwr9 | -mcpu=power9, -mtune=power9 | | Turns on platform-specific tuning |
64-bit compilation | -q64 | -m64 | | |
Prefetching | -qprefetch[=aggressive] | -fprefetch-loop-arrays | __dcbt/__dcbtst, __builtin_prefetch | Reduces cache misses | Can increase memory traffic, particularly if prefetched values are not used
Link time optimization | -qipo | -flto, -flto=thin | | Enables interprocedural optimizations | Can increase overall compilation time
Profile directed | | -fprofile-generate and -fprofile-use (LLVM has an intermediate step) | | |