This document outlines strategies for tuning program performance on POWER9 processors. It discusses how performance bottlenecks can arise in the processor front-end and back-end and describes some compiler flags, pragmas, and source code techniques for addressing these bottlenecks. These include techniques like unrolling, inlining, prefetching, parallelization with OpenMP, and leveraging GPUs with OpenACC. Hands-on exercises are provided to demonstrate applying these optimizations to a Jacobi application and measuring the performance impacts.
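To make these techniques concrete, here is a minimal sketch of the kind of Jacobi sweep the exercises revolve around, parallelized with OpenMP and given an unrolling hint. The grid size N, the array names, and the unroll factor are illustrative assumptions, not the document's actual exercise code:

```c
#define N 1024

/* One Jacobi sweep: each interior point becomes the average of its
 * four neighbours. Compile with -fopenmp (GCC/Clang) or -qsmp=omp (XL). */
void jacobi_sweep(double out[N][N], const double in[N][N])
{
    #pragma omp parallel for        /* split rows across threads */
    for (int i = 1; i < N - 1; i++) {
        #pragma GCC unroll 4        /* unrolling hint (GCC form; XL uses #pragma unroll(4)) */
        for (int j = 1; j < N - 1; j++) {
            out[i][j] = 0.25 * (in[i - 1][j] + in[i + 1][j] +
                                in[i][j - 1] + in[i][j + 1]);
        }
    }
}
```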
Next Generation MPICH: What to Expect - Lightweight Communication and More, by Intel® Software
MPICH is a widely used, open-source implementation of the message passing interface (MPI) standard. It has been ported to many platforms and used by several vendors and research groups as the basis for their own MPI implementations. This session discusses the current development activity with MPICH, including a close collaboration with teams at Intel. We showcase preparing MPICH-derived implementations for deployment on upcoming supercomputers like Aurora (from the Argonne Leadership Computing Facility), which is based on the Intel® Xeon Phi™ processor and Intel® Omni-Path Architecture (Intel® OPA).
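For readers new to MPI, the minimal C program below shows the point-to-point messaging that MPICH implements; every call is part of the standard MPI API, so it runs under MPICH and any MPICH-derived library:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI ping: rank 0 sends one integer to rank 1.
 * Run with at least two ranks, e.g. mpiexec -n 2 ./ping */
int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```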
This document discusses Java compilers and their impact on performance. It explains that Java uses a two-step compilation process to achieve both portability and speed. The first step compiles Java code to bytecode, while the second step just-in-time compiles the bytecode to native machine code. It describes how client-side compilers focus on fast startup times while server-side compilers emphasize long-term optimizations. Tiered compilation combines aspects of both. The document also introduces hotspot compilation, which optimizes frequently executed code sections.
Instruction Level Parallelism and Superscalar Processors, by Syed Zaid Irshad
- Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
- Equally applicable to RISC and CISC
- In practice, usually applied to RISC
IBM XL Compilers Performance Tuning 2016-11-18, by Yaoqing Gao
This document provides an overview of performance tuning with IBM XL C/C++ and Fortran compilers and libraries. It discusses identifying application hot spots and bottlenecks using profiling tools like gprof and perf. It also covers compiler optimization techniques including basic optimizations like inlining and redundancy detection as well as advanced optimizations like interprocedural analysis and whole program optimization. Loop transformations are highlighted as important for improving performance of numerical applications.
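As an illustration of why loop transformations matter, the sketch below shows a loop interchange: both functions copy the same matrix, but the second walks memory with unit stride in C's row-major layout, which is the kind of rewrite the XL compilers' high-order transformations (-qhot) can apply automatically. The array names and size are assumptions for illustration:

```c
#define N 1024
double a[N][N], b[N][N];

/* Column-major walk: consecutive iterations touch addresses N doubles
 * apart, defeating the cache and hardware prefetchers. */
void copy_slow(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = b[i][j];
}

/* Interchanged loops: unit-stride accesses that stream through memory. */
void copy_fast(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = b[i][j];
}
```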
This document discusses instruction pipelining in computer processors. It begins by defining pipelining and explaining how it works like an assembly line to increase throughput. It then discusses different types of pipelines and introduces the MIPS instruction pipeline as an example. The document goes on to explain different types of pipeline hazards like structural hazards, control hazards, and data hazards. It provides examples of how to detect and resolve these hazards through techniques like forwarding, stalling, predicting, and delayed branching. Key concepts covered include pipeline registers, control signals, forwarding units, and branch prediction buffers.
The document discusses superscalar processor architectures. It explains that superscalar architectures allow multiple instructions to be issued and executed simultaneously by utilizing several parallel pipelines. This is different from simple pipelining which can only execute one instruction per clock cycle. Superscalar processors improve performance by issuing instructions out-of-order while ensuring results are identical to sequential execution. However, limitations like resource conflicts, control dependencies, and data dependencies can prevent full parallelism.
This document discusses superscalar and superpipelined approaches to improving processor performance. Superscalar processors execute multiple independent instructions in parallel using multiple pipelines, while superpipelining breaks pipeline stages into smaller sub-stages to shorten the clock period and increase instruction throughput. Superpipelines gain from the extra overlap but also increase potential stalls from dependencies. Both approaches aim to maximize parallel instruction execution, yet both face limits from true data dependencies and other hazards.
This document discusses superscalar processors, which can execute multiple instructions in parallel within a single processor. A superscalar processor improves performance by executing scalar instructions simultaneously. It consists of an instruction dispatch unit that routes decoded instructions to functional units, reservation stations that decouple instruction decoding from execution, and a reorder buffer that stores in-flight instructions and ensures they complete in program order. While superscalar processors can increase performance, they have limitations such as branch delays and complexity that limit scalability.
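A small C example makes the superscalar idea tangible: a single accumulator forms one long dependence chain in which every addition must wait for the previous one, while several independent accumulators give the parallel pipelines independent work to overlap. This sketch is illustrative and not taken from the slides:

```c
/* One dependence chain: each add waits for the previous result. */
double sum_chained(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent chains: a superscalar core can issue these adds
 * in parallel because they carry no dependence on each other. */
double sum_parallel(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```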
Understanding of linux kernel memory model, by SeongJae Park
SeongJae Park introduces himself and his work contributing to the Linux kernel memory model documentation. He developed a guaranteed contiguous memory allocator and maintains the Korean translation of the kernel's memory barrier documentation. The document discusses how the increasing prevalence of multi-core processors requires careful programming to ensure correct parallel execution given relaxed memory ordering. It notes that compilers and CPUs optimize for instruction throughput over programmer goals, and memory accesses can be reordered in ways that affect correctness on multi-processors. Understanding the memory model is important for writing high-performance parallel code.
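The kernel's own primitives are not usable from user space, but the ordering problem the slides describe can be sketched with C11 atomics, a user-space analogue of the kernel's smp_store_release()/smp_load_acquire() pairing:

```c
#include <stdatomic.h>

int payload;          /* data being published */
atomic_int ready;     /* publication flag, zero-initialized */

/* Without the release/acquire pair, the compiler or CPU could reorder
 * the stores, letting the reader observe ready == 1 with a stale payload. */
void writer(void)
{
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int reader(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;             /* spin until the writer publishes */
    return payload;   /* ordering guarantees this reads 42 */
}
```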
The document discusses scalar, superscalar, and superpipelined processors. A scalar processor executes one instruction at a time while a superscalar processor can execute multiple instructions per clock cycle by exploiting instruction-level parallelism. Superpipelined processors have shorter clock cycles than the time required for any operation, allowing them to issue one instruction per cycle but complete instructions faster than a scalar processor.
Superscalar and VLIW architectures can exploit instruction-level parallelism (ILP) by processing multiple instructions simultaneously. There are two main approaches: superscalar processors fetch and execute independent instructions in parallel using dependency checking, while very long instruction word (VLIW) architectures rely on compilers to group independent instructions into single long instructions. List scheduling and trace scheduling are algorithms used to schedule instructions for ILP. Trace scheduling works by identifying common code traces and scheduling basic blocks within the trace together.
Erlang and XMPP can be used together in several ways:
1. Erlang is well-suited for implementing XMPP servers due to its high concurrency and reliability. ejabberd is an example of a popular Erlang XMPP server.
2. The XMPP protocol can be used to connect Erlang applications and allow them to communicate over the XMPP network. Libraries like Jabberlang facilitate writing Erlang XMPP clients.
3. XMPP provides a flexible messaging backbone that can be extended using Erlang modules. This allows Erlang code to integrate with and enhance standard XMPP server functionality.
This document provides an overview of eBPF/BPF and instructions for creating an eBPF program from scratch. It begins by explaining what eBPF/BPF is, its history, and its main ideas. It then covers how to build an eBPF program from the Linux kernel source code, including prerequisites, compilation steps, and modifying the makefile. The document also discusses how to write an eBPF program by hand, analyzing both the program and its loader. It concludes with a quick demo.
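As a rough sketch of what such a from-scratch program can look like today, here is a minimal XDP packet counter in restricted C using libbpf-style conventions; the map name, section names, and the XDP hook are illustrative assumptions, and the document's own example may differ. A program like this is typically compiled with clang -O2 -target bpf and attached by a separate loader:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* One-slot array map holding a packet counter. */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&pkt_count, &key);
    if (val)
        __sync_fetch_and_add(val, 1);   /* atomic increment in BPF */
    return XDP_PASS;                    /* never drop traffic */
}

char LICENSE[] SEC("license") = "GPL";
```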
TensorRT is an NVIDIA tool that optimizes and accelerates deep learning models for production deployment. It performs optimizations like layer fusion, reduced precision from FP32 to FP16 and INT8, kernel auto-tuning, and multi-stream execution. These optimizations reduce latency and increase throughput. TensorRT automatically optimizes models by taking in a graph, performing optimizations, and outputting an optimized runtime engine.
The document discusses superscalar processors and provides details about the Pentium 4 architecture as an example of a superscalar CISC machine. It covers topics such as instruction issue policies, register renaming, branch prediction, and the 20-stage pipeline of the Pentium 4. The Pentium 4 decodes x86 instructions into micro-ops, allocates registers and resources, executes micro-ops out of order, and can dispatch up to 6 micro-ops per cycle to execution units.
An Introduction to the Formalised Memory Model for Linux Kernel, by SeongJae Park
The Linux kernel provides an executable, formalized memory model. These slides describe the nature of parallel programming in the Linux kernel, what a memory model is, and why it is necessary and important for kernel programmers. The slides were used at KOSSCON 2018 (https://kosscon.kr/).
This document introduces programming models for high-performance computing (HPC). It establishes a taxonomy to classify programming models and systems. The main goals are to introduce the current prominent programming models, including message-passing, shared memory, and bulk synchronous models. The document also discusses that there is no single best solution and that there are trade-offs between different approaches. Implementation stacks and hardware architectures are reviewed to provide context on how programming models map to low-level execution.
gcma: guaranteed contiguous memory allocator, by SeongJae Park
This document presents GCMA, a Guaranteed Contiguous Memory Allocator that improves upon the current Contiguous Memory Allocator (CMA) solution in Linux. CMA can have unpredictable latency and even fail when allocating contiguous memory, especially under memory pressure or with background workloads. GCMA guarantees fast latency for contiguous memory allocation, success of allocation, and reasonable memory utilization by using discardable memory as its secondary client instead of movable pages. Experimental results on a Raspberry Pi 2 show that GCMA has significantly faster allocation latency than CMA, keeps camera latency fast even with background workloads, and can improve overall system performance compared to CMA.
Heterogeneous Compute with Standards Based OFI/MPI/OpenMP Programming, by Intel® Software
Discover, extend, and modernize your current development approach for heterogeneous compute with standards-based OpenFabrics Interfaces* (OFI), message passing interface (MPI), and OpenMP* programming methods on Intel® Xeon Phi™ processors.
VLIW (Very Large Instruction Word) is an architecture that aims to achieve high performance through instruction level parallelism (ILP). It allows multiple independent operations to be specified per instruction. Unlike superscalar architectures, all scheduling is done statically by the compiler in VLIW. The compiler analyzes dependencies, extracts parallelism, and encodes parallel instructions into a single very long instruction word to be executed concurrently by the processor. This reduces hardware complexity compared to dynamic scheduling in superscalar chips.
Load and store instructions first generate an effective address, then perform address translation before accessing the data cache for load or store operations. For loads, the cache is read to return data, while stores write data to the cache. Stores are held in the store buffer until retirement to maintain load-store ordering. Loads can bypass and forward from earlier stores in the store buffer to improve performance. Memory dependencies between loads and stores are difficult to handle due to dynamic addresses and long memory latency. Speculative load disambiguation predicts dependencies to allow out-of-order execution when aliases are rare.
The document discusses the Linux kernel memory model (LKMM). It provides an overview of LKMM, including that it defines ordering rules for the Linux kernel due to weaknesses in the C language standard and need to support multiple hardware architectures. It describes ordering primitives like atomic operations and memory barriers provided by LKMM and how the LKMM was formalized into an executable model that can prove properties of parallel code against the LKMM.
Training Slides: Basics 102: Introduction to Tungsten Clustering, by Continuent
This document provides an introduction to Continuent Tungsten clustering. It discusses key benefits like high availability, multi-site deployment, and ease of use. It examines the clustering architecture including topologies, automatic and manual failover, and rolling maintenance procedures. Commands for monitoring and managing the cluster are also reviewed, including cctrl and tpm diag. A demo shows using cctrl to perform a manual failover by promoting a slave to master.
A microprocessor is the electronic component a computer uses to do its work: a central processing unit on a single integrated circuit chip, containing millions of very small components, including transistors, resistors, and diodes, that work together.
A presentation on faster microprocessor design at American International University-Bangladesh (AIUB). The presentation was given for the course "SELECTED TOPICS IN ELECTRICAL AND ELECTRONIC ENGINEERING (PROCESSOR AND DSP HARDWARE DESIGN WITH SYSTEM VERILOG, VHDL AND FPGAS) [MEEE]" by a final-semester M.Sc student at AIUB.
In a VLIW design, the compiler handles the scheduling work, which involves:
- checking dependencies between instructions to determine which instructions can be grouped together for parallel execution;
- assigning instructions to the functional units on the hardware;
- determining when instructions are initiated and packed together into a single long word.
This document discusses superscalar and VLIW architectures. Superscalar processors can execute multiple independent instructions in parallel by checking for dependencies between instructions. VLIW architectures package multiple operations into very long instruction words to execute in parallel on multiple functional units with scheduling done at compile-time rather than run-time. The document compares CISC, RISC, and VLIW instruction sets and outlines advantages and disadvantages of the VLIW approach.
A VLIW processor implements instruction level parallelism by grouping multiple operations into a single very long instruction word. The compiler statically schedules independent instructions to execute in parallel on functional units. This avoids the need for complex hardware to dynamically schedule instructions at runtime. VLIW moves the complexity to the compiler, allowing for simpler hardware that can be lower cost and lower power while achieving higher performance than RISC and CISC chips.
The document discusses strategies for improving application performance on POWER9 processors using IBM XL and open source compilers. It reviews key POWER9 features and outlines common bottlenecks like branches, register spills, and memory issues. It provides guidelines on using compiler options and coding practices to address these bottlenecks, such as unrolling loops, inlining functions, and prefetching data. Tools like perf are also described for analyzing performance bottlenecks.
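Of the practices listed, data prefetching is the least self-explanatory; the sketch below uses the __builtin_prefetch builtin (supported by GCC and Clang, and accepted by recent XL compilers) to pull array elements a fixed distance ahead of the loop. The distance of 64 elements is an assumption that would be tuned by measuring with perf:

```c
/* Scale a vector, prefetching ahead of the current element.
 * __builtin_prefetch arguments: address, rw (1 = write), locality (3 = keep in cache). */
void scale(double *x, long n, double k)
{
    for (long i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&x[i + 64], 1, 3);
        x[i] *= k;
    }
}
```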
Parallelization of Coupled Cluster Code with OpenMP, by Anil Bohare
This document discusses parallelizing a Coupled Cluster Singles and Doubles (CCSD) molecular dynamics application code using OpenMP to reduce its execution time on multi-core systems. Specifically, it identifies compute-intensive loops in the CCSD code for parallelization with OpenMP directives like PARALLEL DO. Performance evaluations show the optimized OpenMP version achieves a 35.66% reduction in wall clock time as the number of cores increases, demonstrating the effectiveness of the parallelization approach. Further improvements could involve a hybrid OpenMP-MPI model.
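The CCSD code itself is Fortran, parallelized with PARALLEL DO; the equivalent pattern in C looks like the sketch below, where the loop body is a stand-in contraction rather than the actual coupled-cluster kernel:

```c
/* C analogue of the Fortran PARALLEL DO pattern: a compute-intensive
 * loop divided across threads, with partial sums combined by reduction. */
double contract(const double *t, const double *v, long n)
{
    double e = 0.0;
    #pragma omp parallel for reduction(+ : e)
    for (long i = 0; i < n; i++)
        e += t[i] * v[i];
    return e;
}
```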
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Environment, by Qualcomm Developer Network
The need to support both today’s multicore performance and tomorrow’s heterogeneous computing has become increasingly important. Qualcomm® Multicore Asynchronous Runtime Environment (MARE) provides powerful and easy-to-use abstractions to write parallel software. This session will provide a deep dive into the concepts of power-efficient programming and how to use Qualcomm MARE APIs to get energy and thermal benefits for Android apps. Qualcomm Multicore Asynchronous Runtime Environment is a product of Qualcomm Technologies, Inc.
Learn more about Qualcomm Multicore Asynchronous Runtime Environment: https://developer.qualcomm.com/MARE
Watch this presentation on YouTube:
https://www.youtube.com/watch?v=RI8yXhBb8Hg
- OpenMP provides compiler directives and library calls to incrementally parallelize applications for shared memory multiprocessor systems. It works by allowing the master thread to spawn worker threads to perform work concurrently using directives like parallel and parallel do.
- Variables in OpenMP can be shared, private, or reduction. Shared variables are accessible by all threads while private variables have a separate copy for each thread. Reduction variables are used to combine values across threads.
- Synchronization is needed to coordinate thread access and ensure correct results. The barrier directive synchronizes threads at the end of parallel regions.
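A short sketch ties the three variable classes together: scale is shared (every thread reads the same copy), tmp is private by virtue of being declared inside the loop, and total is a reduction variable; the implicit barrier at the end of the parallel region guarantees all partial sums have been combined before the print. This example is illustrative, not from the document:

```c
#include <stdio.h>

int main(void)
{
    const double scale = 2.0;   /* shared: read concurrently by all threads */
    double total = 0.0;         /* reduction: per-thread copies summed at the end */

    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < 1000; i++) {
        double tmp = scale * i; /* private: one copy per thread */
        total += tmp;
    }
    /* implicit barrier here: every thread has finished before we continue */
    printf("total = %f\n", total);
    return 0;
}
```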
Algoritmi e Calcolo Parallelo (Algorithms and Parallel Computing) 2012/2013 - OpenMP, by Pier Luca Lanzi
This document provides an introduction to OpenMP, which is an application programming interface (API) used for shared memory multiprocessing programming in C, C++, and Fortran. It discusses key OpenMP concepts like parallel regions, work sharing constructs, data scope clauses, and runtime library routines. The document begins with an overview of OpenMP and its history, goals, programming model and basic elements. It then covers specific OpenMP constructs like parallel, for, sections and single, as well as data scope attributes like private, shared, default and reduction. Runtime functions for querying thread numbers and setting thread counts are also summarized.
Programming parallel computers can be done with shared memory or distributed memory models. Shared memory is easier since it has a single address space, while distributed memory requires managing multiple address spaces and remote data access. The dominant programming model is Single Program Multiple Data (SPMD) where the same code runs on all processors. OpenMP is used for shared memory and MPI is used for distributed memory. They involve directives/calls for parallelization and inter-processor communication. Multi-tiered systems can be programmed with MPI and OpenMP together or MPI alone.
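The multi-tier combination mentioned above is usually written as MPI between nodes and OpenMP within each rank; here is a minimal hybrid sketch, using the standard MPI_Init_thread entry point to request threading support:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```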
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ..., by Spark Summit
Spark is by its nature very fault tolerant. However, faults, and application failures, can and do happen, in production at scale.
In this talk, we’ll discuss the nuts and bolts of fault tolerance in Spark.
We will begin with a brief overview of the sorts of fault tolerance offered, and lead into a deep dive of the internals of fault tolerance. This will include a discussion of Spark on YARN, scheduling, and resource allocation.
We will then spend some time on a case study and discussing some tools used to find and verify fault tolerance issues. Our case study comes from a customer who experienced an application outage that was root caused to a scheduler bug. We discuss the analysis we did to reach this conclusion and the work that we did to reproduce it locally. We highlight some of the techniques used to simulate faults and find bugs.
At the end, we’ll discuss some future directions for fault tolerance improvements in Spark, such as scheduler and checkpointing changes.
This talk was given at GTC16 by James Beyer and Jeff Larkin, both members of the OpenACC and OpenMP committees. It's intended to be an unbiased discussion of the differences between the two languages and the tradeoffs to each approach.
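To give a flavor of what the comparison looks like in source code, here is the same saxpy loop offloaded both ways: OpenACC's descriptive directive next to OpenMP 4.x's prescriptive target construct. This is a sketch of the surface syntax under standard directives, not a statement of the talk's benchmarks:

```c
/* OpenACC: the compiler decides how to map the loop onto the GPU. */
void saxpy_acc(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* OpenMP 4.x: the programmer prescribes the teams/threads decomposition. */
void saxpy_omp(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```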
Apache Spark has rocked the big data landscape, becoming the largest open source big data community with over 750 contributors from more than 200 organizations. Spark's core tenets of speed, ease of use, and a unified programming model fit neatly with the high-performance, scalable, and manageable characteristics of modern Java runtimes. In this talk Tim Ellison, a JVM developer at IBM, shows some of the unique Java 8 capabilities in the JIT compiler, fast networking, serialization techniques, and GPU off-loading that deliver the ultimate big data platform for solving business problems. Tim demonstrates how solutions previously infeasible with regular Java programming become possible with this high-performance Spark core runtime, enabling you to solve problems smarter and faster.
OpenMP and MPI are two common APIs for parallel programming. OpenMP uses a shared memory model where threads have access to shared memory and can synchronize access. It is best for multi-core processors. MPI uses a message passing model where separate processes communicate by exchanging messages. It provides portability and is useful for distributed memory systems. Both have advantages like performance and portability but also disadvantages like difficulty of debugging for MPI. Future work may include improvements to threading support and fault tolerance in MPI.
This talk will provide several examples of how Facebook engineers use BPF to scale the networking, prevent denial of service, secure containers, analyze performance. It’s suitable for BPF newbies and experts.
Alexei Starovoitov, Facebook
This document provides an overview of the OpenMP course, including its objectives, topics covered, and motivation for OpenMP. The course objectives are to introduce the OpenMP standard and equip users to implement OpenMP constructs to realize performance improvements on shared memory machines. The course covers topics such as memory architectures, control constructs, worksharing, data scoping, synchronization, and performance optimization. It aims to explain how OpenMP provides a portable standard for shared memory parallel programming that addresses limitations of proprietary APIs and message passing approaches.
IBM Runtimes Performance Observations with Apache Spark, by Adam Roberts (IBM)
In this talk, presented at the Spark London meetup on the 23rd of November 2016, I detail our findings in IBM's Runtime Technologies department around Apache Spark. I share best practices we observed by profiling Spark on a variety of workloads, and help Spark users profile their own applications. I also touch on how anybody can develop using fast networking capabilities (RDMA) and achieve substantial performance speedups using GPUs.
This document discusses Python web application development. It summarizes popular packages for web development with Flask including SQLAlchemy, Celery, and TensorFlow Model Server. It provides best practices for Flask, Celery, and Docker deployment. It also discusses profiling Python applications and handling signals in Docker containers.
Evaluating GPU programming Models for the LUMI Supercomputer, by George Markomanolis
It is common in the HPC community that the achieved performance with just CPUs is limited for many computational cases. The EuroHPC pre-exascale and the coming exascale systems are mainly focused on accelerators, and some of the largest upcoming supercomputers such as LUMI and Frontier will be powered by AMD Instinct accelerators. However, these new systems create many challenges for developers who are not familiar with the new ecosystem or with the required programming models that can be used to program for heterogeneous architectures. In this paper, we present some of the more well-known programming models to program for current and future GPU systems. We then measure the performance of each approach using a benchmark and a mini-app, test with various compilers, and tune the codes where necessary. Finally, we compare the performance, where possible, between the NVIDIA Volta (V100), Ampere (A100) GPUs, and the AMD MI100 GPU.
Presentation of a paper accepted at Supercomputing Frontiers Asia 2022.
This is a reupload of the talk I delivered at the Spark London Meetup group, November 2016. Original link to the event: https://www.meetup.com/Spark-London/events/235626954/
I share observations and best practices.
A presentation on how, by applying Cloud Architecture Patterns with Docker Swarm as the orchestrator, it is possible to create reliable, resilient, and scalable FIWARE platforms.
The CPU fetches instructions from memory, decodes them, and executes them. It has components like the ALU for arithmetic, registers for temporary storage, and a control unit for coordination. The program counter tracks the location of the next instruction to execute. Linkers incorporate subroutine addresses into programs. DLLs allow programs to access common code libraries to perform tasks like printing without loading the full code. Compilers translate to machine code for faster execution, while interpreters identify errors in real-time but run slower.
This document provides an overview of Apache Spark's architectural components through the life of simple Spark jobs. It begins with a simple Spark application analyzing airline on-time arrival data, then covers Resilient Distributed Datasets (RDDs), the cluster architecture, job execution through Spark components like tasks and scheduling, and techniques for writing better Spark applications like optimizing partitioning and reducing shuffle size.
Similar to OpenPOWER Application Optimization
The document describes a 5-day residency program hosted by the OpenPOWER Academic Discussion Group (ADG) at NIE Mysore from June 6-10, 2022. The program aims to bridge industry and academia knowledge in chip design by developing curriculum on OpenPOWER technology and training lab assistants. Engineers and academicians with 5+ years experience in chip design/verification are eligible to participate. They will collaborate on developing course materials and lab exercises to teach undergraduate students in fields like ECE and CSE. The program seeks to help fulfill India's goals in chip design manpower and self-reliance through initiatives like Make in India and the India Semiconductor Mission.
This document provides an overview of digital design and Verilog. It discusses binary numbers and boolean algebra as the foundation of digital systems. It also describes logic gates, combinational and sequential circuits, finite state machines, and datapath and control units. Finally, it introduces Verilog, describing different modeling types such as gate-level, behavioral, dataflow, and switch-level modeling, and positions Verilog as a hardware description language that makes designing digital circuits easier than drawing them by hand.
The Libre-SOC Project aims to create an entirely Libre-Licensed, transparently-developed fully auditable Hybrid 3D CPU-GPU-VPU, using the Supercomputer-class OpenPOWER ISA as the foundation.
Our first test ASIC is a 180nm "Fixed-Point" Power ISA v3.0B processor, 5.1mm x 5.9mm, as a proof-of-concept for the team, whose primary expertise is in Software Engineering. Software Engineering training brings a radically different approach to Hardware development: extensive unit tests, source code revision control, automated development tools are normal. Libre Project Management brings even more: bug trackers, mailing lists, auditable IRC logs and a wiki are standard fare for Libre Projects that are simply not normal Industry-Standard practice.
This talk therefore goes through the workflow, from the original HDL through to the GDS-II layout, showing how we were able to keep track of the development that led to the IMEC 180nm tape-out in July 2021. In particular, by following a parallel development process involving "Real" and "Symbolic" Cell Libraries developed by Chips4Makers, it will be shown how our developers did not need to sign a Foundry NDA but were still able to work side-by-side with a University that did. With this parallel development process, the University upheld its NDA obligations, and Libre-SOC was simultaneously able to honour its Transparency Objectives.
Workload Transformation and Innovations in POWER Architecture, by Ganesan Narayanasamy
The IT industry is going through two major transformations. The first is the adoption of AI and its tight integration into commercial applications and enterprise workflows. The second is the transformation of software architecture through concepts like microservices and cloud-native design. These transformations, alongside the aggressive adoption of IoT, mobile, and 5G in our day-to-day activities, are making the world operate in a more real-time manner, which opens up a new challenge: improving hardware architecture to meet these requirements. Together they push the boundaries of the entire systems stack, making designers rethink hardware. This talk presents a picture of how the industry-leading enterprise POWER architecture is transforming to fulfill the performance demands of these newer-generation workloads, with a primary focus on on-chip AI acceleration.
Join us on Friday, July 16th, 2021 for our newest workshop with DoMS, IIT Roorkee: Concept to Solutions using the OpenPOWER Stack. It's time to discover advances in #DeepLearning tools and techniques from the world's leading innovators across industry, research, and public speaking.
Register here:
https://lnkd.in/ggxMq2N
This presentation covers two use cases using OpenPOWER systems:
1. Diabetic Retinopathy using AI on NVIDIA Jetson Nano: The objective is to classify the diabetic retinopathy level solely from a retina image in a remote area, with minimal involvement from a doctor. The model uses the VGG16 network architecture and is trained from scratch on POWER9. The model was deployed on the Jetson Nano board.
2. Classifying Covid positivity using lung X-ray images: The idea is to build ML models to detect positive cases from X-ray images. The model was trained on POWER9, and the application was developed using Python.
IBM Bayesian Optimization Accelerator (BOA) is a do-it-yourself toolkit to apply state-of-the-art Bayesian inferencing techniques and obtain optimal solutions for complex, real-world design simulations without requiring deep machine learning skills. This talk will describe IBM BOA, its differentiation and ease of use, and how researchers can take advantage of it for optimizing any arbitrary HPC simulation.
This presentation covers the various partners and collaborators currently working with the OpenPOWER Foundation, use cases of OpenPOWER systems in multiple industries, OpenPOWER workgroups, and OpenCAPI features.
The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
Everything is changing, from healthcare to the automotive and financial markets to every type of engineering: products are no longer created by an individual or, at best, a team, but are developed and perfected using AI and hundreds of computers. Even AI itself is something we can no longer run on a single computer, no matter how powerful it is. What drives everything today is HPC, or High-Performance Computing, heavily linked to AI. In this session we discuss AI, HPC computing, and the IBM Power architecture, and how it can help develop better healthcare, better automobiles, better financials, and better everything that runs on them.
Macromolecular crystallography is an experimental technique for exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While development of the technique has so far been limited by the performance of scientific instruments, computing performance is now becoming a key limitation. In my presentation I will present the computing challenge of handling the 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experiences in applying conventional hardware to the task and why this attempt failed. I will then present how the IC 922 server with OpenCAPI-enabled FPGA boards made it possible to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advances in hardware development will enable better science for users of the Swiss Light Source.
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems, by Ganesan Narayanasamy
As the adoption of AI technologies increases and matures, the focus will shift from exploration to time to market, productivity, and integration with existing workflows. Governing enterprise data, scaling AI model development, and selecting a complete, collaborative hybrid platform and tools for rapid solution deployment are key focus areas for growing data science teams tasked with responding to business challenges. This talk covers the challenges and innovations for AI at scale in industries such as healthcare and automotive, the AI ladder and AI lifecycle, and infrastructure architecture considerations.
This talk gives an introduction to healthcare use cases, the AI ladder, the AI lifecycle, and AI-at-scale themes. It discusses the iterative nature of the workflow and some of the important components to be aware of when developing AI healthcare solutions, as well as the different types of algorithms and when classical machine learning might be more appropriate than deep learning, or the other way around. Example use cases are also shared as part of this presentation.
Healthcare has become one of the most important aspects of everyone's life. Its importance has surged with recent outbreaks, and the latest pandemic has made it imperative to collaborate to improve healthcare for everyone as soon as possible.
IBM has reacted quickly, sharing not only its knowledge but also its Artificial Intelligence supercomputers around the world.
Those supercomputers are helping to overcome this outbreak, as well as future ones.
They have completely different features compared to offerings from other players in the supercomputer market.
We take a quick look at the differences between these AI-focused supercomputers and how they can help in the R&D of healthcare solutions for everyone, from those with access to a big IBM AI supercomputer to those with access to only a single small IBM AI-focused server.
Moving object recognition (MOR) corresponds to the localization and classification of moving objects in videos. Discriminating moving objects from static objects and background in videos is an essential task for many computer vision applications. MOR has widespread applications in intelligent visual surveillance, intrusion detection, anomaly detection and monitoring, industrial site monitoring, detection-based tracking, autonomous vehicles, and more. In this session, Murari presented a poster on deep learning algorithms that identify both the locations and the categories of moving objects with a convolutional network, and discussed the challenges in developing such algorithms.
The document discusses AI in the enterprise, including use cases, infrastructure considerations, and the AI lifecycle. It provides examples of how AI can be applied in various industries and common patterns of analytics using AI. It also outlines the data science model development workflow and considerations for AI infrastructure, software, and data management throughout the AI lifecycle.
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Introduction of Cybersecurity with OSS at Code Europe 2024
OpenPOWER Application Optimization
1.
2. 2
SCOPE OF THE PRESENTATION
• Outline tuning strategies to improve the performance of programs on POWER9 processors
• Performance bottlenecks can arise in the processor front end and back end
• We discuss some of these bottlenecks and how to work around them using compiler flags and source code pragmas/attributes
• This talk refers to compiler options supported by open-source compilers such as GCC. The latest publicly available version is 9.2.0, which is what we will use for the hands-on exercises. Most of this carries over to LLVM as-is; a slight variation works with IBM proprietary compilers such as XL
5. FORMAT OF TODAY'S DISCUSSION
5
Brief presentation on optimization strategies, followed by hands-on exercises
Initial steps -
>ssh -l student<n> orthus.nic.uoregon.edu
>ssh gorgon
Once you have a home directory, make a directory with your name within /home/student<n>
>mkdir /home/student<n>/<yourname>
Then copy the hands-on files into it
> cp -rf /home/users/gansys/archana/Handson .
You will see the following directories within Handson/
Task1/
Task2/
Task3/
Task4/
During the course of the presentation we will discuss the exercises inline, and you can try them on the machine
6. 6
PERFORMANCE TUNING IN THE FRONT-END
• The front end fetches and decodes successive instructions and passes them to the back end for processing
• POWER9 is a superscalar, pipelined processor, so it relies on an advanced branch predictor to predict the instruction sequence and fetch instructions in advance
• Branches include call branches and loop branches
• Typically we use the following strategies to work around bottlenecks seen around branches (see the sketch after this list) -
• Unrolling and inlining via pragmas/attributes, or manually in source (if the compiler does not do it automatically)
• Converting control dependence to data dependence using ?: and compiling with -misel for difficult-to-predict branches
• Dropping hints using __builtin_expect(var, value) to simplify the compiler's scheduling
• Indirect call promotion to enable more inlining
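A minimal C sketch of two of these ideas (illustrative only; the unlikely() macro and max_of() function are hypothetical, not from the Jacobi sources):

#include <stdlib.h>

/* Hint the compiler that a condition is almost always false, so the
   hot path falls through without a taken branch. */
#define unlikely(x) __builtin_expect(!!(x), 0)

int max_of(int err, int a, int b)
{
    if (unlikely(err))          /* rarely-taken branch, kept off the hot path */
        abort();

    /* Data dependence instead of control dependence: with -misel, GCC on
       POWER can emit an isel instruction here instead of a branch. */
    return (a > b) ? a : b;
}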
7. 7
PERFORMANCE TUNING IN THE BACK-END
• Backend is concerned with executing of the instructions that were fetched and
dispatched to the appropriate units
• Compiler takes care of making sure dependent instructions are far from each other
in its scheduling pass automatically
• Tuning backend performance involves optimal usage of Processor
Resources. We can tune the performance using following.
• Registers- using instructions that reduce reg usage, Vectorization /
reducing pressure on GPRs/ ensuring more throughput, Making loops
free of pointers and branches as much as possible to enable more
vectorization
• Caches – data layout optimizations that reduce footprint, using –fshort-
enums, Prefetching – hardware and software
• System Tuning- parallelization, binding, largepages, optimized libraries
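A small hedged illustration of the "loops free of pointers and branches" point (illustrative function, not from poisson2d.c): declaring the array parameters restrict tells the compiler they do not alias, removing the aliasing assumption that would otherwise block vectorization.

/* With 'restrict' the compiler knows x and y do not overlap, so this
   straight-line, branch-free loop vectorizes cleanly (VSX on POWER9). */
void axpy(int n, double a, const double *restrict x, double *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}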
8. 8
STRUCTURE OF THE HANDS-ON EXERCISES
• All the hands-on exercises work on the Jacobi application
• The application has two versions - poisson2d_reference (referred to as poisson2d_serial in Task4) and poisson2d
• In order to showcase the impact of an optimization, poisson2d is optimized while poisson2d_reference is minimally optimized to a baseline level, and the performance of the two routines is compared
• The application internally measures the time and prints the speedup
• The higher the speedup, the higher the impact of the optimization in focus
• For the hands-on we work with the gcc (9.2.0) and pgi (19.10) compilers
• Solutions are indicated in the Solutions/ folder within each of the Task directories
9. 9
TASK1: BASIC COMPILER FLAGS
• Here the poisson2d_reference.c is optimized at O3 level
• The user needs to optimize poisson2d.c with Ofast level
• Build and run the application poisson2d
• What is the speedup you observe and why ?
• You can generate a perf profile using perf record –e cycles ./poisson2d
• Running perf report will show you the top routines and you can compare
performance of poisson2d_reference and poisson2d to get an idea
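A hedged sketch of the steps (assuming the provided Makefile builds a poisson2d target; the exact flags in your Makefile may differ):
> make poisson2d                      # or by hand, e.g. gcc -Ofast poisson2d.c ... -o poisson2d
> ./poisson2d                         # prints Ref/This timings and the speedup
> perf record -e cycles ./poisson2d
> perf report                         # compare time spent in the two routines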
10. 10
TASK2: SW PREFETCHING
• Now that we have seen Ofast improve performance beyond O3, let's optimize poisson2d_reference at Ofast and see if we can improve it further
• The user needs to optimize poisson2d with the software prefetching flag (-fprefetch-loop-arrays)
• Build and run the application
• What speedup do you observe?
• Verify that software prefetch instructions have been added
• Grep for dcbt in the objdump output, as shown below
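For example (assuming the binary is named poisson2d):
> objdump -d poisson2d | grep dcbt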
11. 11
TASK3: OPENMP PARALLELIZATION
• The Jacobi application is highly parallel
• We can parallelize it using OpenMP pragmas and measure the speedup (a sketch of the parallelized loop follows below)
• The source file has the OpenMP pragmas in comments
• Uncomment them, build with the OpenMP option -fopenmp, and link with -lgomp
• Run with multiple threads and note the speedup
• OMP_NUM_THREADS=4 ./poisson2d
• OMP_NUM_THREADS=16 ./poisson2d
• OMP_NUM_THREADS=32 ./poisson2d
• OMP_NUM_THREADS=64 ./poisson2d
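A hedged sketch of what the uncommented pragma typically looks like on a Jacobi sweep (the function and variable names are illustrative, not necessarily those in poisson2d.c):

#include <math.h>

/* One Jacobi iteration over an nx-by-ny grid, parallelized across rows.
   The max-reduction collects the largest per-point change as the error. */
double jacobi_sweep(int nx, int ny, const double *restrict A, double *restrict Anew)
{
    double error = 0.0;
    #pragma omp parallel for reduction(max:error)
    for (int ix = 1; ix < nx - 1; ix++) {
        for (int iy = 1; iy < ny - 1; iy++) {
            double v = 0.25 * (A[(ix+1)*ny + iy] + A[(ix-1)*ny + iy]
                             + A[ix*ny + (iy+1)] + A[ix*ny + (iy-1)]);
            Anew[ix*ny + iy] = v;
            error = fmax(error, fabs(v - A[ix*ny + iy]));
        }
    }
    return error;
}

Build as on this slide: gcc -Ofast -fopenmp ... -lgomp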
12. 12
TASK3.1/TASK3.2: IMPACT OF BINDING
• Running lscpu you will see Thread(s) per core: 4
• The system is configured with SMT=4; you can verify this by running ppc64_cpu --smt on the command line
• Run cat /proc/cpuinfo to determine the total number of threads and cores in the system
• Obtain the thread sibling list of CPU0, CPU1, etc. by reading the file /sys/devices/system/cpu/cpu0/topology/thread_siblings_list (e.g. 0-3)
• Referring to the sibling list, set the four places to threads in the same core and run, for example -
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{1},{2},{3}" OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Then set the four places to threads in different cores and run, for example -
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{5},{9},{13}" OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Compare the speedups; which one is higher?
14. 14
TASK4: ACCELERATE USING GPUS
• You can attempt this after the lecture on GPUs
• Jacobi application contains a large set of parallelizable loops
• Poisson2d.c contains commented openACC pragmas which should be
uncommented, built with appropriate flags and run on an accelerated platform
• #pragma acc parallel loop
• In case you want to refer to Solution - poisson2d.solution.c
• You can compare the speedup by running poisson2d without the pragmas and
running the poisson2d.solution
• For more information you can refer to the Makefile
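A hedged sketch of the same Jacobi sweep with OpenACC (illustrative names again; the authoritative pragmas are the commented ones in poisson2d.c):

#include <math.h>

/* The parallel loop pragma offloads the sweep to the GPU; with
   -ta=tesla:...,managed, unified memory handles the data movement. */
double jacobi_sweep_acc(int nx, int ny, const double *restrict A, double *restrict Anew)
{
    double error = 0.0;
    #pragma acc parallel loop reduction(max:error)
    for (int ix = 1; ix < nx - 1; ix++) {
        for (int iy = 1; iy < ny - 1; iy++) {
            double v = 0.25 * (A[(ix+1)*ny + iy] + A[(ix-1)*ny + iy]
                             + A[ix*ny + (iy+1)] + A[ix*ny + (iy-1)]);
            Anew[ix*ny + iy] = v;
            error = fmax(error, fabs(v - A[ix*ny + iy]));
        }
    }
    return error;
}

Build as in the Task4 Makefile: pgcc -fast -acc -ta=tesla:cc70,managed ...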
15. 15
TASK1: BASIC COMPILER FLAGS - SOLUTION
– This hands-on exercise illustrates the impact of the Ofast flag
– Ofast enables the -ffast-math option, which lets the compiler implement math functions in a way that does not guarantee IEEE/ISO conformance, avoiding the overhead of calling into the math library
– If you look at the perf profile, you will observe that poisson2d_reference makes a call to fmax
– Whereas main() in poisson2d.c generates native instructions such as xvmax, because it is optimized at Ofast (see the illustration below)
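The pattern involved is the error reduction typical of a Jacobi solver; a hedged illustration (not the literal poisson2d.c source):

error = fmax(error, fabs(Anew[ix*ny + iy] - A[ix*ny + iy]));

At O3 this compiles to a libm call to fmax; at Ofast, -ffast-math lets GCC inline it to a native POWER9 max instruction (for example xvmaxdp in vectorized code), removing the call overhead.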
16. 16
TASK2: SW PREFETCHING - SOLUTION
– Compiling with a prefetch flag enables the compiler to analyze the code and insert dcbt and dcbtst instructions where it deems them beneficial
– dcbt and dcbtst prefetch memory values into L3; dcbt is for loads and dcbtst is for stores
– POWER9 has prefetching enabled both at the HW and SW levels
– At the HW level, prefetching is "ON" by default
– At the SW level, you can request the compiler to insert prefetch instructions; however, the compiler can choose to ignore the request if it determines that doing so is not beneficial (a manual alternative is sketched below)
– You will find that the compiler generates prefetch instructions when the application is compiled at the Ofast level, but not when it is compiled at the O3 level
– That is because in the O3 binary the time is dominated by the fmax call, which leads the compiler to conclude that whatever benefit we obtain from SW prefetch would be overshadowed by the penalty of fmax
– GCC may add further loop optimizations such as unrolling upon invocation of -fprefetch-loop-arrays
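Where the compiler declines, you can request prefetches by hand with the GCC builtin; a minimal sketch assuming a unit-stride loop (illustrative function, not the Jacobi code):

/* __builtin_prefetch(addr, rw, locality): rw 0 = read, 1 = write.
   On POWER these lower to dcbt (read) / dcbtst (write). */
void scale(int n, double *restrict dst, const double *restrict src)
{
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&src[i + 64], 0, 3);   /* prefetch ahead for the load */
        __builtin_prefetch(&dst[i + 64], 1, 3);   /* prefetch ahead for the store */
        dst[i] = 2.0 * src[i];
    }
}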
17. 17
TASK3: OPENMP PARALLELIZATION - SOLUTION
• Running the OpenMP parallel version, you will see speedups with an increasing OMP_NUM_THREADS
• [student02@gorgon Task3]$ OMP_NUM_THREADS=1 ./poisson2d
• 1000x1000: Ref: 2.3467 s, This: 2.5508 s, speedup: 0.92
• [student02@gorgon Task3]$ OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3309 s, This: 0.6394 s, speedup: 3.65
• [student02@gorgon Task3]$ OMP_NUM_THREADS=16 ./poisson2d
• 1000x1000: Ref: 2.3309 s, This: 0.6394 s, speedup: 4.18
• Likewise, if you bind threads across different cores you will see a greater speedup
• [student02@gorgon Task3]$ OMP_PLACES="{0},{1},{2},{3}" OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3490 s, This: 1.9622 s, speedup: 1.20
• [student02@gorgon Task3]$ OMP_PLACES="{0},{5},{10},{15}" OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3694 s, This: 0.6735 s, speedup: 3.52
18. 18
TASK4: ACCELERATE USING GPUS - SOLUTION
• Building and running poisson2d as-is, you will see no speedup
• [student02@gorgon Task4]$ make poisson2d
• /opt/pgi/linuxpower/19.10/bin/pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d_serial.c -o poisson2d_serial.o
• /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.c poisson2d_serial.o -o poisson2d
• [student02@gorgon Task4]$ ./poisson2d
• ...
• 2048x2048: 1 CPU: 5.0743 s, 1 GPU: 4.9631 s, speedup: 1.02
• If you build poisson2d.solution (the same as poisson2d.c but with the OpenACC pragmas enabled) and run it on the accelerated platform, the parallel portions are pushed to the GPU and you will see a massive speedup
• [student02@gorgon Task4]$ make poisson2d.solution
• /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution
• [student02@gorgon Task4]$ ./poisson2d.solution
• 2048x2048: 1 CPU: 5.0941 s, 1 GPU: 0.1811 s, speedup: 28.13
24. 24
[Figure: 128-bit vector register layout - 4 32-bit words, 8 half-words, 16 bytes]
25. 25

Flag kind | XL | GCC/LLVM | Can be simulated in source | Benefit | Drawbacks
Unrolling | -qunroll | -funroll-loops | #pragma unroll(N) | Unrolls loops; increases scheduling opportunities for the compiler | Increases register pressure
Inlining | -qinline=auto:level=N | -finline-functions | always_inline attribute or manual inlining | Increases scheduling opportunities; reduces branches and loads/stores | Increases register pressure; increases code size
Enum small | -qenum=small | -fshort-enums | Manual typedef | Reduces memory footprint | Can cause alignment issues
isel instructions | | -misel | Using the ?: operator | Generates isel instructions instead of branches; reduces pressure on the branch predictor unit | Latency of isel is a bit higher; use when branches are not easily predictable
General tuning | -qarch=pwr9, -qtune=pwr9 | -mcpu=power9, -mtune=power9 | | Turns on platform-specific tuning |
64-bit compilation | -q64 | -m64 | | |
Prefetching | -qprefetch[=aggressive] | -fprefetch-loop-arrays | __dcbt/__dcbtst, __builtin_prefetch | Reduces cache misses | Can increase memory traffic, particularly if prefetched values are not used
Link time optimization | -qipo | -flto, -flto=thin | | Enables interprocedural optimizations | Can increase overall compilation time
Profile directed feedback | | -fprofile-generate and -fprofile-use (LLVM has an intermediate step) | | |
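Two of the source-level equivalents from the table, as a small hedged sketch (GCC syntax; GCC 8 and later spell the loop pragma "#pragma GCC unroll N", and the typedef merely mimics -fshort-enums for a single type):

#include <stddef.h>

/* Manual small-enum stand-in for -fshort-enums: 1 byte instead of 4. */
typedef unsigned char color_t;            /* values 0..255 only */

double sum4(const double *a, size_t n)
{
    double s = 0.0;
    #pragma GCC unroll 4                  /* source-level unrolling request */
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}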