This talk was presented at the DOE Centers of Excellence Performance Portability Workshop in August 2017. In this talk I explore the current status of four OpenMP 4.5 compilers for NVIDIA GPUs and CPUs from the perspective of performance portability between compilers and between the GPU and CPU.
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
1. Jeff Larkin, August 2017 DOE Performance Portability Workshop
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs
2. Background
Since the last performance portability workshop, several OpenMP implementations for NVIDIA GPUs have emerged or matured.
As of August 2017, can these implementations deliver on performance, portability, and performance portability?
• Will OpenMP Target code be portable between compilers?
• Will OpenMP Target code be portable with the host?
I will compare results using four compilers: CLANG, Cray, GCC, and XL.
8/23/2017
3. OpenMP In Clang
Multi-vendor effort to implement OpenMP in Clang (including offloading).
Runtime based on the open-sourced runtime from Intel.
Current status: much improved since last year!
Version used: clang/20170629
Compiler Options:
-O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=$CUDA_HOME
4. OpenMP In Cray
Due to its experience with OpenACC, Cray's OpenMP 4.x compiler was the first to market for NVIDIA GPUs.
Observation: it does not adhere to OpenMP as strictly as the others.
Version used: 8.5.5
Compiler Options: none required
Note: Cray performance results were obtained on an x86 + P100 system, unlike the other compilers. Only GPU performance is being compared.
5. OpenMP In GCC
Open-source GCC compiler with support for OpenMP offloading to NVIDIA GPUs.
Runtime also based on the open-sourced runtime from Intel.
Current status: mature on the CPU, very immature on the GPU.
Version used: 7.1.1 20170718 (experimental)
Compiler Options:
-O3 -fopenmp -foffload="-lm"
6. OpenMP In XL
IBM's compiler suite, which now includes offloading to NVIDIA GPUs.
Same(ish) runtime as CLANG, but compilation by IBM's compiler.
Version used: xl/20170727-beta
Compiler Options:
-O3 -qsmp -qoffload
14. Increasing Parallelism
Currently both our distributed and our workshared parallelism come from the same loop.
• We could collapse them together.
• We could move the PARALLEL to the inner loop.
The COLLAPSE(N) clause:
• Turns the next N loops into one, linearized loop.
• This will give us more parallelism to distribute, if we so choose.
15. Collapse
#pragma omp target teams distribute parallel for reduction(max:error) map(error) collapse(2)
for( int j = 1; j < n-1; j++)
{
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
#pragma omp target teams distribute parallel for collapse(2)
for( int j = 1; j < n-1; j++)
{
  for( int i = 1; i < m-1; i++ )
  {
    A[j][i] = Anew[j][i];
  }
}
Collapse the two loops into one and then parallelize this new loop across both teams and threads.
16. Execution Time (Smaller is Better)
[Bar chart: execution time in seconds, broken down into Data, Kernels, and Other]
CLANG: 1.490654, Cray: 1.820148, GCC: 41.812337, XL: 3.706288
Systems: CLANG, GCC, XL: IBM "Minsky" with NVIDIA Tesla P100; Cray: Cray XC-40 with NVIDIA Tesla P100
17. Execution Time (Smaller is Better)
[Bar chart: the same data as the previous slide, rescaled with GCC omitted]
CLANG: 1.490654, Cray: 1.820148, XL: 3.706288
Systems: CLANG, XL: IBM "Minsky" with NVIDIA Tesla P100; Cray: Cray XC-40 with NVIDIA Tesla P100
18. Splitting Teams & Parallel
#pragma omp target teams distribute map(error)
for( int j = 1; j < n-1; j++)
{
  #pragma omp parallel for reduction(max:error)
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
  #pragma omp parallel for
  for( int i = 1; i < m-1; i++ )
  {
    A[j][i] = Anew[j][i];
  }
}
Distribute the "j" loop over teams. Workshare the "i" loop over threads.
19. Execution Time (Smaller is Better)
[Bar chart: execution time in seconds, broken down into Data, Kernels, and Other]
CLANG: 2.30662, Cray: 1.94593, GCC: 49.474303, GCC simd: 12.261814, XL: 14.559997
Systems: CLANG, GCC, XL: IBM "Minsky" with NVIDIA Tesla P100; Cray: Cray XC-40 with NVIDIA Tesla P100
20. Execution Time (Smaller is Better)
[Bar chart: the same data as the previous slide, rescaled with plain GCC omitted]
CLANG: 2.30662, Cray: 1.94593, GCC simd: 12.261814, XL: 14.559997
Systems: CLANG, XL: IBM "Minsky" with NVIDIA Tesla P100; Cray: Cray XC-40 with NVIDIA Tesla P100
22. Fallback to the Host Processor
Most OpenMP users would like to write one set of directives for host and device, but is this really possible?
Using the "if" clause, offloading can be enabled/disabled at runtime.
#pragma omp target teams distribute parallel for reduction(max:error) map(error) collapse(2) if(target:use_gpu)
for( int j = 1; j < n-1; j++)
{
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
The compiler must build both CPU and GPU codes and select between them at runtime.
23. Host Fallback vs. Host Native OpenMP
[Bar chart: host-fallback performance as a percentage of the reference CPU-threaded version (y-axis 0–120%), for the Teams, Collapse, and Split variants under each of CLANG, Cray, GCC, and XL]
Systems: CLANG, GCC, XL: IBM "Minsky" with NVIDIA Tesla P100; Cray: Cray XC-40 with NVIDIA Tesla P100
25. Conclusions
OpenMP offloading compilers for NVIDIA GPUs have improved dramatically over the past year and are ready for real use.
• Will OpenMP Target code be portable between compilers?
Maybe. Compilers are at various levels of maturity, and SIMD support/requirements are inconsistent.
• Will OpenMP Target code be portable with the host?
Highly compiler-dependent. XL does this very well, CLANG somewhat well, and GCC and Cray did poorly.