It is widely acknowledged in the HPC community that the performance achievable with CPUs alone is limited for many computational cases. The EuroHPC pre-exascale and upcoming exascale systems rely mainly on accelerators, and some of the largest upcoming supercomputers, such as LUMI and Frontier, will be powered by AMD Instinct accelerators. However, these new systems create many challenges for developers who are not familiar with the new ecosystem or with the programming models required for heterogeneous architectures. In this paper, we present some of the better-known programming models for current and future GPU systems. We then measure the performance of each approach using a benchmark and a mini-app, test with various compilers, and tune the codes where necessary. Finally, we compare performance, where possible, between the NVIDIA Volta (V100) and Ampere (A100) GPUs and the AMD MI100 GPU.
Presentation of a paper accepted at Supercomputing Frontiers Asia 2022.
Evaluating GPU Programming Models for the LUMI Supercomputer
1. Evaluating GPU Programming Models for the LUMI Supercomputer
Supercomputing Frontiers Asia
March 1st, 2022
George S. Markomanolis, Aksel Alpay, Jeffrey Young, Michael Klemm, Nicholas Malaya, Aniello Esposito, Jussi Heikonen, Sergei Bastrakov, Alexander Debus, Thomas Kluge, Klaus Steiniger, Jan Stephan, Rene Widera and Michael Bussmann
Lead HPC Scientist, CSC – IT Center For Science Ltd.
5. Programming models - HIP
• HIP (Heterogeneous-compute Interface for Portability) is developed by AMD for programming AMD GPUs
• It is a C++ runtime API and it supports both the AMD and NVIDIA platforms
• HIP is similar to CUDA and there is no performance overhead on NVIDIA GPUs
• Many well-known libraries have been ported to HIP
• New projects, or code ported from CUDA, can be developed directly in HIP
• The supported CUDA API calls are renamed with the hip prefix (cudaMalloc -> hipMalloc)
• The hipify tools convert CUDA code to HIP; a minimal example follows
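A minimal HIP sketch (vector addition) illustrating the CUDA-like kernel syntax and the renamed API calls; the kernel, names, and sizes are illustrative and not taken from the benchmarks in this work.

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Illustrative vector-add kernel: the HIP kernel syntax mirrors CUDA.
__global__ void vector_add(const double* a, const double* b, double* c, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const size_t n = 1 << 20;
    std::vector<double> ha(n, 1.0), hb(n, 2.0), hc(n, 0.0);

    double *da, *db, *dc;                      // cudaMalloc -> hipMalloc
    hipMalloc(&da, n * sizeof(double));
    hipMalloc(&db, n * sizeof(double));
    hipMalloc(&dc, n * sizeof(double));

    hipMemcpy(da, ha.data(), n * sizeof(double), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(double), hipMemcpyHostToDevice);

    const int threads = 256;                   // a multiple of the 64-wide wavefront
    const int blocks  = (n + threads - 1) / threads;
    hipLaunchKernelGGL(vector_add, dim3(blocks), dim3(threads), 0, 0, da, db, dc, n);

    hipMemcpy(hc.data(), dc, n * sizeof(double), hipMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}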
6. Programming models - OpenMP Offloading
• Well-known programming model with a long history on CPUs
• GPU support since version 4.0
• For AMD GPUs we use the AMD OpenMP compiler (AOMP)
• With 120 compute units on the MI100, num_teams should be a multiple of 120 and thread_limit a multiple of 64 (the AMD wavefront size); see the sketch below
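A minimal, hypothetical OpenMP offloading sketch of a triad-style loop; the num_teams and thread_limit values follow the slide's rule of thumb (multiples of 120 and 64 for the MI100) but are illustrative, not the paper's exact settings.

#include <cstdio>
#include <vector>

int main()
{
    const int n = 1 << 25;
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
    const double scalar = 0.4;
    double *pa = a.data(), *pb = b.data(), *pc = c.data();

    // Illustrative triad-style loop offloaded to the device; num_teams is a
    // multiple of 120 (MI100 compute units) and thread_limit a multiple of 64.
    #pragma omp target teams distribute parallel for \
        num_teams(240) thread_limit(256) \
        map(to: pb[0:n], pc[0:n]) map(from: pa[0:n])
    for (int i = 0; i < n; ++i)
        pa[i] = pb[i] + scalar * pc[i];

    printf("a[0] = %f\n", pa[0]);
    return 0;
}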
7. Programming models - SYCL
• SYCL is an open standard for heterogeneous programming, developed and maintained by the Khronos Group.
• It expresses heterogeneous data parallelism in pure C++. The latest version is SYCL 2020, which relies on C++17.
• Starting with SYCL 2020, a generalized backend architecture was introduced that allows for backends other than OpenCL. Backends used by current SYCL implementations include OpenCL, Level Zero, CUDA, HIP and others.
• SYCL uses data-parallel kernels, execution is organized by a task graph, and there are two memory management models: the buffer-accessor model and unified shared memory (USM).
• For this work we use the hipSYCL implementation, which supports NVIDIA, AMD and Intel GPUs as well as CPUs; a small USM sketch follows.
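As a small illustration of the USM model mentioned above, the following SYCL 2020 sketch adds two vectors; the queue setup and names are illustrative and not taken from BabelStream.

#include <sycl/sycl.hpp>   // older hipSYCL releases may use <CL/sycl.hpp>
#include <cstdio>

int main()
{
    const size_t n = 1 << 20;
    sycl::queue q;   // default device selection

    // Unified shared memory (USM) allocations, one of the two SYCL memory models.
    double* a = sycl::malloc_shared<double>(n, q);
    double* b = sycl::malloc_shared<double>(n, q);
    double* c = sycl::malloc_shared<double>(n, q);
    for (size_t i = 0; i < n; ++i) { b[i] = 1.0; c[i] = 2.0; }

    // A data-parallel kernel submitted to the task graph.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        a[i] = b[i] + c[i];
    }).wait();

    printf("a[0] = %f\n", a[0]);
    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}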
8. Programming models - Kokkos
• Kokkos Core implements a programming model in C++ for writing performance-portable applications targeting all major HPC platforms. It provides abstractions for both parallel execution of code and data management. (ECP/NNSA)
• The Kokkos abstraction layer maps C++ source code to the specific instructions required for the backend at build time. When compiling the source code, the binary is built for the declared backends (see the sketch below):
o Serial backend, for serial code on a host device.
o Host-parallel backend, which executes in parallel on the host device (OpenMP API, etc.).
o Device-parallel backend, which offloads to a device such as a GPU.
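A minimal Kokkos sketch of the execution and data abstractions described above; the kernel labels and sizes are illustrative, and the backend (e.g. HIP or CUDA) is selected when Kokkos is built.

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // Views manage data in the memory space of the default backend chosen at build time.
        Kokkos::View<double*> a("a", n), b("b", n), c("c", n);
        Kokkos::deep_copy(b, 1.0);
        Kokkos::deep_copy(c, 2.0);

        // Parallel execution abstraction: runs on the default device backend.
        Kokkos::parallel_for("add", n, KOKKOS_LAMBDA(const int i) {
            a(i) = b(i) + c(i);
        });
        Kokkos::fence();

        // Reduction abstraction, e.g. a dot product.
        double sum = 0.0;
        Kokkos::parallel_reduce("dot", n, KOKKOS_LAMBDA(const int i, double& lsum) {
            lsum += a(i) * b(i);
        }, sum);
        printf("dot = %f\n", sum);
    }
    Kokkos::finalize();
    return 0;
}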
9. Programming models - OpenACC
• GCC will provide OpenACC support (Mentor Graphics contract, now Siemens EDA); we are checking its functionality
• HPE supports OpenACC v2.7 for Fortran only, which is quite an old version of the standard; HPE has announced that it will not support OpenACC for C/C++
• Clacc from ORNL: https://github.com/llvm-doe-org/llvm-project/tree/clacc/master provides OpenACC from LLVM, currently only for C (Fortran and C++ in the future)
o Translates OpenACC to OpenMP offloading; a minimal example follows
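A minimal OpenACC sketch of the kind of loop Clacc can translate to OpenMP offloading; the code is illustrative and written so it compiles as C++ (Clacc itself currently targets C).

#include <cstdio>
#include <vector>

int main()
{
    const int n = 1 << 20;
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    double *pa = a.data(), *pb = b.data(), *pc = c.data();

    // An OpenACC parallel loop; Clacc lowers such directives to OpenMP offloading.
    #pragma acc parallel loop copyin(pb[0:n], pc[0:n]) copyout(pa[0:n])
    for (int i = 0; i < n; ++i)
        pa[i] = pb[i] + pc[i];

    std::printf("a[0] = %f\n", pa[0]);
    return 0;
}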
10. Programming models - Alpaka
• The Abstraction Library for Parallel Kernel Acceleration (Alpaka) is a header-only C++14 abstraction library for accelerator development.
• Alpaka enables an accelerator-agnostic implementation of hierarchical redundant parallelism: a user can specify data and task parallelism at multiple levels of compute and memory for a particular platform.
• In addition to grids, blocks, and threads, Alpaka also provides an element construct that represents an n-dimensional set of inputs that is amenable to vectorization by the compiler.
• The platform is decided at compile time; single-source interface
• The C++ User interface for the Platform-independent Library Alpaka (CUPLA) makes it easy to port CUDA codes
• Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU), SYCL, etc.
13. Benchmarks and Applications (I)
• BabelStream
A memory-bound benchmark implemented in many programming models, with five kernels:
o add (a[i] = b[i] + c[i])
o mult (a[i] = b * c[i])
o copy (a[i] = b[i])
o triad (a[i] = b[i] + d * c[i])
o dot (sum = sum + a[i] * b[i])
o The default problem size is 2^25 FP64 array elements and 100 iterations. We evaluate BabelStream v3.4 (6fe81e1). We also present the first results of BabelStream with the Alpaka backend.
14. Benchmarks and Applications (II)
• miniBUDE
We also use the mini-app miniBUDE, for the Bristol University Docking Engine (BUDE), the compute-bound kernel of a drug discovery application.
BUDE predicts the binding energy of a ligand with the target; however, there are many ways this binding could happen, and a variety of positions and rotations of the ligand relative to the protein, known as poses, are explored. For each pose, a number of properties are evaluated.
o It supports various programming models, but we mainly evaluate CUDA and HIP.
o We use the version with commit 1af5b39.
• Both BabelStream and miniBUDE support many programming models, such as HIP, CUDA, OpenMP offloading, SYCL, OpenACC, and Kokkos.
15. Methodology and Hardware
• Compile based on the available instructions from each benchmark/application
• Execute each case 10 times and plot boxplots
• Tune where necessary based on the number of compute units on the GPU and the wavefront/warp size
17. Results - BabelStream
Copy kernel
• AOMP shows lower performance compared to the other programming models
• The default Kokkos implementation in BabelStream does not provide tuning options
• hipSYCL shows a variation on A100 of less than 2%, and Alpaka achieves performance similar to HIP on the MI100
• MI100 is 26-34% slower than A100 and 15-37% faster than V100
• MI100 achieves 97.8-99.99% of the peak performance except for the OpenMP results, while the NVIDIA A100 achieves 96.3-98.5% and the V100 89-98.2%
18. Results - BabelStream
Multiply kernel
• For MI100, most of the programming models achieve 96.63% of peak and above, except OpenMP, while for A100 all the programming models perform close to the peak; for V100, however, the untuned Kokkos underperforms.
• Most programming models on MI100 perform 17.8-37.8% faster than on V100, and similarly MI100 is 25-31% slower than A100.
19. Results - BabelStream
Add kernel
• The performance of the Alpaka programming model is quite close to HIP/CUDA on all devices, with hipSYCL following
• OpenMP is less efficient on the MI100 compared to the other GPUs. Kokkos' performance appears to lie between Alpaka and hipSYCL.
• MI100 is 14.74-21.60% faster than V100 and 28.55-32.86% slower than A100
• MI100 achieves 94.52-99.99% of its peak while the other GPUs achieve 97.6-99.99%
20. Results - BabelStream
Triad kernel
• For this kernel Alpaka performs on par with HIP/CUDA, followed by Kokkos, then hipSYCL and OpenMP.
• MI100 is 28.93-32.81% slower than A100 and 18.63-21.61% faster than V100
• MI100 achieves 94.62-99.94% of the peak while the NVIDIA GPUs achieve at least 98%.
21. Results - BabelStream
Dot kernel
• In the first (V100) segment of the figure, the MI100 hipSYCL result is also plotted, as the values were very close.
• The MI100 GPU is around 2.69-14.62% faster than the NVIDIA V100 except for OpenMP, and 28.00-69.69% slower than the A100.
• For HIP/CUDA, we define 216 blocks of threads for the A100, as it has 108 streaming multiprocessors, which improved the dot kernel performance by 8-10%.
• For hipSYCL, we use 960 blocks of threads and 256 threads per block for the MI100, achieving around 8% better performance than the default values.
• For Alpaka, we also tune the dot kernel, with 720 blocks of threads for the MI100, improving its performance by 28% compared to the default settings.
• OpenMP with AOMP underperforms, almost 50% slower than on V100. (A tuning sketch follows.)
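To make the block-count tuning above concrete, here is a hedged HIP sketch of a grid-stride dot kernel with a per-block reduction, in the spirit of BabelStream's dot kernel; the block and thread counts are illustrative placeholders for the tuned values, not the paper's exact code.

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

constexpr int TBSIZE  = 256;   // threads per block (multiple of the 64-wide wavefront)
constexpr int NBLOCKS = 960;   // tunable: e.g. a multiple of the 120 CUs on MI100

// Grid-stride dot kernel with a per-block shared-memory reduction;
// the host finishes the sum over the per-block partial results.
__global__ void dot_kernel(const double* a, const double* b, double* block_sums, size_t n)
{
    __shared__ double tb_sum[TBSIZE];
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;
    for (; i < n; i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];
    tb_sum[threadIdx.x] = sum;
    for (int offset = blockDim.x / 2; offset > 0; offset /= 2) {
        __syncthreads();
        if (threadIdx.x < offset)
            tb_sum[threadIdx.x] += tb_sum[threadIdx.x + offset];
    }
    if (threadIdx.x == 0) block_sums[blockIdx.x] = tb_sum[0];
}

int main()
{
    const size_t n = 1 << 25;
    std::vector<double> ha(n, 1.0), hb(n, 2.0), hsums(NBLOCKS);
    double *da, *db, *dsums;
    hipMalloc(&da, n * sizeof(double));
    hipMalloc(&db, n * sizeof(double));
    hipMalloc(&dsums, NBLOCKS * sizeof(double));
    hipMemcpy(da, ha.data(), n * sizeof(double), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(double), hipMemcpyHostToDevice);

    hipLaunchKernelGGL(dot_kernel, dim3(NBLOCKS), dim3(TBSIZE), 0, 0, da, db, dsums, n);
    hipMemcpy(hsums.data(), dsums, NBLOCKS * sizeof(double), hipMemcpyDeviceToHost);

    double sum = 0.0;
    for (double s : hsums) sum += s;
    printf("dot = %f\n", sum);
    hipFree(da); hipFree(db); hipFree(dsums);
    return 0;
}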
22. BabelStream Summary
• The peak bandwidth percentage for the OpenMP programming model is 42.16-94.68% on MI100, while it is at least 96% on the NVIDIA GPUs, which demonstrates that AOMP needs further development.
• For Kokkos, the range is 74-99% for MI100 and 72.87-99% for the NVIDIA GPUs; the non-optimized version has lower performance mainly on MI100 and V100.
• hipSYCL achieves at least 96% of the HIP/CUDA performance, except for the dot kernel on MI100, where it achieves 89.53%.
• Alpaka achieves at least 97.2% in all cases, demonstrating its performance.
• The OpenMP compiler performs better on the NVIDIA GPUs for the dot kernel; however, the version we used for AMD GPUs is not the final product yet.
• Overall, Alpaka's performance is quite close to HIP, followed by hipSYCL and Kokkos, while OpenMP can be slower depending on the kernel.
23. Results - MiniBUDE
• Large problem sizes are required for miniBUDE to saturate the GPUs.
• For every experiment, we execute 8 iterations with 983,040 poses; this number was determined by tuning for the AMD MI100.
• miniBUDE reports single-precision GFLOP/s in its output, and we observe that the AMD MI100 GPU achieves performance within 2% of the A100 and around 26% above the V100.
• Regarding single precision, we also tested the mixbench benchmark: the MI100 was 1.16 times faster than the A100, with both achieving close to their peak performance.
24. Conclusion
• We presented a workflow for porting applications to the LUMI supercomputer, an AMD GPU-based system
• We utilize a benchmark and a mini-app, which are memory-bound and compute-bound respectively, illustrate how various programming models perform on these GPUs, and show which techniques can improve the performance in specific cases
• We presented the first results of BabelStream with the Alpaka backend
• We discuss the lack of performance in some aspects of OpenMP
• HIP/CUDA perform quite well and most of the programming models are quite close, depending on the kernel pattern. However, users should be aware of the behaviour of the other programming models, whether they underperform in some cases (like OpenMP for some patterns) or involve a learning curve (like Kokkos tuning)
• Programming models such as Alpaka and SYCL could be utilized as they support many backends, are portable, and for many kernels provide performance similar to HIP
• To support reproducibility, instructions for reproducing the experiments on NVIDIA V100/A100 and AMD MI100 are available at: https://zenodo.org/record/6307447
25. Future work
• We are investigating the issues that some programming models have with miniBUDE
• We want to analyze OpenACC performance with the GCC and Cray Fortran compilers
• Test the new functionalities of the ROCm platform, such as Heterogeneous Memory Management (HMM)
• We envision evaluating multi-GPU benchmarks and their scalability across multiple nodes
• On LUMI we will be able to use the MI250X GPU and compare it with the current GPU generation, as well as test packed FP32
• Finally, we are interested in benchmarking I/O from GPU memory, as we have already hipified the Elbencho benchmark