In this talk I discuss some best practices for writing OpenMP for NVIDIA GPUs.
Video of this presentation can be found here: https://www.youtube.com/watch?v=9w_2tj2uD4M
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5 - Jeff Larkin
These slides are from an instructor-led tutorial at GTC16. The talk discusses using a pre-release version of CLANG with support for OpenMP offloading directives to NVIDIA GPUs to experiment with OpenMP 4.5 target directives.
Performance Portability Through Descriptive Parallelism - Jeff Larkin
This is a talk from the 2016 DOE Performance Portability workshop in Glendale, AZ. The purpose of this talk is to explain the concept of descriptive parallel programming and why it is one way to provide performance portability across a variety of parallel architectures.
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs - Jeff Larkin
This talk was presented at the DOE Centers of Excellence Performance Portability Workshop in August 2017. In this talk I explore the current status of four OpenMP 4.5 compilers for NVIDIA GPUs and CPUs from the perspective of performance portability between compilers and between the GPU and CPU.
These are slides from the Dec 17 SF Bay Area Julia Users meeting [1]. Ehsan Totoni presented the ParallelAccelerator Julia package, a compiler that performs aggressive analysis and optimization on top of the Julia compiler. Ehsan is a Research Scientist at Intel Labs working on the High Performance Scripting project.
[1] http://www.meetup.com/Bay-Area-Julia-Users/events/226531171/
This talk was given at GTC16 by James Beyer and Jeff Larkin, both members of the OpenACC and OpenMP committees. It's intended to be an unbiased discussion of the differences between the two languages and the tradeoffs to each approach.
High Performance Analytics Toolkit (HPAT) is a Julia-based framework for big data analytics on clusters that is both easy to use and extremely fast; it is orders of magnitude faster than alternatives like Apache Spark.
HPAT automatically parallelizes analytics tasks written in Julia and generates efficient MPI/C++ code.
TF2.0 is designed to improve usability and productivity. As an enthusiastic TF user, I am very excited. Personally, I think the most important question about usability is "how does TF provide a user-friendly API?" Setting aside the other aspects of TF 2.0, this post is a quick review from an API usage perspective.
An introduction to the OpenMP parallel programming model.
From the Scalable Computing Support Center at Duke University (http://wiki.duke.edu/display/scsc)
NYAI - Scaling Machine Learning Applications by Braxton McKee - Rizwan Habib
Scaling Machine Learning Systems - (Braxton McKee, CEO & Founder, Ufora)
Braxton is the technical lead and founder of Ufora, a software company that develops Pyfora, an automatically parallel implementation of the Python programming language that enables data science and machine-learning at scale. Before founding Ufora with backing from Two Sigma Ventures and others, Braxton led the ten-person MBS/ABS Credit Modeling team at Ellington Management Group, a multi-billion dollar mortgage hedge fund. He holds a BS (Mathematics), MS (Mathematics), and M.B.A. from Yale University.
Braxton will discuss scaling machine learning applications using the open-source platform Pyfora. He will describe both the general approach and also some specific engineering techniques employed in the implementation of Pyfora that make it possible to produce large-scale machine learning and data science programs directly from single-threaded Python code.
The release of TensorFlow 2.0 comes with a significant number of improvements over its 1.x version, all focused on usability and a better user experience. We will give an overview of what TensorFlow 2.0 is and discuss how to get started building models from scratch using TensorFlow 2.0’s high-level API, Keras. We will walk step by step through an example in Python of how to build an image classifier. We will then showcase how to leverage transfer learning to make building a model even easier! With transfer learning, we can leverage models pretrained on datasets such as ImageNet to drastically speed up the training of our model. TensorFlow 2.0 makes this incredibly simple to do.
Published on 11 May 2018
Chainer is a deep learning framework which is flexible, intuitive, and powerful.
This slide introduces some unique features of Chainer and its additional packages such as ChainerMN (distributed learning), ChainerCV (computer vision), ChainerRL (reinforcement learning), Chainer Chemistry (biology and chemistry), and ChainerUI (visualization).
Introduction to Chainer, a framework for neural networks, v1.11. Slides used for the student seminar on July 20, 2016, at the Sugiyama-Sato lab at the Univ. of Tokyo.
Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding and recommendation engines. One of the key reasons for this progress is the availability of highly flexible and developer friendly deep learning frameworks. During this workshop, members of the Amazon Machine Learning team will provide a short background on Deep Learning focusing on relevant application domains and an introduction to using the powerful and scalable Deep Learning framework, MXNet. At the end of this tutorial you’ll gain hands on experience targeting a variety of applications including computer vision and recommendation engines as well as exposure to how to use preconfigured Deep Learning AMIs and CloudFormation Templates to help speed your development.
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow - Databricks
Horovod makes it easy to train a single-GPU TensorFlow model on many GPUs - both on a single server and across multiple servers. The talk will touch upon the mechanisms of deep learning training, the challenges that distributed deep learning poses, the mechanics of Horovod, as well as the practical steps necessary to train a deep learning model on your favorite cluster.
Profiling PyTorch for Efficiency & Sustainability - geetachauhan
From my talk at the Data & AI Summit: the latest update on the PyTorch Profiler and how you can use it to optimize for efficiency. The talk also dives into the future and what we need to do together as an industry to move towards sustainable AI.
This is an introduction to Polyaxon and why I use it.
Polyaxon enables me to leverage Kubernetes to achieve these objectives:
- Make the lead time of experiments as short as possible.
- Make the financial cost of training models as low as possible.
- Make experiments reproducible.
These days it is easy to gather absurdly large amounts of data. But once you have it, how do you extract information from those amorphous mountains of data? In this short course we will present the MapReduce programming model: how it works, what it is for, and how to build applications with it. We will also see how to use Elastic MapReduce, Amazon's service that creates MapReduce clusters on demand, so that you worry not about administering and gaining access to a cluster of machines, but about how to make your code digest your data in a distributed fashion. We will see practical examples in action and code through some challenges together.
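As a minimal sketch of the model described above (plain Python rather than Hadoop or Elastic MapReduce), the classic word count shows the map, shuffle, and reduce phases; the function names here are illustrative, not part of any framework:

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    return chain.from_iterable(
        ((word.lower(), 1) for word in doc.split()) for doc in documents
    )

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the grouped values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data flows"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'flows': 1}
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the phase boundaries, however, are exactly these.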
Beyond Breakpoints: A Tour of Dynamic Analysis - Fastly
Despite advances in software design and static analysis techniques, software remains incredibly complicated and difficult to reason about. Understanding highly concurrent, kernel-level, and intentionally obfuscated programs is among the problem domains that spawned the field of dynamic program analysis. More than mere debuggers, the challenge for dynamic analysis tools is to record, analyze, and replay execution without sacrificing performance. This talk will provide an introduction to the dynamic analysis research space and hopefully inspire you to consider integrating these techniques into your own internal tools.
Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unit on one computer. Only one instruction may execute at a time—after that instruction is finished, the next is executed.
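A minimal Python sketch of that contrast: the same independent tasks executed one after another as a single instruction stream, then dispatched to a pool of workers. The `work` function is a hypothetical workload chosen for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def work(n):
    # An independent task with no shared state (hypothetical workload).
    return sum(i * i for i in range(n))

inputs = (1_000, 2_000, 3_000)

# Serial: one instruction stream; each task starts only after the
# previous one has finished.
serial = [work(n) for n in inputs]

# Parallel: the same independent tasks dispatched to a worker pool;
# pool.map preserves the input order in its results.
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(work, inputs))

print(serial == parallel)  # True: same results, different execution model
```

Because the tasks are independent, the parallel version computes identical results; the only thing that changes is when and where each instruction stream runs.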
Hybrid Model Based Testing Tool Architecture for Exascale Computing System - CSCJournals
Exascale computing refers to a computing system capable of at least one exaflop, expected within the next couple of years. Many new programming models, architectures, and algorithms have been introduced to reach exascale computing, with the primary objective of enhancing system performance. In modern supercomputers, the GPU is used to attain high computing performance, and the proposed technologies and programming models share much the same objective of making the GPU more powerful. These technologies still face a number of challenges, including parallelism, scale, and complexity, that must be addressed to make computing systems more powerful and efficient. In this paper, we present a testing tool architecture for a parallel programming approach using two programming models, CUDA and OpenMP, both of which can be used to program shared memory and GPU cores. The objective of this architecture is to identify static errors that occur while the code is written and cause an absence of parallelism. Our architecture guides developers toward writing feasible code, so that essential errors in the program are avoided and it runs successfully.
Program Assignment Process Management Objective This program a.docx - wkyra78
Program Assignment : Process Management
Objective: This program assignment is given in the Operating Systems course to allow students to figure out how a single process (parent process) creates a child process and how they work in a Unix/Linux (/Mac OS X/Windows) environment. Additionally, students should combine code describing inter-process communication into this assignment. The parent and child processes interact with each other through a shared memory-based communication scheme or a message passing scheme.
Environment: Unix/Linux environment (VM Linux or Triton Server, or Mac OS X), Windows platform
Language: C or C++, Java
Requirements:
i. You have a wide range of choices for this assignment. First, design your program to explain the basic concepts of process management in the Unix kernel. This main idea will then evolve to show your understanding of inter-process communication, file processing, etc.
ii. Refer to the following system calls:
- fork(), getpid(), family of exec(), wait(), sleep() system calls for process management
- shmget(), shmat(), shmdt(), shmctl() for shared memory support or
- msgget(), msgsnd(), msgrcv(), msgctl(), etc. for message passing support
iii. The program should present that two different processes, both parent and child, execute as they are supposed to.
iv. The output should contain the screen capture of the execution procedure of the program.
v. Interaction between parent and child processes can be provided through inter-process communication schemes, such as shared-memory or message passing schemes.
vi. Result should be organized as a document which explains the overview of your program, code, execution results, and the conclusion including justification of your program, lessons you've learned, comments, etc.
Note:
i. In addition, please try to understand how the local and global variables work across the processes
ii. read() or write () functions are used to understand how they work on the different processes.
iii. For extra credit, you can also incorporate advanced features, like socket or thread functions, into your code.
Examples:
1. Process Creation and IPC with Shared Memory Scheme
=============================================================
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void){
pid_t pid;
int segment_id; // identifier of the shared memory segment
char *shared_memory; // pointer to the attached segment
const int size = 4096;
segment_id = shmget(IPC_PRIVATE, size, S_IRUSR | S_IWUSR);
shared_memory = (char *) shmat(segment_id, NULL, 0);
pid = fork();
if(pid < 0) { // error
fprintf(stderr, "Fork failed\n");
return 1;
}
else if(pid == 0){ // child process
char *child_shared_memory;
child_shared_memory = (char *) shmat(segment_id, NULL, 0); // attach mem
sprintf(child_shared_memory, "Hello parent process!"); // write to the shared mem
shmdt(child_shared_memory); // detach
}
else { // parent process
wait(NULL); // wait for the child to finish writing
printf("Child says: %s\n", shared_memory); // read what the child wrote
shmdt(shared_memory); // detach
shmctl(segment_id, IPC_RMID, NULL); // remove the segment
}
return 0;
}
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16 - MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, TensorFlow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
The search for faster computing remains of great importance to the software community. Relatively inexpensive modern hardware, such as GPUs, allows users to run highly parallel code on thousands, or even millions of cores on distributed systems.
Building efficient GPU software is not a trivial task, often requiring a significant amount of engineering hours to attain the best performance. Similarly, distributed computing systems are inherently complex. In recent years, several libraries were developed to solve such problems. However, they often target a single aspect of computing, such as GPU computing with libraries like CuPy, or distributed computing with Dask.
Libraries like Dask and CuPy tend to provide great performance while abstracting the complexity away from non-experts, making them great candidates for developers writing software for a variety of applications. Unfortunately, they are often difficult to combine, at least efficiently.
With the recent introduction of NumPy community standards and protocols, it has become much easier to integrate any libraries that share the already well-known NumPy API. Such changes allow libraries like Dask, known for its easy-to-use parallelization and distributed computing capabilities, to defer some of that work to other libraries such as CuPy, providing users the benefits from both distributed and GPU computing with little to no change in their existing software built using the NumPy API.
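As a hedged illustration of those protocols: the NEP 18 `__array_function__` hook is what lets NumPy functions dispatch to foreign array types such as Dask or CuPy arrays. The tiny `TaggedArray` class below is a made-up example, not part of any of those libraries; it shows the mechanism in miniature:

```python
import numpy as np

class TaggedArray:
    """Minimal array-like that speaks the NumPy dispatch protocol (NEP 18)."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy hands us the public function (e.g. np.sum) instead of
        # computing the result itself. Unwrap our arguments, delegate to
        # NumPy's implementation, and re-wrap array results.
        unwrapped = [a.data if isinstance(a, TaggedArray) else a for a in args]
        result = func(*unwrapped, **kwargs)
        return TaggedArray(result) if isinstance(result, np.ndarray) else result

x = TaggedArray([1.0, 2.0, 3.0])
print(np.sum(x))   # dispatches through TaggedArray.__array_function__
print(np.mean(x))
```

Dask and CuPy implement this same hook on their own array types, which is why code written against the NumPy API can run distributed or on a GPU with little to no change.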
Data Analytics and Simulation in Parallel with MATLAB* - Intel® Software
This talk covers the current parallel capabilities in MATLAB*. Learn about its parallel language and distributed and tall arrays. Interact with GPUs both on the desktop and in the cluster. Combine this information into an interesting algorithmic framework for data analysis and simulation.
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom - Valeriy Kravchuk
A flame graph is a way to visualize profiling data that allows the most frequent code paths to be identified quickly and accurately. Flame graphs can be generated with Brendan Gregg's open source programs from github.com/brendangregg/FlameGraph, which create interactive SVG files to be viewed in a browser. The source of the profiling data does not really matter: it can be the perf profiler, bpftrace, Performance Schema, EXPLAIN output, or any other source whose data can be converted into the expected "collapsed" format of a semicolon-separated stack path plus a metric per line.
Different types of flame graphs (CPU, Off-CPU, Memory, Differential, etc.) are presented, along with various tools and approaches for collecting profiles of different aspects of MySQL server internals. Several real-life use cases where flame graphs helped to understand and solve a problem are discussed.
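A minimal sketch of the "collapsed" input format that flame graph tools consume: each line is a semicolon-joined stack followed by a space and a sample count. The `fold` helper is hypothetical, standing in for Brendan Gregg's stackcollapse scripts:

```python
from collections import Counter

def fold(samples):
    """Aggregate raw stack samples into FlameGraph 'collapsed' lines:
    semicolon-joined frames, a space, then the number of samples."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{path} {n}" for path, n in sorted(counts.items())]

# Each sample is a stack captured by a profiler, root frame first.
samples = [
    ["main", "parse", "read"],
    ["main", "parse", "read"],
    ["main", "render"],
]
for line in fold(samples):
    print(line)
# main;parse;read 2
# main;render 1
```

Feeding such lines to flamegraph.pl produces the interactive SVG; the width of each box is proportional to its count.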
Similar to Best Practices for OpenMP on GPUs - OpenMP UK Users Group:
FortranCon2020: Highly Parallel Fortran and OpenACC Directives - Jeff Larkin
Fortran has long been the language of computational math and science, and it has outlived many of the computer architectures on which it has been used. Modern Fortran must be able to run on modern, highly parallel, heterogeneous computer architectures. A significant number of Fortran programmers have had success programming for heterogeneous machines by pairing Fortran with the OpenACC language for directives-based parallel programming. This includes some of the most widely used Fortran applications in the world, such as VASP and Gaussian. This presentation will discuss what makes OpenACC a good fit for Fortran programmers and what the OpenACC language is doing to promote the use of native language parallelism in Fortran, such as DO CONCURRENT and coarrays.
Video Recording: https://www.youtube.com/watch?v=OXZ_Wkae63Y
Optimizing GPU to GPU Communication on Cray XK7 - Jeff Larkin
When developing an application for Cray XK7 systems, optimization of compute kernels is only a small part of maximizing scaling and performance. Programmers must consider the effect of the GPU’s distinct address space and the PCIe bus on application scalability. Without such considerations applications rapidly become limited by transfers to and from the GPU and fail to scale to large numbers of nodes. This paper will demonstrate methods for optimizing GPU to GPU communication and present XK7 results for these methods.
This presentation was originally given at CUG 2013.
A Comparison of Accelerator Programming Models - Jeff Larkin
This talk compares the pros and cons of GPU programming with CUDA C, CUDA Fortran, PGI accelerator directives, CAPS HMPP directives, and OpenCL. It was presented on April 30, 2010 in the OLCF Seminar Series at Oak Ridge National Lab. The original presentation is also available on prezi (http://prezi.com/5ogkgcw9qske/) and a video of the presentation will be posted elsewhere once available.
These slides are a series of "best practices" for running on the Cray XT line of supercomputers. This talk was presented at the HPCMP meeting at SDSC on 11/5/2009
Large Language Models and the End of Programming - Matt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
How to Position Your Globus Data Portal for Success: Ten Good Practices - Globus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus... - Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv... - Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
GraphSummit Paris - The art of the possible with Graph TechnologyNeo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
1. Tim Costa, NVIDIA HPC SW Product Manager
Jeff Larkin, Sr. DevTech Software Engineer
UK OpenMP Users Group, December 2020
BEST PRACTICES FOR OPENMP ON GPUS
2. THE FUTURE OF PARALLEL PROGRAMMING
Standard Languages | Directives | Specialized Languages
Maximize Performance with
Specialized Languages & Intrinsics
Drive Base Languages to Better
Support Parallelism
Augment Base Languages with
Directives
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
!$omp target data map(x,y)
...
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
...
!$omp end target data
attributes(global) subroutine saxpy(n, a, x, y)
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  if (i <= n) y(i) = y(i) + a*x(i)
end subroutine

program main
  real, allocatable :: x(:), y(:)
  real, device, allocatable :: d_x(:), d_y(:)
  d_x = x
  d_y = y
  call saxpy<<<(N+255)/256,256>>>(...)
  y = d_y
std::for_each_n(POL, idx(0), n,
[&](Index_t i){
y[i] += a*x[i];
});
3. THE ROLE OF DIRECTIVES FOR PARALLEL PROGRAMMING
[Diagram: serial programming languages plus directives yield parallel programming languages]
Directives convey additional information to the compiler.
4. AVAILABLE NOW: THE NVIDIA HPC SDK
Available at developer.nvidia.com/hpc-sdk, on NGC, and in the Cloud
Develop for the NVIDIA HPC Platform: GPU, CPU and Interconnect
HPC Libraries | GPU Accelerated C++ and Fortran | Directives | CUDA
7-8 Releases Per Year | Freely Available
NVIDIA HPC SDK components:
Compilers: nvcc, nvc, nvc++, nvfortran
Programming Models: Standard C++ & Fortran, OpenACC & OpenMP, CUDA
Core Libraries: libcu++, Thrust, CUB
Math Libraries: cuBLAS, cuTENSOR, cuSPARSE, cuSOLVER, cuFFT, cuRAND
Communication Libraries: Open MPI, NVSHMEM, NCCL
Profilers: Nsight Systems, Nsight Compute
Debugger: cuda-gdb (host and device)
5. HPC COMPILERS
NVC | NVC++ | NVFORTRAN
Programmable: Standard Languages, Directives, CUDA
Multicore: Directives, Vectorization
Multi-Platform: x86_64, Arm, OpenPOWER
Accelerated: Latest GPUs, Automatic Acceleration
6. BEST PRACTICES FOR OPENMP ON GPUS
Always use the teams and distribute directive to expose all available parallelism
Aggressively collapse loops to increase available parallelism
Use the target data directive and map clauses to reduce data movement between CPU and GPU
Use accelerated libraries whenever possible
Use OpenMP tasks to go asynchronous and better utilize the whole system
Use host fallback (if clause) to generate host and device code
7. Expose More Parallelism
Many codes already have parallel worksharing loops (Parallel For / Parallel Do); isn't that enough?
Parallel For creates a single contention group of threads with a shared view of memory and the ability to coordinate and synchronize.
This structure limits the degree of parallelism a GPU can exploit and leaves many GPU strengths unused.
Parallel For Isn't Enough
#pragma omp parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
[Diagram: an OMP PARALLEL region with an OMP FOR worksharing loop across a single thread team]
8. Expose More Parallelism
The teams and distribute directives expose coarse-grained, scalable parallelism, generating far more parallelism.
Parallel For is still used to engage the threads within each team.
Coarse- and fine-grained parallelism may be combined (A) or split (B).
How do I know which will be better for my code and machine?
Use Teams Distribute
#pragma omp target teams distribute
parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
A
#pragma omp target teams distribute
reduction(max:error)
for( int j = 1; j < n-1; j++) {
#pragma omp parallel for reduction(max:error)
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
B
[Diagram: OMP TEAMS with OMP DISTRIBUTE across teams and OMP FOR within each team]
9. Expose Even More Parallelism!
Collapsing loops increases the parallelism available for the compiler to exploit.
It's likely possible for this code to perform as well as parallel for on the CPU and still exploit more parallelism on a GPU.
Collapsing can erase possible gains from locality (maybe the tile clause can help?).
Not all loops can be collapsed.
Use the Collapse Clause Aggressively
#pragma omp target teams distribute
parallel for reduction(max:error)
collapse(2)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
10. Optimize Data Motion
The more loops you move to the device, the more data movement you may introduce.
Compilers will always choose correctness over performance, so the programmer can often do better.
Target data directives and map clauses let the programmer take control of data movement.
On unified memory machines, locality may be good enough, but optimizing the data movement explicitly is more portable.
Note: I've seen cases where registering arrays with CUDA may further reduce the remaining data movement.
Use Target Data Mapping
#pragma omp target data map(to:Anew) map(tofrom:A)
while ( error > tol && iter < iter_max )
{
… // Use A and Anew in target directives
}
11. Accelerated Libraries & OpenMP
Accelerated libraries are free performance.
The use_device_ptr clause enables sharing data from OpenMP to CUDA.
OpenMP 5.1 adds the interop construct to enable sharing CUDA streams.
Eat your free lunch
#pragma omp target data map(alloc:x[0:n],y[0:n])
{
#pragma omp target teams distribute parallel for
for( i = 0; i < n; i++)
{
x[i] = 1.0f;
y[i] = 0.0f;
}
#pragma omp target data use_device_ptr(x,y)
{
cublasSaxpy(n, 2.0, x, 1, y, 1);
// Synchronize before using results
}
#pragma omp target update from(y[0:n])
}
12. Tasking & GPUs
Data copies, GPU compute, and CPU compute can all overlap.
OpenMP uses its existing tasking framework to manage asynchronous execution and dependencies.
Your mileage may vary regarding how well the runtime maps OpenMP tasks to CUDA streams.
Get it working synchronously first!
Keep the whole system busy
#pragma omp target data map(alloc:a[0:N],b[0:N])
{
#pragma omp target update to(a[0:N]) nowait depend(inout:a)
#pragma omp target update to(b[0:N]) nowait depend(inout:b)
#pragma omp target teams distribute parallel for
nowait depend(inout:a)
for(int i=0; i<N; i++ )
{
a[i] = 2.0f * a[i];
}
#pragma omp target teams distribute parallel for
nowait depend(inout:a,b)
for(int i=0; i<N; i++ )
{
b[i] = 2.0f * a[i];
}
#pragma omp target update from(b[0:N]) nowait depend(inout:b)
#pragma omp taskwait depend(inout:b)
}
13. Host Fallback
The if clause can be used to selectively turn off OpenMP offloading.
Unless the compiler knows the value of the if clause at compile time, both code paths are generated.
For simple loops written to the previous guidelines, it may be possible to have unified CPU & GPU code (though there is no guarantee it will be optimal).
The "if" Clause
#pragma omp target teams distribute
parallel for reduction(max:error)
collapse(2) if(target:USE_GPU)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
14. The Loop Directive
The loop directive was added in OpenMP 5.0 as a descriptive option for parallel programming.
Loop asserts that the iterations of a loop may run in any order, including concurrently.
As more compilers and applications adopt it, we hope it will enable more performance portability.
Give descriptiveness a try
#pragma omp target teams loop
reduction(max:error)
collapse(2)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
15. BEST PRACTICES FOR OPENMP ON GPUS
Always use the teams and distribute directive to expose all available parallelism
Aggressively collapse loops to increase available parallelism
Use the target data directive and map clauses to reduce data movement between CPU and GPU
Use accelerated libraries whenever possible
Use OpenMP tasks to go asynchronous and better utilize the whole system
Use host fallback (if clause) to generate host and device code
Bonus: Give loop a try