This document summarizes the speaker's work developing GPU rigid body simulation techniques over three generations. It discusses optimizations made to port various stages of rigid body simulation, such as collision detection and constraint solving, to the GPU. These include using parallel sorting, grid structures, and graph coloring to parallelize pair testing and constraint solving while avoiding write conflicts between GPU threads. These techniques achieve CPU-level rigid body simulation quality entirely on the GPU.
The document provides an introduction to GPU computing. It begins with a simple OpenGL program to demonstrate parallelism on the GPU. It then discusses shader programs and lighting effects. Next, it covers using the OpenGL pipeline and GLSL to perform computations by rendering to a framebuffer object. The document introduces compute shaders and compares CUDA and OpenCL programming models. It provides a code example and discusses resources for learning more about GPU computing.
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ... - AMD Developer Central
Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
Easy and High Performance GPU Programming for Java Programmers - Kazuaki Ishizaki
IBM researchers presented techniques for executing Java programs on GPUs using IBM Java 8. Developers can write parallel programs using standard Java 8 stream APIs without annotations. The IBM Java runtime optimizes the programs for GPU execution by exploiting read-only caches, reducing data transfers between CPU and GPU, and eliminating redundant exception checks. Benchmark results showed the GPU version was 58.9x faster than single-threaded CPU code and 3.7x faster than 160-threaded CPU code on average, achieving good performance gains.
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu... - AMD Developer Central
Presentation CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman at the AMD Developer Summit (APU13) November 11-13, 2013.
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by ... - AMD Developer Central
Keynote presentation, Is There Anything New in Heterogeneous Computing, by Mike Muller, Chief Technology Officer, ARM, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
PgOpenCL is a new PostgreSQL procedural language that allows developers to write OpenCL kernels to harness the parallel processing power of GPUs. It introduces a new execution model where tables can be copied to arrays, passed to an OpenCL kernel for parallel operations on the GPU, and results copied back to tables. This unlocks the potential for dramatically improved performance on compute-intensive database operations like joins, aggregations, and sorting.
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W... - AMD Developer Central
Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13) November 11-13, 2013.
GPUs are specialized processors designed for graphics processing. CUDA (Compute Unified Device Architecture) allows general purpose programming on NVIDIA GPUs. CUDA programs launch kernels across a grid of blocks, with each block containing multiple threads that can cooperate. Threads have unique IDs and can access different memory types including shared, global, and constant memory. Applications that map well to this architecture include physics simulations, image processing, and other data-parallel workloads. The future of CUDA includes more general purpose uses through GPGPU and improvements in virtual memory, size, and cooling.
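The grid/block/thread model summarized above can be shown in a minimal CUDA C sketch. The array size, block size, and use of unified memory here are illustrative choices, not taken from the document, and error checking is omitted for brevity:

```cuda
#include <stdio.h>

// Each thread computes one element; a thread is identified by its
// block index (blockIdx) and its index within the block (threadIdx).
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global ID
    if (i < n)                                      // guard the tail block
        c[i] = a[i] + b[i];
}

int main(void) {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the host code short (requires a fairly
    // recent GPU; explicit cudaMemcpy is the more general pattern).
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                         // threads per block
    int blocks = (N + threads - 1) / threads;  // blocks in the grid
    add<<<blocks, threads>>>(a, b, c, N);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);               // each element is 1 + 2
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Shared and constant memory (mentioned above) are declared with the `__shared__` and `__constant__` qualifiers and are covered by later chapters of the book.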
Highlighted notes of:
Chapter 9: Atomics
Book:
CUDA by Example
An Introduction to General Purpose GPU Computing
Authors:
Jason Sanders
Edward Kandrot
“This book is required reading for anyone working with accelerator-based computing systems.”
–From the Foreword by Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory
CUDA is a computing architecture designed to facilitate the development of parallel programs. In conjunction with a comprehensive software platform, the CUDA Architecture enables programmers to draw on the immense power of graphics processing units (GPUs) when building high-performance applications. GPUs, of course, have long been available for demanding graphics and game applications. CUDA now brings this valuable resource to programmers working on applications in other domains, including science, engineering, and finance. No knowledge of graphics programming is required–just the ability to program in a modestly extended version of C.
CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. The authors introduce each area of CUDA development through working examples. After a concise introduction to the CUDA platform and architecture, as well as a quick-start guide to CUDA C, the book details the techniques and trade-offs associated with each key CUDA feature. You’ll discover when to use each CUDA C extension and how to write CUDA software that delivers truly outstanding performance.
Table of Contents
Why CUDA? Why Now?
Getting Started
Introduction to CUDA C
Parallel Programming in CUDA C
Thread Cooperation
Constant Memory and Events
Texture Memory
Graphics Interoperability
Atomics
Streams
CUDA C on Multiple GPUs
The Final Countdown
All the CUDA software tools you’ll need are freely available for download from NVIDIA.
Jason Sanders is a senior software engineer in NVIDIA’s CUDA Platform Group. He helped develop early releases of CUDA system software and contributed to the OpenCL 1.0 Specification, an industry standard for heterogeneous computing. He has held positions at ATI Technologies, Apple, and Novell.
Edward Kandrot is a senior software engineer on NVIDIA’s CUDA Algorithms team with more than twenty years of industry experience optimizing code performance for firms including Adobe, Microsoft, Google, and Autodesk.
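Since these notes cover Chapter 9 (Atomics), a minimal sketch of the chapter's central idea may help: when thousands of threads increment shared counters, the read-modify-write sequence races unless it is made atomic. The histogram kernel below illustrates this with `atomicAdd`; the sizes and launch configuration are illustrative, not from the book, and error checking is omitted:

```cuda
#include <stdio.h>

// Many threads may hit the same bin at once; atomicAdd makes each
// counter's read-modify-write safe against that race. A plain
// bins[data[i]]++ here would silently lose increments.
__global__ void histogram(const unsigned char *data, int n,
                          unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;       // grid-stride loop
    while (i < n) {
        atomicAdd(&bins[data[i]], 1u);         // safe concurrent increment
        i += stride;
    }
}

int main(void) {
    const int N = 1 << 20;
    unsigned char *data;
    unsigned int *bins;
    cudaMallocManaged(&data, N);
    cudaMallocManaged(&bins, 256 * sizeof(unsigned int));
    for (int i = 0; i < N; ++i) data[i] = i % 256;
    cudaMemset(bins, 0, 256 * sizeof(unsigned int));

    histogram<<<64, 256>>>(data, N, bins);
    cudaDeviceSynchronize();

    unsigned long long total = 0;              // sanity check: bins sum to N
    for (int b = 0; b < 256; ++b) total += bins[b];
    printf("total = %llu (expect %d)\n", total, N);
    cudaFree(data); cudaFree(bins);
    return 0;
}
```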
The document discusses optimizing DirectCompute for graphics processing units (GPUs). It introduces DirectCompute as an API for general purpose GPU programming and explains how it maps better than traditional graphics APIs to the architecture of GPUs like AMD's GCN. Key optimization techniques covered include choosing a thread group size that is a multiple of 64 threads to fully utilize the SIMD hardware, and using thread group shared memory to improve data access speeds compared to off-chip memory. Bank conflicts in shared memory access are also discussed as something to avoid for best performance.
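The two DirectCompute optimizations described above, a group size matched to the hardware SIMD width and on-chip shared memory, have direct CUDA counterparts. The kernel below is an illustrative CUDA analog (not code from the document): the block size is a multiple of the SIMD width, and partial sums are staged in `__shared__` memory rather than off-chip global memory:

```cuda
#define BLOCK 256  // multiple of the SIMD width (32 on NVIDIA, 64 on AMD GCN)

// One partial sum per block, staged in fast on-chip shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float cache[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction; at each step, consecutive threads access
    // consecutive shared-memory banks, avoiding bank conflicts.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = cache[0];  // one value per block
}
```

The host would launch this with `BLOCK` threads per block and then reduce the per-block results, either on the CPU or with a second kernel pass.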
A graphics processing unit or GPU (also occasionally called a visual processing unit or VPU) is a specialized microprocessor that offloads and accelerates graphics rendering from the central processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a CPU, only a fraction of the chip performs computations, whereas a GPU devotes more of its transistors to data processing.
GPGPU is a programming methodology based on modifying algorithms to run on existing GPU hardware for increased performance. Unfortunately, GPGPU programming is significantly more complex than traditional programming for several reasons.
Trends of SW Platforms for Heterogeneous Multi-core systems and Open Source ... - Seunghwa Song
This document discusses trends in software platforms for heterogeneous multi-core systems and open source community activities. It describes how hardware limitations led to the rise of multi-core processors and classifications of multi-processor systems. Issues with programming heterogeneous systems are outlined along with solutions like OpenCL and efforts by organizations like ETRI and OpenSEED to develop software for these systems.
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang... - AMD Developer Central
Presentation WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang, John Yoon and Nicolas Lorain at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
The document summarizes Kazuaki Ishizaki's talk on making hardware accelerators easier to use. Some key points:
- Programs are becoming simpler while hardware is becoming more complicated, with commodity processors including hardware accelerators like GPUs.
- The speaker's recent work focuses on generating hardware accelerator code from high-level programs without needing specific hardware knowledge.
- An approach using a Java JIT compiler was presented that can generate optimized GPU code from parallel Java streams, requiring programmers to only express parallelism.
- The JIT compiler performs optimizations like aligning arrays, using read-only caches, reducing data transfer, and eliminating exception checks.
- Benchmarks show the generated GPU code achieves substantial speedups over CPU execution.
Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011) - Takahiro Harada
This document discusses using a heterogeneous CPU-GPU approach for particle-based simulations with particles of varying sizes. Large particles are handled by the CPU due to irregular workloads, while small particles are handled by the GPU due to uniform workloads. Optimizations include multi-threading large-small collisions on the CPU, spatially sorting small particles on the GPU to improve cache utilization, and load balancing work between the CPU and GPU. The approach leverages the strengths of the CPU and GPU on AMD's fusion architecture to efficiently simulate mixed particle systems.
CC-4005, Performance analysis of 3D Finite Difference computational stencils ... - AMD Developer Central
The document discusses performance analysis of 3D finite difference computational stencils on Seamicro fabric compute systems. It provides an overview of the hardware including chassis, compute cards, storage cards, and 3D torus fabric topology. It then describes the software stack and various microbenchmarks performed, including CPU, memory, network and storage benchmarks. It also describes modeling of 3D Laplace's equation using an 8th order finite difference scheme and its discretization over a 25 point stencil for computation on the system.
This document provides a high-level overview of GPU architecture, AMD and Nvidia GPU hardware, the OpenCL compilation system, and the installable client driver (ICD). It contrasts conventional CPU and GPU architectures, describes the SIMD and SIMT execution models, and examines key aspects of AMD's VLIW and Nvidia's scalar architectures like memory hierarchies and how they map to the OpenCL memory model. It stresses that understanding hardware can help optimize OpenCL code and provides guidelines for writing optimal GPU kernels.
Kazuaki Ishizaki is a research staff member at IBM Research - Tokyo who is interested in compiler optimizations, language runtimes, and parallel processing. He has worked on the Java virtual machine and just-in-time compiler for over 20 years. His message is that Spark can utilize GPUs to accelerate computation-heavy applications in a transparent way. He proposes new binary columnar and GPU enabler components that would efficiently store and handle data on GPUs without requiring changes to Spark programs. This could be implemented either through a Spark plugin for RDDs or by enhancing the Catalyst optimizer in Spark to generate GPU code.
Using GPUs to handle Big Data with Java, by Adam Roberts - J On The Beach
Modern graphics processing units (GPUs) are efficient general-purpose stream processors. Learn how Java can exploit the power of GPUs to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
The document discusses graphics processing units (GPUs) and general-purpose GPU (GPGPU) computing. It explains that GPUs were originally designed for computer graphics but can now be used for general computations through GPGPU. The document outlines CUDA and MPI frameworks for programming GPGPU applications and discusses how GPGPU provides highly parallel processing that is much faster than traditional CPUs. Example applications mentioned include molecular dynamics, bioinformatics, and high performance computing.
This document summarizes VPU and GPGPU technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU architecture including stream processing, graphics pipelines, shaders, and GPU clusters. It provides an example of using CUDA for GPU computing and discusses how GPUs are used for general purpose computing through frameworks like CUDA.
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ... - AMD Developer Central
Keynote presentation, The Role of Java in Heterogeneous Computing, and How You Can Help, by Nandini Ramani, VP, Java Platform, Oracle Corporation, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
1. The document provides 15 suggestions for how to truly live life rather than just exist, including appreciating those who have hurt you, choosing to listen to your inner voice, embracing change, and letting go of the past instead of dwelling on mistakes.
2. Some key recommendations are to recognize the good in relationships despite people's imperfections, love others unconditionally, enjoy life's little moments, and accept situations that cannot be fixed instead of worrying about them.
3. The overall message is to appreciate each day, love freely, and find happiness by living consciously in the present moment rather than focusing on things that cannot be changed.
Presentation at Access International Markets - Ryo Umezawa
Ryo Umezawa is a co-founder and COO of Pido Inc., a mobile startup in Tokyo. He has previously founded or held roles at several other technology companies in Japan. The document discusses opportunities in the Japanese mobile market, including strong smartphone adoption, mobile payment services with over 10 million users, and a mobile contents market worth over 500 billion yen annually. It also profiles Umezawa's experience and provides tips for foreign companies looking to enter the Japanese market.
Direct mail's key strength is the "mail moment": 97% of consumers open their mailbox every day but receive only two direct mail pieces per week, so each piece has a high chance of being seen, whereas email is checked less frequently than physical mail. Direct mail represents 10.3% of gross media spending in Belgium. Email's challenges are spam, gaining attention, and multiple addresses per person; direct mail's are cost, flexibility, and keeping databases updated. Email campaigns are measured by open rates, clickthroughs, and interactions, while direct mail's strengths lie in cost control and targeting unique individuals.
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS - cseij
This document summarizes a survey on GPU systems and their performance on different applications. It discusses how GPUs can be used for general-purpose computing due to their highly parallel processing capabilities. Several computationally intensive applications that achieve speedups when implemented on GPUs are described, including video decoding, matrix multiplication, parallel AES encryption, and password recovery for MS Office documents. The GPU architecture and Nvidia's CUDA programming model are also summarized. While GPUs provide significant performance benefits, some limitations for non-graphics applications are noted. The conclusion is that GPUs are a good alternative for computationally intensive tasks, reducing CPU load and improving performance compared to CPU-only implementations.
Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture - mohamedragabslideshare
This document summarizes research on revisiting co-processing techniques for hash joins on coupled CPU-GPU architectures. It discusses three co-processing mechanisms: off-loading, data dividing, and pipelined execution. Off-loading involves assigning entire operators like joins to either the CPU or GPU. Data dividing partitions data between the processors. Pipelined execution aims to schedule workloads adaptively between the CPU and GPU to maximize efficiency on the coupled architecture. The researchers evaluate these approaches for hash join algorithms, which first partition the input relations, then build hash tables, and finally probe them.
This document provides an overview of parallel and distributed computing using GPUs. It discusses GPU architecture and how GPUs are designed for massively parallel processing using hundreds of smaller cores, compared to CPUs which use 4-8 larger cores. The document also covers the GPU memory hierarchy, programming GPUs using OpenCL, and key concepts such as work items, work groups, and occupancy, which means keeping the GPU's compute units supplied with work.
CUDA by Example : Graphics Interoperability : NotesSubhajit Sahu
Highlighted notes of:
Chapter 8: Graphics Interoperability
Book:
CUDA by Example
An Introduction to General Purpose GPU Computing
Authors:
Jason Sanders
Edward Kandrot
“This book is required reading for anyone working with accelerator-based computing systems.”
–From the Foreword by Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory
CUDA is a computing architecture designed to facilitate the development of parallel programs. In conjunction with a comprehensive software platform, the CUDA Architecture enables programmers to draw on the immense power of graphics processing units (GPUs) when building high-performance applications. GPUs, of course, have long been available for demanding graphics and game applications. CUDA now brings this valuable resource to programmers working on applications in other domains, including science, engineering, and finance. No knowledge of graphics programming is required–just the ability to program in a modestly extended version of C.
CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. The authors introduce each area of CUDA development through working examples. After a concise introduction to the CUDA platform and architecture, as well as a quick-start guide to CUDA C, the book details the techniques and trade-offs associated with each key CUDA feature. You’ll discover when to use each CUDA C extension and how to write CUDA software that delivers truly outstanding performance.
Table of Contents
Why CUDA? Why Now?
Getting Started
Introduction to CUDA C
Parallel Programming in CUDA C
Thread Cooperation
Constant Memory and Events
Texture Memory
Graphics Interoperability
Atomics
Streams
CUDA C on Multiple GPUs
The Final Countdown
All the CUDA software tools you’ll need are freely available for download from NVIDIA.
Jason Sanders is a senior software engineer in NVIDIA’s CUDA Platform Group, helped develop early releases of CUDA system software and contributed to the OpenCL 1.0 Specification, an industry standard for heterogeneous computing. He has held positions at ATI Technologies, Apple, and Novell.
Edward Kandrot is a senior software engineer on NVIDIA’s CUDA Algorithms team, has more than twenty years of industry experience optimizing code performance for firms including Adobe, Microsoft, Google, and Autodesk.
The document discusses visualization systems and proposes concepts for their future development. It summarizes:
1) The "Visual Realityware" visualization software development environment, which uses an abstraction layer to allow developers to freely select mainstream graphics technologies and expand applications across multiple platforms with minimal bugs.
2) An application called "Virtual Anatomia" developed using Visual Realityware to visualize 3D biological data in real-time.
3) The concept of "Visionize" which is defined as a risk management methodology using visual communication to allow sharing of goals and visions in order to identify and prevent risks before issues arise.
Accelerating Real Time Applications on Heterogeneous PlatformsIJMER
In this paper we describe about the novel implementations of depth estimation from a stereo
images using feature extraction algorithms that run on the graphics processing unit (GPU) which is
suitable for real time applications like analyzing video in real-time vision systems. Modern graphics
cards contain large number of parallel processors and high-bandwidth memory for accelerating the
processing of data computation operations. In this paper we give general idea of how to accelerate the
real time application using heterogeneous platforms. We have proposed to use some added resources to
grasp more computationally involved optimization methods. This proposed approach will indirectly
accelerate a database by producing better plan quality.
The document describes mapping the SMAC atmospheric correction algorithm onto a GPU to improve its computational performance. Key points:
1. The SMAC algorithm consists of 14 computational steps that are mapped to GPU threads for parallel execution.
2. Implementing SMAC on GPU involves transferring input data from CPU to GPU memory, executing the SMAC kernel on the GPU, and copying outputs back to CPU memory.
3. An experiment compares the execution time of SMAC on CPU versus GPU. The improvement from GPU parallelization is calculated based on a linear execution-time model.
OpenGL Based Testing Tool Architecture for Exascale ComputingCSCJournals
1) The document proposes an OpenGL based testing tool architecture for exascale computing to improve performance and accuracy of OpenGL programs.
2) It identifies common errors that occur when programming shaders in OpenGL Shading Language (GLSL) such as errors in file reading, compilation, linking, and rendering.
3) The proposed testing architecture divides the GLSL programming process into four stages - file reading, compilation, pre-linking/linking, and rendering - and validates each stage to detect errors and enforce error-free code.
The complexity of Medical image reconstruction requires tens to hundreds of billions of computations per second. Until few years ago, special purpose processors designed especially for such applications were used. Such processors require significant design effort and are thus difficult to change as new algorithms in reconstructions evolve and have limited parallelism. Hence the demand for flexibility in medical applications motivated the use of stream processors with massively parallel architecture. Stream processing architectures offers data parallel kind of parallelism.
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUScsandit
The document discusses improving the performance of the Tabu Search algorithm for solving the Permutation Flowshop Scheduling Problem (PFSP) on CUDA GPUs by avoiding duplicated computation among threads. It first provides background on GPU architecture, the PFSP problem, and related parallelization methods. It then observes that if two permutations share the same prefix, their completion time tables will contain identical column data equal to the length of the prefix, leading to duplicated computation. The paper proposes an approach where each thread is assigned a permutation and allocated shared memory to store and compute the completion time table in parallel, avoiding this duplicated work by leveraging the shared prefix property. Experimental results show the new approach runs up to 1.5 times faster than an existing
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUScscpconf
Graphics Processing Units (GPUs) have been emerged as powerful parallel compute platforms for various
application domains. A GPU consists of hundreds or even thousands processor cores and adopts Single
Instruction Multiple Threading (SIMT) architecture. Previously, we have proposed an approach that
optimizes the Tabu Search algorithm for solving the Permutation Flowshop Scheduling Problem (PFSP)
on a GPU by using a math function to generate all different permutations, avoiding the need of placing all
the permutations in the global memory. Based on the research result, this paper proposes another
approach that further improves the performance by avoiding duplicated computation among threads,
which is incurred when any two permutations have the same prefix. Experimental results show that the
GPU implementation of our proposed Tabu Search for PFSP runs up to 1.5 times faster than another GPU
implementation proposed by Czapiński and Barnes.
IRJET- Latin Square Computation of Order-3 using Open CLIRJET Journal
This document discusses using OpenCL parallel programming to compute Latin squares of order 3 more efficiently than sequential algorithms. It proposes dividing the input matrix into sub-matrices that are processed concurrently by multiple processing elements in the GPU. This parallel approach reduces the computation time compared to performing the operations sequentially on the CPU. First, the input matrix is divided based on task or data parallelism. Then the sub-matrices are computed simultaneously by different processing elements. The results are combined and stored in GPU memory before being transferred to CPU memory and output. Implementing the Latin square computation with OpenCL exploits parallelism to improve efficiency over the traditional sequential approach.
This document summarizes a thesis conducted using the GPGPU-Sim simulation software to study the impact of hardware features on the performance of GPU computing. The author used a convolution algorithm test on simulated NVIDIA GTX480 and Tesla C2050 GPUs while varying hardware parameters like cache sizes, frequencies, and number of cores. Results showed the GTX480 performed better than the Tesla C2050. Tests also found that a graphics card performing convolution relies most on L1 cache but has worst performance with a small shared memory. Using this simulator on theoretical GPU architectures could lead to better embedded system GPU designs.
CSCI 2121- Computer Organization and Assembly Language Labor.docxannettsparrow
CSCI 2121- Computer Organization and Assembly Language
Laboratory No. 5
Week of March 12th, 2018
Submission Instructions:
1. Save the files as shift.sv and vending.sv
2. Put all files into one folder.
3. Compress the folder into a .zip file.
4. Submit the zip file on Brightspace.
5. Submission Deadline: Sunday, April 15th, 2018, 11.55PM
SRC CHIP PROJECT
In order to proceed on the project for this course, there are several key Verilog concepts which are
necessary in the implementation of this code.
Part 0: More Verilog concepts.
Shared busses and tri-state buffers
Before we begin implementing the CPU, let's go over a few concepts which will be needed for the lab
assignment. The first is the concept of a shared bus. We've already seen busses in Verilog in terms of an
array of wires or registers, however in the context of our CPU, the word "bus" has another meaning. Our
CPU uses a shared "bus" which is a wire connected between the inputs and outputs of many different
registers, in order to share data:
This can be easily modelled in Verilog, using the inout keyword for a module. This defines a wire which
can act as both an input and an output to our module. However, there is one critically important design
concept which goes along with the usage of shared busses:
Only one signal can be written to a shared bus at any given time.
If two signals are written to the bus, the result is a bus collision, and the value is undefined. In the
simulation, this is represented as "X", but in real life this would cause garbage data to exist on the bus,
potentially corrupting any process which is running on the CPU. As Verilog programmers, it's our job to
ensure that a bus collision never happens. To facilitate this, we'll use a digital logic component called a
tri-state buffer:
The purpose of a tri-state buffer is to act as a switch. If B is 1, then the data can pass through the buffer,
but if not, it simply disconnects the wire, outputting a high-impedance state (in Verilog, denoted by Z).
hz943141
A B C
0 0 Z
0 1 0
1 0 Z
1 1 1
In Verilog, we can't directly write a tri-state buffer, but it can be easily synthesized using a ternary
operator as shown on line 11:
Line 11 shows the construction of a ternary operator to use as a tri-state buffer. If the input called
activate_out is set to 1, then the tri-state buffer is activated, and the circuit will put my_result on the
bus. Otherwise, it puts the value 32'bz on the bus, which will disconnect my_result from the bus, and
allow another circuit to write to the bus.
Note: It is important that during your lab assignment, any circuits which write to the bus have a tri-
state buffer implemented to prevent bus collisions.
Verilog Tasks:
Verilog tasks are like object functions in Java. Read more about them here. You may find that they are
useful in separating the code for instructions within a module, particularly for th.
Highlighted notes on Hybrid Multicore Computing
While doing research work under Prof. Dip Banerjee, Prof. Kishore Kothapalli.
In this comprehensive report, Prof. Dip Banerjee describes about the benefit of utilizing both multicore systems, CPUs with vector instructions, and manycore systems, GPUs with large no. of low speed ALUs. Such hybrid systems are beneficial to several algorithms as an accelerator cant optimize for all parts of an algorithms (some computations are very regular, while some very irregular).
Image Processing Application on Graphics processorsCSCJournals
In this work, we introduce real time image processing techniques using modern programmable Graphic Processing Units GPU. GPU are SIMD (Single Instruction, Multiple Data) device that is inherently data-parallel. By utilizing NVIDIA new GPU programming framework, “Compute Unified Device Architecture” CUDA as a computational resource, we realize significant acceleration in image processing algorithm computations. We show that a range of computer vision algorithms map readily to CUDA with significant performance gains. Specifically, we demonstrate the efficiency of our approach by a parallelization and optimization of image processing, Morphology applications and image integral.
The document provides an overview of introductory GPGPU programming with CUDA. It discusses why GPUs are useful for parallel computing applications due to their high FLOPS and memory bandwidth capabilities. It then outlines the CUDA programming model, including launching kernels on the GPU with grids and blocks of threads, and memory management between CPU and GPU. As an example, it walks through a simple matrix multiplication problem implemented on the CPU and GPU to illustrate CUDA programming concepts.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU, PPU and GPGPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics and scientific research.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs provide massively parallel and multithreaded processing capabilities. GPUs are now commonly used for general purpose computing due to their ability to handle complex computational tasks faster than CPUs in some cases. The document then discusses GPU and PPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics, and scientific research.
Similar to GPU Rigid Body Simulation GDC 2013 (20)
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Project Management Semester Long Project - Acuityjpupo2018
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Webinar: Designing a schema for a Data WarehouseFederico Razzoli
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, which means, denormalised databases where each table represents a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
OpenID AuthZEN Interop Read Out - AuthorizationDavid Brossard
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Mircosoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
2. Welcome to this talk about GPU Rigid Body Dynamics. My name is Erwin Coumans and I
am leading the Bullet Physics SDK project. I have been working on Bullet 100% of my
time for about 10 years now, first as a Sony Computer Entertainment US R&D employee
and currently as an AMD employee.
3. Our initial work on GPGPU acceleration in Bullet was cloth simulation. AMD contributed
OpenCL and DirectX 11 DirectCompute accelerated cloth simulation. This is available in
the Bullet 2.x SDK at http://bullet.googlecode.com
Here is a screenshot of the Samurai Demo with Bullet GPU cloth integrated.
4. More recently one of the engineers in my team worked on GPU accelerated hair
simulation. This work was presented at the Vriphys 2012 conference by Dongsoo Han
and integrated in the Tomb Raider (2013) game under the name TressFX.
5. We started exploring GPU rigid body simulation at Sony Computer Entertainment US
R&D. My SCEA colleague Roman Ponomarev started the implementation in CUDA and
switched over to OpenCL once the SDKs and drivers became available. I’ll briefly discuss
the first generation GPU rigid body pipeline.
When I joined AMD, my colleague Takahiro Harada implemented most of the second
generation GPU rigid body pipeline, around 2010.
From 2011 onwards I have been working on the third generation GPU rigid body pipeline,
with pretty much the same quality as the Bullet 2.x CPU rigid body simulation. This is
work in progress, but the results are already useful.
One of the GPU rigid body demos uses a simplified triangle mesh of the Samurai Cloth
demo, as you can see in this screenshot. The triangle mesh is reduced to around
100,000 triangles (there is no instancing used).
6. Before going into details, the most essential state of a rigid body is its world space
position and orientation. In our case the position defines the center of mass, stored as
a float3. A 4th, unused w-component is often added to make this structure 16-byte
aligned. Such alignment can help SIMD operations.
The orientation defines the inertial basis frame of a rigid body and can be stored as a
3x3 matrix or as a quaternion. For Bullet 2.x we used a 3x3 matrix, but in Bullet 3.x we
switched over to quaternions because they are more memory efficient.
7. The rigid bodies are simulated forward in time using their linear and angular velocity,
with an integrator such as Euler integration. Both the linear and angular velocity are
stored as 3-component floats.
8. If we only look at the position and linear velocity, the C/C++ version is just a one-liner.
This would be sufficient for most particle simulations.
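As a CPU reference, that one-liner amounts to the following (a minimal Python sketch of the slide's C/C++ code; the names are illustrative):

```python
def integrate_positions(positions, linear_velocities, dt):
    """CPU reference for the one-liner position update:
    position += linearVelocity * dt, applied to every body
    (on the GPU, one work item handles one body)."""
    return [
        tuple(p + v * dt for p, v in zip(pos, vel))
        for pos, vel in zip(positions, linear_velocities)
    ]
```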
9. As you can see, the OpenCL version of this kernel is almost identical to the C/C++ version
on the previous slide. Instead of passing the body index through an argument, each OpenCL
work item gets its index using the get_global_id() API.
10. We can apply the gravity in this kernel as well. Alternatively the gravity can be applied
inside the constraint solver, for better quality.
10
11. When we introduce angular velocity and the update of the orientation, the code
becomes a bit more complicated. Here is the OpenCL kernel implementation, with
identical functionality to the Bullet 2.x position and orientation update.
We have the option to apply some damping and to clamp the maximum angular
velocity. For improved accuracy in the case of small angular velocities, we use a Taylor
expansion to approximate the sine function.
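On the CPU, the update can be sketched like this (a Python approximation of a Bullet-style quaternion update; the small-angle threshold and parameter names are illustrative, not the actual kernel code):

```python
import math

def integrate_orientation(q, angvel, dt, angular_damping=0.0, max_angvel=None):
    """Update quaternion q = (x, y, z, w) by angular velocity (rad/s):
    optional damping, optional clamp of the angular speed, and a Taylor
    expansion of sin for small angles to preserve accuracy."""
    wx, wy, wz = (c * (1.0 - angular_damping) for c in angvel)
    speed = math.sqrt(wx * wx + wy * wy + wz * wz)
    if max_angvel is not None and speed > max_angvel:
        scale = max_angvel / speed
        wx, wy, wz = wx * scale, wy * scale, wz * scale
        speed = max_angvel
    angle = speed * dt
    if angle < 1e-3:
        # sin(angle/2)/speed ~= dt/2 - angle^2 * dt / 48 (Taylor expansion)
        s = dt * 0.5 - angle * angle * dt / 48.0
    else:
        s = math.sin(angle * 0.5) / speed
    c = math.cos(angle * 0.5)
    x1, y1, z1, w1 = wx * s, wy * s, wz * s, c   # incremental rotation dq
    x2, y2, z2, w2 = q
    q_new = (w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
             w1 * y2 + y1 * w2 + z1 * x2 - x1 * z2,
             w1 * z2 + z1 * w2 + x1 * y2 - y1 * x2,
             w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2)
    n = math.sqrt(sum(a * a for a in q_new))     # re-normalize
    return tuple(a / n for a in q_new)
```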
12. Here is the host code to set up the arguments and the call to execute the kernel on the
OpenCL device.
If we had no constraints or collisions, we would be finished with our rigid body
simulation.
13. Once we add collision detection and constraint solving, a typical discrete rigid body
simulation pipeline looks as in this slide. The white stages are trivial to parallelize,
or embarrassingly parallel, so we don't spend further time on them in this presentation.
The detection of potential overlapping pairs (blue), the contact point generation and
reduction (yellow) and the constraint solving stages (orange) are the main stages that we
need to deal with.
14. Here is an overview of the 50 kernels of our current third generation GPU rigid body
pipeline. Note that 100% of the rigid body simulation is executed on the OpenCL
GPU/device.
The green kernels are generally useful parallel primitives, such as a parallel radix sort, a
prefix scan and so on. Often they are provided by a hardware vendor, such as AMD, Intel
or NVIDIA. Bullet includes its own implementations of these parallel primitives,
reasonably well optimized for most OpenCL GPU devices.
The kernels in blue are related to finding the potential overlapping pairs, also known as
broadphase collision detection. The yellow kernels are related to contact computation
and reduction, also known as narrowphase collision detection, as well as the bounding
volume hierarchies used for triangle meshes, often called midphase. The constraint
solver kernels are colored orange.
There is still more work to be done in various areas, such as supporting more collision
shape types and constraint types aside from contact constraints.
15. Before going into details how we optimized the various rigid body pipeline stages, here
is a quick overview of a system with Host and Device.
Generally the host will upload the data from Global Host Memory to Global Device
memory, upload the compiled kernel code to the device and enqueue the work
items/groups so that the GPU executes them in the right order.
The connection between host and device (the orange arrow here) is a PCI Express bus in
current PCs. This connection will improve a lot in upcoming devices.
When using an integrated GPU, the global memory might be shared and unified
between host and device.
16. A typical GPU device contains multiple Compute Units, where each Compute Unit can
execute many threads, or work items, in parallel. On an AMD GPU there are usually 64
threads executed in parallel per Compute Unit; we call this a wavefront. On NVIDIA
GPUs there are usually 32 threads, or work items, executed in parallel per Compute
Unit; they call this parallel group of 32 threads a warp.
Each thread, or work item, has private memory. Private memory on a 7970 is a 64KB
register file pool; each work item gets some registers allocated.
Each Compute Unit has shared local memory that can be accessed by all threads in the
Compute Unit. Shared local memory, or LDS in AMD terminology, is 64KB on a 7970.
Local atomics can synchronize operations between threads in a Compute Unit on shared
local memory.
The global device memory can be accessed by all Compute Units. You can use global
atomic operations on global device memory.
Shared Local Memory is usually an order of magnitude faster than global device
memory, so it is important to make use of this for an efficient GPU implementation.
The programmer will distribute work into work groups, and the GPU has the
responsibility to map the work groups to Compute Units. This makes the solution
scalable: we can process our work groups on small devices such as a cell phone
17. with just a single or a few Compute Units, or we can execute the same program on a
high-end discrete GPU with many Compute Units.
18. Here is a quick view of the GPU hardware as presented by the operating system. On the
right we see a more detailed view from the GPU hardware through the OpenCL API. We
can make use of the low level information to optimize our program.
19. The same information is available on Mac OSX or Linux or other OpenCL devices.
20. It appears that an OpenCL implementation is available as an "easter egg" as part of
some OpenGL ES drivers in Android mobile devices such as the Nexus 4 or Nexus 10.
The Playstation 4 also seems suitable for GPGPU development: the system is set up
to run graphics and GPGPU computational code in parallel, without suspending one to
run the other.
Chris Norden says that Sony has worked to carefully balance the two processors to
provide maximum graphics power of 1.843 teraFLOPS at an 800MHz clock speed while
still leaving enough room for computational tasks. The GPU will also be able to run
arbitrary code, allowing developers to run hundreds or thousands of parallelized tasks
with full access to the system's 8GB of unified memory.
21. Here is the first generation of our GPU rigid body pipeline. A lot of simplification is used,
so we could reuse a lot of work that was already available in GPU particle simulation.
22. The uniform grid was already available in some OpenCL SDKs, so we used it as a starting
point for the GPU rigid body pipeline.
23. We use parallel primitives a lot when implementing GPU-friendly algorithms; for
example, a parallel radix sort and a prefix scan are used in the parallel uniform grid solution.
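For example, an exclusive prefix scan over per-cell counts yields the start offset of every grid cell. A serial Python reference of the primitive (the GPU version computes the same result in O(log n) parallel steps):

```python
def exclusive_prefix_scan(values):
    """Exclusive prefix scan: out[i] = sum of values[0..i-1].
    Typical use in a uniform grid: turn per-cell object counts into the
    starting offset of each cell in a packed object array."""
    out = []
    running = 0
    for v in values:
        out.append(running)
        running += v
    return out
```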
24. The contact point generation was also simplified a lot, by approximating collision shapes
using a collection of spheres. The process of creating a sphere collection, or voxelization,
could be performed on CPU or GPU.
25. Here are some screenshots of a simulation of some Stanford bunnies dropping on a
plane.
26. We use an iterative constraint solver based on Projected Gauss Seidel. In a nutshell, the
velocity of bodies needs to be adjusted to avoid violations in position (penetration) and
velocity. This is done by sequential application of impulses.
Pairwise constraints, such as contact constraints between two bodies need to be solved
sequentially in a Gauss Seidel style algorithm, so that the most up-to-date velocity is
available for each constraint.
An alternative would be to use Jacobi style constraint solving, where all constraints use
“old” velocities, so we can trivially parallelize it. The drawback is that Jacobi doesn’t
converge as well as Gauss Seidel.
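The Gauss-Seidel style can be sketched on simplified 1-D contact constraints (an illustrative Python reference of Projected Gauss-Seidel; real contact constraints also involve contact normals, angular terms and friction):

```python
def solve_contacts_pgs(inv_mass, vel, contacts, iterations=10):
    """Projected Gauss-Seidel on simplified 1-D contact constraints.
    vel: per-body scalar velocities along the contact normal.
    contacts: list of (bodyA, bodyB); each demands vel[B] - vel[A] >= 0.
    Impulses are accumulated and clamped non-negative (the 'projection'),
    and each constraint sees the most up-to-date velocities."""
    accumulated = [0.0] * len(contacts)
    for _ in range(iterations):
        for k, (a, b) in enumerate(contacts):
            eff_mass = inv_mass[a] + inv_mass[b]
            if eff_mass == 0.0:          # two static bodies: nothing to do
                continue
            rel_vel = vel[b] - vel[a]
            impulse = -rel_vel / eff_mass
            new_total = max(accumulated[k] + impulse, 0.0)
            delta = new_total - accumulated[k]
            accumulated[k] = new_total
            vel[a] -= inv_mass[a] * delta
            vel[b] += inv_mass[b] * delta
    return vel
```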
27. We cannot trivially update the velocities for bodies in each constraint in parallel,
because we would have write conflicts. For example the contact constraint 1 in the
picture tries to write velocity of object B (and A), while contact constraint 2 tries to
update the same body B (and C). To avoid such write conflicts, we can sort the
constraints in batches, also known as graph coloring. All the constraints in each batch
are guaranteed not to access the same dynamic rigid body.
Note that non-movable or static rigid bodies, with zero mass, can be ignored: their
velocity is not updated.
28. We first implemented the graph coloring, or batch creation, on the CPU host side. The
code is very simple, as you can see. We iterate over all unprocessed constraints, and
keep track of the bodies used in the current batch. If the two bodies in a constraint are still
available for this batch, we assign the batch index to the constraint. Otherwise the
constraint will be processed in a later batch.
There is a small optimization: we can clear the used-body flags once we reach the
device SIMD width, because only 32 or 64 threads are processed in parallel on current
GPU OpenCL devices.
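The batch creation described above can be sketched as follows (an illustrative Python reference; the SIMD-width optimization is only noted as a comment):

```python
def create_batches(constraints, num_bodies, is_static):
    """Greedy batch creation (graph coloring): repeatedly sweep the
    unprocessed constraints, assigning a constraint to the current batch
    only if neither of its dynamic bodies is already used in that batch.
    Static (zero mass) bodies are ignored: their velocity never changes."""
    batch_of = [-1] * len(constraints)
    remaining = list(range(len(constraints)))
    batch = 0
    while remaining:
        used = [False] * num_bodies
        deferred = []
        for c in remaining:
            a, b = constraints[c]
            if (not is_static[a] and used[a]) or (not is_static[b] and used[b]):
                deferred.append(c)       # conflict: process in a later batch
            else:
                batch_of[c] = batch
                if not is_static[a]:
                    used[a] = True
                if not is_static[b]:
                    used[b] = True
        # a real implementation may also clear 'used' once the batch reaches
        # the device SIMD width (32 or 64), since only that many threads
        # actually run in lock-step
        remaining = deferred
        batch += 1
    return batch_of
```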
29. We can execute the same code on the GPU in a single Compute Unit, and use local
atomics to synchronize between the threads. It is not optimal, but a good starting point.
In a later pipeline we have improved the batch creation.
30. The uniform grid works best if all objects are of similar size. The second GPU rigid
body pipeline became more general, so we can deal with objects of different sizes.
Also we use an improved collision shape representation, instead of a sphere collection,
and we can use the full GPU capacity for constraint batch creation and solving.
31. Once we use shapes other than just spheres, it becomes useful to use a simplified
bounding volume for the collision shape. We use axis-aligned bounding boxes on the
GPU, just like we do on the CPU.
32. We can use the support mapping function to generate local space axis aligned bounding
boxes.
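The idea can be sketched with six support-mapping queries, one along each positive and negative coordinate axis (Python, using a brute-force support function over a convex point cloud; real shapes would use their analytic support functions):

```python
def support(vertices, direction):
    """Support mapping of a convex point cloud: the vertex with maximum
    projection onto 'direction'."""
    return max(vertices, key=lambda v: sum(a * b for a, b in zip(v, direction)))

def aabb_from_support(vertices):
    """Local-space AABB from six support queries (+/- each axis)."""
    axes = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    mins = tuple(support(vertices, tuple(-a for a in ax))[i]
                 for i, ax in enumerate(axes))
    maxs = tuple(support(vertices, ax)[i] for i, ax in enumerate(axes))
    return mins, maxs
```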
34. See the book "Collision Detection in Interactive 3D Environments", 2004, by Gino van den
Bergen for more information about the support mapping.
35. The GPU is very suitable to compute the dot product for many vertices in parallel.
Alternatively we can use the GPU cube mapping hardware to approximate the bounding
volume computation.
36. Instead of re-computing the world space bounding volume each time, we can
approximate the world space bounding volume using the pre-computed local space
AABB.
38. Here is the kernel code to compute the world space AABB from the local space AABB.
Basically, the absolute dot product of each row of the rotation matrix with the extents
produces the world space AABB.
Note that the rows of the 3x3 orientation matrix are the axes of the local space
AABB.
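A CPU transcription of the same computation (an illustrative Python sketch, not the actual kernel): the new center is R*center + t, and the new half extents are abs(R)*extent.

```python
def world_aabb(local_min, local_max, rotation, translation):
    """Transform a local-space AABB to a conservative world-space AABB.
    rotation is a row-major 3x3 matrix, translation a 3-vector.
    extent' = abs(R) * extent is the absolute dot product of each
    rotation-matrix row with the local half extents."""
    center = [(lo + hi) * 0.5 for lo, hi in zip(local_min, local_max)]
    extent = [(hi - lo) * 0.5 for lo, hi in zip(local_min, local_max)]
    wc = [sum(rotation[i][j] * center[j] for j in range(3)) + translation[i]
          for i in range(3)]
    we = [sum(abs(rotation[i][j]) * extent[j] for j in range(3))
          for i in range(3)]
    return ([c - e for c, e in zip(wc, we)],
            [c + e for c, e in zip(wc, we)])
```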
39. The uniform grid doesn’t deal well with varying object sizes, so our first attempt to deal
with mixed sized objects was to use multiple broadphase algorithms, dependent on the
object size.
Computation of potential overlapping pairs between small objects can be done using a
uniform grid, while a pair search that involves larger objects could be done on CPU (or
GPU) using a different algorithm.
44. If we revisit the GPU hardware, we note that Compute Units perform their work
independently. So it would be good to generate batches in a way so that we can use
multiple independent Compute Units.
45. We perform the batching of constraints in two stages: the first stage splits the
constraints so that they can be processed by independent Compute Units. We use a
uniform grid for this, where non-neighboring cells can be processed in parallel.
46. The second stage batching within a Compute Unit is similar to the first GPU rigid body
pipeline, but we added some optimizations. Takahiro Harada wrote a SIGGRAPH paper
with more details.
47. We typically use around 4 to 10 iterations in the solver, and in each iteration we have
around 10 to 15 batches. The large number of kernel launches can become a
performance bottleneck, so we implemented a GPU-side solution that only requires a
single kernel launch per iteration.
48. Our third generation GPU rigid body pipeline has the same quality and is as general as
the regular discrete Bullet 2.x rigid body pipeline on the CPU.
49. The regular 3-axis sweep and prune broadphase pair search algorithm incrementally
updates the sorted AABBs for each of the 3 world axes in 3D.
We sort the begin and end points of each AABB to their new positions, one object at a
time, using swap operations. We incrementally add or remove pairs during those swap
operations.
This exploits spatial and temporal coherence: objects don't move much between frames.
This process is difficult to parallelize due to data dependencies: changing a globally
shared data structure would require locking.
50. We modify the 3D axis sweep and prune algorithm to make it more suitable for GPU,
while keeping the benefits of the incremental pair search during the swaps.
51. Instead of performing the incremental sort and swap together, we perform the sorting in
parallel as one stage, and perform the swaps in a separate stage using read-only
operations.
We maintain the previous sorted axis and compare it with the updated sorted axis. Each
object can perform the swaps by traversing the previous and current sorted elements,
without modifying the data.
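The swap criterion can be illustrated with a quadratic CPU reference (Python; the actual implementation walks the previous and current sorted arrays far more efficiently, but the detected pairs are the same):

```python
def find_swaps(prev_order, curr_order):
    """Read-only swap detection: compare the previous sorted axis with the
    current one. Pairs whose relative order flipped between the two arrays
    are exactly the pairs that would have swapped in the incremental
    sweep-and-prune. Each object's comparisons are independent, so on the
    GPU one work item can handle one object without modifying any data."""
    prev_rank = {obj: i for i, obj in enumerate(prev_order)}
    curr_rank = {obj: i for i, obj in enumerate(curr_order)}
    swaps = set()
    objs = list(prev_order)
    for i, a in enumerate(objs):
        for b in objs[i + 1:]:
            before = prev_rank[a] < prev_rank[b]
            after = curr_rank[a] < curr_rank[b]
            if before != after:
                swaps.add(frozenset((a, b)))
    return swaps
```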
Although unusual, we can detect the rare degenerate cases that would lead to too many
swaps in the 3-axis SAP algorithm, and fall back to another broadphase.
Such a fallback is not necessary in most practical simulations. Still, it can generally be a
good idea to mix multiple broadphase algorithms to exploit the best properties of
each.
52. For convex objects we use the separating axis test, Sutherland-Hodgman clipping
and contact reduction to generate contacts.
Concave compound shapes and concave triangle meshes produce pairs of convex
shapes, which are processed alongside the convex-convex pairs produced by the
broadphase.
55. The contact clipping can generate a lot of contact points. We reduce the number of
contacts between two convex polyhedra to a maximum of 4 points.
We always keep the contact point with the deepest penetration, to make sure the
simulation gets rid of penetrations.
We keep 3 other contact points with maximum or minimum projections along two
orthogonal axes in the contact plane.
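A simplified sketch of this reduction (Python; the tangent-axis construction and tie-breaking are illustrative, not Bullet's exact scheme):

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reduce_contacts(points, depths, normal):
    """Return indices of at most 4 kept contact points: the deepest one,
    plus points with extremal projections along two orthogonal axes in
    the contact plane."""
    if len(points) <= 4:
        return list(range(len(points)))
    # two tangent axes spanning the contact plane
    ref = (1.0, 0.0, 0.0) if abs(normal[0]) < 0.9 else (0.0, 1.0, 0.0)
    t1 = cross(normal, ref)
    t2 = cross(normal, t1)
    kept = [max(range(len(points)), key=lambda i: depths[i])]  # deepest
    for key in (lambda i: dot(points[i], t1),    # max along t1
                lambda i: -dot(points[i], t1),   # min along t1
                lambda i: dot(points[i], t2)):   # max along t2
        i = max((i for i in range(len(points)) if i not in kept), key=key)
        kept.append(i)
    return kept
```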
56. The convex-convex contact generation can be used between convex objects, but also
between the convex child shapes in a compound or for convex against individual
triangles of a triangle mesh.
We unify the pairs generated by the broadphase pair search, compound child pair search
and concave triangle mesh pairs so they can use the same narrowphase convex collision
detection routines.
The code for the separating axis test, contact clipping and contact reduction is very
large. This leads to very inefficient GPU usage, because of thread divergence and register
spill.
We break the convex-convex contact generation up into a pipeline that consists of
many stages (kernels).
57. When a convex or compound shape collides against a concave triangle mesh, we
perform a BVH query using the world space AABB of the convex or compound shape,
against the precomputed BVH tree of the triangle mesh.
We optimized the BVH tree traversal to make it GPU friendly. Instead of recursion, we
iteratively traverse over an array of nodes.
We divide the BVH tree into smaller subtrees that fit in shared local memory, so they
can be processed by multiple threads in a work group.
58. Instead of Projected Gauss Seidel, we can also use the iterative Jacobi scheme to solve
the constraints.
Jacobi converges slower than PGS. We can improve the convergence by splitting the
objects into multiple “split bodies” for each contact.
Then, after solving an iteration using the Jacobi solver applied to the split bodies, we
average the velocities for all the split bodies and update the “original” body.
The idea is explained in the “Mass Splitting for Jitter-Free Parallel Rigid Body Simulation”
SIGGRAPH 2012 paper.
We can improve upon this scheme by mixing PGS and Jacobi even further. We know that
within a Compute Unit, only 64 threads are executed in parallel,
so we can only apply Jacobi within the active threads of a wavefront/warp and use PGS
outside (by updating the rigid body positions using the averaged split bodies).
59. We use plain OpenGL 3.x and a simple user interface called GWEN (GUI Without
Extravagant Nonsense), with some cross-platform classes to open an OpenGL 3.x context
on Windows, Linux and Mac OSX.
61. We performed some preliminary timings to make sure that our GPU rigid body pipeline
works well on GPU hardware from various vendors.
Note that those numbers don't give a good indication yet, because we haven't optimized
the GPU rigid body pipeline well enough.
62. Here are some preliminary performance timings. Note that we haven’t optimized the
GPU rigid body pipeline well, so the numbers can be improved a lot.
63. We test our GPU rigid body pipeline on Windows, Mac OSX and Linux using AMD,
NVIDIA and Intel GPUs. The source code will be available at the github repository.