qCUDA-ARM : Virtualization for Embedded GPU Architectures 柏瑀 黃
This document discusses qCUDA-ARM, a virtualization solution for embedded GPU architectures. It provides an overview of edge computing challenges, GPU virtualization methods like API remoting, and the design and implementation of qCUDA-ARM. Key aspects covered include using address space conversion to enable zero-copy memory between guest and host, a pinned memory allocation mechanism for ARM, and improvements to memory bandwidth when copying data. The document evaluates qCUDA-ARM's performance on various benchmarks and applications compared to rCUDA and native CUDA.
The document discusses Compute Unified Device Architecture (CUDA), which is a parallel computing platform and programming model created by Nvidia that allows software developers to use GPUs for general-purpose processing. It provides an overview of CUDA, including its execution model, implementation details, applications, and advantages/drawbacks. The document also covers CUDA programming, compiling CUDA code, CUDA architectures, and concludes that CUDA has brought significant innovations to high performance computing.
1. CUDA provides a programming environment and APIs that allow developers to leverage GPUs for general purpose computing. The CUDA C API offers both a high-level runtime API and a lower-level driver API.
2. CUDA programs define kernels that execute many parallel threads on the GPU. Threads are organized into blocks that can cooperate through shared memory, and blocks are organized into grids.
3. The CUDA memory model includes a hierarchy from fast per-thread registers to slower shared, global, and host memories. This hierarchy allows threads within blocks to communicate efficiently through shared memory.
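The indexing scheme in point 2 can be illustrated with a small CPU-side sketch (plain Python standing in for a CUDA kernel, not real device code; the name vector_add is my own): each simulated thread derives a global index from its block and thread coordinates, just as a kernel does with blockIdx, blockDim, and threadIdx.

```python
# CPU-side sketch of CUDA's block/thread indexing (illustrative, not real CUDA).
# A 1-D grid of `num_blocks` blocks, each with `block_dim` threads, covers n elements.

def vector_add(a, b, block_dim, num_blocks):
    n = len(a)
    out = [0.0] * n
    for block_idx in range(num_blocks):              # the grid of blocks
        for thread_idx in range(block_dim):          # threads within one block
            i = block_idx * block_dim + thread_idx   # global index, as in a kernel
            if i < n:                                # guard for out-of-range threads
                out[i] = a[i] + b[i]
    return out

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
# 2 blocks of 3 threads -> 6 simulated threads cover 5 elements
print(vector_add(a, b, block_dim=3, num_blocks=2))  # [11.0, 22.0, 33.0, 44.0, 55.0]
```

The out-of-range guard mirrors the `if (i < n)` test found in nearly every real CUDA kernel, since the grid usually overshoots the data size.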
The document provides an overview of GPU computing and CUDA programming. It discusses how GPUs enable massively parallel and affordable computing through their manycore architecture. The CUDA programming model allows developers to accelerate applications by launching parallel kernels on the GPU from their existing C/C++ code. Kernels contain many concurrent threads that execute the same code on different data. CUDA features a memory hierarchy and runtime for managing GPU memory and launching kernels. Overall, the document introduces GPU and CUDA concepts for general-purpose parallel programming on NVIDIA GPUs.
This document provides an overview of CUDA architecture and programming. It discusses key CUDA concepts like the host/device model, CUDA C extensions, GPU memory management, and parallel programming using CUDA threads and blocks. CUDA allows developers to speed up applications by offloading work to the GPU. It provides a scalable parallel programming model that maps threads to GPU threads to express data-level parallelism across thousands of lightweight threads for applications like high-bandwidth computing and visual computing.
This document provides an introduction to the CUDA parallel computing platform from NVIDIA. It discusses CUDA hardware capabilities including GPUDirect, Dynamic Parallelism, and Hyper-Q. It then outlines three main programming approaches for CUDA: using libraries, OpenACC directives, and programming languages. It provides examples of libraries like cuBLAS and cuRAND. For OpenACC, it shows how to add directives to existing Fortran/C code to parallelize loops. For languages, it lists supported options like CUDA C/C++, CUDA Fortran, and Python with PyCUDA. The document aims to give developers maximum flexibility in choosing the best approach to accelerate their applications using CUDA and GPUs.

This document provides an overview of CUDA (Compute Unified Device Architecture) and GPU programming. It begins with definitions of CUDA and GPU hardware architecture. The history of GPU development from basic graphics cards to modern programmable GPUs is discussed. The document then covers the CUDA programming model including the device model with multiprocessors and threads, and the execution model with grids, blocks and threads. It includes a code example to calculate squares on the GPU. Performance results are shown for different GPUs on a radix sort algorithm. The document concludes that GPU computing is powerful and will continue growing in importance for applications.
This presentation describes the components of the GPU compute ecosystem, provides an overview of existing ecosystems, and contains a case study on NVIDIA Nsight.
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar - AMD Developer Central
This deck presents highlights from the Introduction to OpenCL™ Programming Webinar presented by Acceleware & AMD on Sept. 17, 2014. Watch a replay of this popular webinar on the AMD Dev Central YouTube channel here: https://www.youtube.com/user/AMDDevCentral or here for the direct link: http://bit.ly/1r3DgfF
We’re sure you’ve heard by now about the release of Qt 6.0 — but are you ready? Join the conversation with our Qt Engineers to discuss the present and future of Qt 6. Ranging from API updates to porting tips, we’ll save you valuable time by giving a shortcut overview of changes that have made it into this first version.
The document provides a history of GPUs and GPGPU computing. It describes how GPUs evolved from fixed hardware for graphics to programmable hardware. This allowed general purpose computing on GPUs (GPGPU). It discusses the development of GPGPU languages and APIs like CUDA, OpenCL, and DirectCompute. The anatomy of a modern GPU is explained, highlighting its massively parallel architecture. Typical GPGPU execution and memory models are outlined. Usage of GPGPU for applications like graphics, physics, computer vision, and HPC is mentioned. Leading GPU vendors and their products are briefly introduced.
This is a presentation I gave at the last GPGPU workshop we held in April 2013.
The usage of GPGPU is expanding, creating a continuum from mobile to HPC. At the same time, the question is whether the GPGPU languages are the right ones (well, no) and whether we aren't wasting resources re-developing the same SW stack instead of converging.
This document discusses AMD's DirectGMA technology, which allows direct access to GPU memory from other devices. It introduces DirectGMA and explains how it enables peer-to-peer transfers between GPUs, and between GPUs and FPGAs. It then provides details on implementing DirectGMA in APIs like OpenGL, OpenCL, and DirectX 9, 10, and 11 to enable efficient data transfers without CPU involvement.
This presentation accompanies the webinar replay located here: http://bit.ly/1zmvlkL
AMD Media SDK Software Architect Mikhail Mironov shows you how to leverage an AMD platform for multimedia processing using the new Media Software Development Kit. He discusses how to use a new set of C++ interfaces for easy access to AMD hardware blocks, and shows you how to leverage the Media SDK in the development of video conferencing, wireless display, remote desktop, video editing, transcoding, and more.
Pmemkv is an open source key-value store for persistent memory based on the Persistent Memory Development Kit (PMDK). Written in C and C++, it provides optimized bindings for Java*, JavaScript*, and Ruby*, and includes multiple storage engines for different use cases.
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA and AMD GPUs - Stefano Di Carlo
These slides have been presented by Dr. Alessandro Vallero at the IEEE VLSI Test Symposium, San Francisco, CA, USA (April 22-25, 2018).
General purpose computing on graphics processing units offers a remarkable speedup for data-parallel workloads by leveraging the GPU's computational power. However, unlike graphics computing, it requires highly reliable operation in most application domains.
This presentation talks about “Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA and AMD GPUs”. The work is the outcome of a collaboration between the TestGroup of Politecnico di Torino (http://www.testgroup.polito.it) and the Computer Architecture Lab of the University of Athens (dscal.di.uoa.gr), started under the FP7 CLERECO Project (http://www.clereco.eu). It presents an extended study, based on a consolidated workflow, of the reliability of four GPU architectures and corresponding chips in correlation with their performance: AMD Southern Islands and NVIDIA G80/GT200/Fermi. We obtained reliability measurements (AVF and FIT) employing both fault injection and ACE analysis based on microarchitecture-level simulators. Apart from the reliability-only and performance-only measurements, we propose combined metrics for performance and reliability (quantifying instruction throughput or task execution throughput between failures) that enable comparisons for the same application among GPU chips of different ISAs and vendors, as well as among benchmarks on the same GPU chip.
Watch the presentation at: https://youtu.be/GV5xRDgfCw4
Paper Information:
Alessandro Vallero§ , Sotiris Tselonis, Dimitris Gizopoulos* and Stefano Di Carlo§, “Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA and AMD GPUs”, IEEE VLSI Test Symposium 2018 (VTS 2018), San Francisco, CA (USA), April 22-25, 2018.
§Politecnico di Torino, Italy. Email: stefano.dicarlo@polito.it, alessandro.vallero@polito.it. *University of Athens, Greece. Email: dgizop@di.uoa.gr
These are the slides of my talk “WebKit and GStreamer” at the GStreamer Conference 2013, co-hosted with LinuxCon.
The HTML5 version with its effects can be found at http://people.igalia.com/xrcalvar/talks/20131022-GstConf-WebKit
Mantle allows Battlefield 4 to significantly improve CPU and GPU performance compared to DirectX 11. The game utilizes Mantle's low-level access to optimize shader compilation, pipeline state management, asynchronous compute and memory handling. Multi-GPU rendering is supported through Alternate Frame Rendering where resources are duplicated and updated asynchronously across GPUs.
WT-4072, Rendering Web Content at 60fps, by Vangelis Kokkevis, Antoine Labour and Brian Salomon - AMD Developer Central
Presentation WT-4072, Rendering Web Content at 60fps, by Vangelis Kokkevis, Antoine Labour and Brian Salomon at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by Mikael Sevenier - AMD Developer Central
Presentation WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by Mikael Sevenier, at the AMD Developer Summit (APU13) November 11-13, 2013.
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms - AMD Developer Central
Presentation CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman at the AMD Developer Summit (APU13) November 11-13, 2013.
This document provides an outline of manycore GPU architectures and programming. It introduces GPU architectures, the GPGPU concept, and CUDA programming. It discusses the GPU execution model, CUDA programming model, and how to work with different memory types in CUDA like global, shared and constant memory. It also covers streams and concurrency, CUDA intrinsics and libraries, performance profiling and debugging. Finally, it mentions directive-based programming models like OpenACC and OpenMP.
A graphics processing unit, or GPU (also occasionally called a visual processing unit, or VPU), is a specialized microprocessor that offloads and accelerates graphics rendering from the central processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a CPU, only a fraction of the chip does computations, whereas a GPU devotes more transistors to data processing.
GPGPU is a programming methodology based on modifying algorithms to run on existing GPU hardware for increased performance. Unfortunately, GPGPU programming is significantly more complex than traditional programming for several reasons.
GPUs have evolved from graphics cards to platforms for general purpose high performance computing. CUDA is a programming model that allows GPUs to execute programs written in C for general computing tasks using a single-instruction multiple-thread model. A basic CUDA program involves allocating memory on the GPU, copying data to the GPU, launching a kernel function that executes in parallel across threads on the GPU, copying results back to the CPU, and freeing GPU memory.
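The launch-configuration arithmetic implied by this flow is worth making explicit. A minimal sketch in plain Python (grid_size is a hypothetical helper name, not a CUDA API): the host picks the number of blocks by ceiling division so that every element gets a thread.

```python
# Host-side launch arithmetic for the flow described above (plain Python sketch).
# The standard CUDA idiom for sizing the grid is ceiling division:
#   num_blocks = (n + threads_per_block - 1) // threads_per_block

def grid_size(n, threads_per_block):
    """Smallest number of blocks whose combined threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block

print(grid_size(1000, 256))  # 4: 4 * 256 = 1024 threads cover 1000 elements
print(grid_size(1024, 256))  # 4: an exact fit needs no extra block
print(grid_size(1, 256))     # 1: even one element occupies a whole block
```

The last 24 threads of the 1000-element case do no work, which is why kernels carry the `if (i < n)` bounds check.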
Part 4: Maximizing the utilization of GPU resources on-premise and in the cloud - Univa, an Altair Company
This document discusses scheduling GPUs on premise and in the cloud with Grid Engine. It covers challenges in using GPUs with Grid Engine, how applications interact with GPUs, configuring metadata for GPUs in Grid Engine, GPU and CPU binding, managing environments and containers for GPUs, accounting for GPU usage in Grid Engine, and an example workflow for setting everything up. It also previews upcoming improvements in Grid Engine for better support of GPUs like the A100 and MIGs.
Utilizing AMD GPUs: Tuning, programming models, and roadmap - George Markomanolis
A presentation at FOSDEM 2022 about AMD GPUs, tuning, programming models, and the software roadmap. This is a continuation of the previous talk (FOSDEM 2021).
Graphics Processing Units (GPUs) are specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. GPUs have evolved from simple video controllers to massively parallel multi-core processors capable of general purpose computing. Modern GPUs use a single-instruction, multiple-thread (SIMT) architecture and have a parallel programming model that maps computations to thousands of threads to maximize throughput. Programming frameworks like CUDA and OpenCL allow general purpose programming on GPUs by mapping algorithms to their highly parallel structure.
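The SIMT idea described here can be sketched on the CPU in a few lines of Python (warp_execute is an illustrative stand-in, not how the hardware actually works): one operation, applied in lockstep across the lanes of a warp, each lane holding its own data.

```python
# Sketch of the SIMT idea: a single instruction stream applied across many
# threads, each with its own data (illustrative only; not real GPU execution).

WARP_SIZE = 4  # real NVIDIA warps have 32 lanes; 4 keeps the example small

def warp_execute(op, *lane_inputs):
    # Every lane runs the SAME op, each on its own per-lane operands.
    return [op(*(arg[lane] for arg in lane_inputs)) for lane in range(WARP_SIZE)]

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
print(warp_execute(lambda a, b: a + b, x, y))  # [11, 22, 33, 44]
```

This lockstep model is also why divergent branches within a warp cost performance on real hardware: lanes taking different paths must be serialized.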
The document provides an overview of introductory GPGPU programming with CUDA. It discusses why GPUs are useful for parallel computing applications due to their high FLOPS and memory bandwidth capabilities. It then outlines the CUDA programming model, including launching kernels on the GPU with grids and blocks of threads, and memory management between CPU and GPU. As an example, it walks through a simple matrix multiplication problem implemented on the CPU and GPU to illustrate CUDA programming concepts.
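The matrix-multiplication example described here can be mimicked on the CPU with plain Python (matmul_blocked is my own illustrative name): the outer loops mirror CUDA's 2-D grid/block decomposition, with each simulated thread computing one output element.

```python
# CPU sketch of a CUDA-style matrix multiply: a 2-D grid of blocks, each
# "thread" computing one output element C[row][col] (illustrative only).

def matmul_blocked(A, B, block=2):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    grid_y = (n + block - 1) // block            # blocks along rows
    grid_x = (p + block - 1) // block            # blocks along columns
    for by in range(grid_y):
        for bx in range(grid_x):                 # one "thread block"
            for ty in range(block):
                for tx in range(block):          # one "thread"
                    row, col = by * block + ty, bx * block + tx
                    if row < n and col < p:      # bounds guard, as in a kernel
                        C[row][col] = sum(A[row][k] * B[k][col] for k in range(m))
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_blocked(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

On a real GPU the per-block loops run concurrently, and a tuned kernel would additionally stage tiles of A and B in shared memory; this sketch shows only the work decomposition.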
Presentation I gave at the SORT Conference in 2011. It was generalized from some work I had done using GPUs to accelerate image processing at FamilySearch.
GPU computing provides a way to access the power of massively parallel graphics processing units (GPUs) for general purpose computing. GPUs contain over 100 processing cores and can achieve over 500 gigaflops of performance. The CUDA programming model allows programmers to leverage this parallelism by executing compute kernels on the GPU from their existing C/C++ applications. This approach democratizes parallel computing by making highly parallel systems accessible through inexpensive GPUs in personal computers and workstations. Researchers can now explore manycore architectures and parallel algorithms using GPUs as a platform.
Presentation delivered at the 2017 LinuxCon China.
Container is a popular technology in the cloud, with the merits of fast provisioning, high density, and near-native performance. New requirements have been rising to further use GPUs to accelerate various applications in containers, e.g. media transcoding, machine learning, etc. However, there are still gaps in managing GPUs in such container usages. This presentation will first review the status of GPU support in both native containers and Intel Clear Containers, then provide an anatomy of the technical gaps and enabling plans in major areas (orchestration layer, resource isolation, and QoS performance isolation), then review our work on GPU virtualization in Clear Containers and prototype work, through extensions in the Docker plugin, cgroups, and the GPU scheduler, to fix those gaps.
Kato Mivule: An Overview of CUDA for High Performance Computing
This document provides an overview of CUDA (Compute Unified Device Architecture), a parallel computing platform developed by NVIDIA that allows programming of GPUs for general-purpose processing. It outlines CUDA's process flow of copying data to the GPU, running a kernel program on the GPU, and copying results back to CPU memory. It then demonstrates CUDA concepts like kernel and thread structure, memory management, and provides a code example of vector addition to illustrate CUDA programming.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU, PPU and GPGPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics and scientific research.
This document summarizes VPU and GPGPU technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU architecture including stream processing, graphics pipelines, shaders, and GPU clusters. It provides an example of using CUDA for GPU computing and discusses how GPUs are used for general purpose computing through frameworks like CUDA.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs provide massively parallel and multithreaded processing capabilities. GPUs are now commonly used for general purpose computing due to their ability to handle complex computational tasks faster than CPUs in some cases. The document then discusses GPU and PPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics, and scientific research.
This lecture discusses manycore GPU architectures and programming, focusing on the CUDA programming model. It covers GPU execution models, CUDA programming concepts like threads and blocks, and how to manage GPU memory including different memory types like global and shared memory. It also discusses optimizing memory access patterns for global memory and profiling CUDA programs.
XPDDS17 Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen... - The Linux Foundation
With the growing interest in virtualization from big players around the world, more and more companies choose ARM SoCs as their target platform for running server environments. It is also known that the majority of such SoCs come with a broad range of coprocessors on the die, e.g. GPU, DSP, security, etc. But at the moment, the only way to speed up guests with these is either a para-virtualized approach or dedicating that HW to a specific guest.
The shared coprocessor framework for Xen aims to let all guest OSes benefit from this companion HW with ease while running unmodified software and/or firmware on the guest side. You don't need to worry about setting up IO ranges, interrupts, scheduling, etc.: it is all covered, making support for new shared HW much faster.
As an example of the shared coprocessor framework usage a virtualized GPU will be shown.
This document provides an overview of the CUDA programming model for parallel computing on GPUs. It describes key CUDA concepts like the host/device memory model, threads and blocks, and how to define and launch kernels. It also provides examples of CUDA APIs for memory management and host-device data transfer. The document aims to introduce the basic features of the CUDA programming model to developers.
Automatic generation of platform architectures using open cl and fpga roadmapManolis Vavalis
This document discusses using OpenCL to automatically generate platform architectures for FPGAs. It introduces FPGAs and their architecture, then discusses how OpenCL can be used as a hardware description language. The Silicon OpenCL (SOpenCL) tool flow is presented, which takes an unmodified OpenCL application and converts it into an FPGA system design with hardware and software components. Key steps in SOpenCL include code transformations, granularity management, and architectural synthesis to generate customized FPGA accelerators from OpenCL kernels. Monte Carlo simulations are provided as an example of an application that could exploit multiple levels of parallelism on FPGAs using this approach.
Similar to S0333 gtc2012-gmac-programming-cuda (20)
The document discusses preliminary findings regarding the Xbox Series X (XSX) architecture based on information from AMD slides and cache sizes from RDNA1. It notes that in the RDNA1 (Navi10) architecture, two compute engines form a workgroup processor, and five workgroup processors form an asynchronous compute engine. It provides details on cache and memory sizes per workgroup processor and compute unit from AMD slides. It also notes similarities between the XSX and RDNA1 structures but that the number of compute units per group may be customized and references sources showing six compute units per group.
Dark Secrets: Heterogeneous Memory Models is a document about the challenges of memory consistency models across heterogeneous systems like OpenCL. It discusses three main issues with current models: coarse-grained access to data, separate addressing between devices and hosts, and weak/undefined ordering controls. It also covers basics of the OpenCL 2.0 memory model, which aims to address these issues by allowing shared virtual addressing between devices and hosts, finer-grained synchronization using events, and atomic operations for ordering. However, memory models still have limitations around data races and forward progress that require careful programming.
The document discusses DirectX 12 and how it allows for more efficient use of resources and parallel execution across multiple GPUs. It provides examples of how DirectX 12 enables features like allocating resources once and sharing them across GPUs, allowing the GPUs to process frames in parallel to improve performance. New API features in DirectX 12 give developers more control over hardware compared to earlier APIs.
DirectX 12 provides major performance improvements over previous versions by optimizing the entire graphics stack and distributing work more efficiently between the application, engine, driver, operating system, and GPU. It uses a multi-threaded execution model with asynchronous resource access and command buffers, and supports features like execute indirect, multi-engine/multi-adapter for scalability. Hardware support is organized into feature levels to identify supported capabilities independent of the API version. Tools are integrated to help developers optimize their games and applications for DirectX 12.
This document discusses strategies for dealing with asynchronous operations in a graphics application using APIs like Vulkan and DirectX 12. It recommends planning out frame submission to separate command queues for different tasks like resource creation and deletion. Global and per-frame resources are tracked to determine when they need to be committed or uncommitted. Readbacks from the GPU are queued and handled asynchronously to avoid stalling. Presentation requires special handling on Windows to avoid threading issues. Benchmark results show large potential performance gains from next-generation APIs and hardware.
The technology behind_the_elemental_demo_16x9-1248544805mistercteam
This document discusses advances in real-time rendering techniques used in the Unreal Engine 4 Elemental demo. It describes technologies like voxel cone tracing for indirect lighting, deferred shading, screen space ambient occlusion, image-based lens flares, HDR histograms for eye adaptation, and more. The presentation provides technical details and comparisons of these techniques for real-time rendering of complex 3D scenes.
This lecture discusses advanced SRAM technologies including FinFET-based SRAM issues and alternatives. It begins with fundamentals of 6T SRAM design and reviews state-of-the-art SRAM performance. Key challenges for FinFET-based SRAM include variability impacts and design tradeoffs. The lecture explores techniques to improve SRAM stability for FinFETs and reviews Intel's 22nm tri-gate SRAM technology. Finally, it discusses SRAM alternatives such as 8T cells, SDRAM, and context memory to address scaling challenges.
This document provides an overview of the "Beyond Programmable Shading" course being offered at the University of Washington in Winter 2011. The course aims to teach students about the state-of-the-art in real-time rendering hardware, programming models, and algorithms. It will cover GPU and CPU architectures, parallel programming models for graphics, and the latest real-time rendering techniques. The course will be taught by Aaron Lefohn from Intel and Mike Houston from AMD.
This document discusses using DirectCompute for cloth simulation. It describes how cloth simulation works by computing forces between connected particles and updating their positions over multiple iterations. It outlines the steps of computing forces based on spring lengths, applying position corrections from forces, and computing new positions. It also discusses representing springs and masses, and how a CPU typically handles the simulation by updating links sequentially to conserve momentum. DirectCompute can accelerate this by processing multiple particles in parallel.
This document describes a primitive processing and advanced shading architecture for embedded systems. It features a vertex cache and programmable primitive engine that can process fixed and variable size primitives with reduced memory bandwidth requirements. The architecture includes a configurable per-fragment shader that supports various shading models using dot products and lookup tables stored on-chip. This hybrid design aims to bring appealing shading to embedded applications while meeting limitations on gate size, power consumption, and memory traffic growth.
This document provides recommendations for optimizing DirectX 11 performance. It separates the graphics pipeline process into offline and runtime stages. For the offline stage, it recommends creating resources like buffers, textures and shaders on multiple threads. For the runtime stage, it suggests culling unused objects, minimizing state changes, and pushing commands to the driver quickly. It also provides tips for updating dynamic resources efficiently and grouping related constants together. The goal is to keep the CPU and GPU pipelines fully utilized for maximum performance.
This document provides an overview of the Mantle programming guide and API reference. It discusses Mantle's software infrastructure and execution model. It also summarizes Mantle's approach to memory management, objects, pipelines, shaders, and error handling. The guide describes basic Mantle operations like GPU initialization and resource creation. It also covers topics like queues, command buffers, compute and graphics operations, synchronization, and shader atomics.
The document discusses key differences between Direct3D 11 and the new Direct3D 12 API. D3D12 provides lower-level access to the hardware for increased control and performance. It utilizes command lists, resource binding tables, explicit synchronization, and multiple queues to allow for greater parallelism and efficiency. Developers now have responsibility for correct rendering but also more opportunities to optimize through techniques like command list threading and avoiding redundant state changes.
D3D12 introduces a new API and underlying philosophy that is closer to the hardware for increased efficiency and performance. It trades off some safety for more control and power. The software model has changed significantly from D3D11 with features like command lists for parallelism, explicit resource binding and memory management, and removing implicit synchronization. This provides opportunities for higher performance but also new complexity to consider in areas like profiling and debugging.
This document discusses advancements in tiled-based compute rendering. It describes current proven tiled rendering techniques used in games. It then discusses opportunities for improvement like using parallel reduction to calculate depth bounds more efficiently than atomics, improved light culling techniques like modified Half-Z, and clustered rendering which divides the screen into tiles and slices to reduce lighting workloads. The document concludes clustered shading has potential savings on culling and offers benefits over traditional 2D tiling.
S0333 gtc2012-gmac-programming-cuda
1. Javier Cabezas (BSC/Multicoreware)
Isaac Gelado (BSC/Multicoreware)
Lluís Vilanova (BSC)
Nacho Navarro (UPC/BSC)
Wen-Mei Hwu (UIUC/Multicoreware)
NVIDIA GTC, May 17th, 2012
GMAC 2
Easy and efficient programming for CUDA-based systems
2. NVIDIA GPU Technology Conference – May, 17th , 2012 2
• BSC develops a wide range of High Performance Computing applications
• Alya (Computational Mechanics), RTM (Seismic imaging), Meteorological modeling, Air quality, …
• Explore the trends in computer architecture (GPUs, FPGAs, …)
• Programming models research group
• OmpSs: task-based programming model that exploits parallelism at different levels (multi-core, multi-GPU, multi-node, …)
• Accelerators research group
• System support for accelerators
• Collaboration with UIUC and Multicoreware Inc.
└Barcelona Supercomputing Center (BSC)
Introduction
3.
• Goal: programmability of heterogeneous parallel architectures
• Task-based programming model
• Simple pragma annotations
• Automatic task dependency tracking
• Needs support for a growing number of devices, APIs and platforms
└OmpSs
Introduction
#pragma omp task inout([NB*NB] A)
void spotrf_tile(float *A, long NB);
#pragma omp task input([NB*NB] A, [NB*NB] B) inout([NB*NB] C)
void sgemm_tile(float *A, float *B, float *C, u_long NB);
#pragma omp task input([NB*NB] A) inout([NB*NB] C)
void syrk_tile(float *A, float *C, long NB);

for (int j = 0; j < N; j += BS) {
    for (int k = 0; k < j; k += BS)
        for (int i = j+BS; i < N; i += BS)
            sgemm_tile(BS, N, &A[k][i], &A[k][j], &A[j][i]);
    for (int i = 0; i < j; i += BS)
        ssyrk_tile(BS, N, &A[i][j], &A[j][j]);
    spotrf_tile(BS, N, &A[j][j]);
    for (int i = j+BS; i < N; i += BS)
        strsm_tile(BS, N, &A[j][j], &A[j][i]);
}
4.
• GMAC provides an easy system-level programming model for accelerator-based systems
• Shared-memory model
• Single pointer per allocation
• No explicit copies
• Portability across systems
• Hardware capabilities
• System configuration
• Portability across platforms
• CUDA, OpenCL
• Windows, Linux, MacOS X
└GMAC
Introduction
void vec_add(size_t N)
{
    float *a, *b, *c;
    gmacMalloc(&a, N * sizeof(float));
    gmacMalloc(&b, N * sizeof(float));
    gmacMalloc(&c, N * sizeof(float));
    // Runs on the CPU
    init_array(a, N);
    init_array(b, N);
    // Runs on the GPU: N/512 blocks of 512 threads
    add_kernel<<<N/512, 512>>>(c, a, b, N);
}
5.
• Since last GTC, many exciting things have happened
• CUDA 4
• Fermi/Kepler
• UVAS
• Peer-to-peer memory transfers
• Host threads can access any device memory (cudaMemcpy)
• …
• … but most of them were already implemented in GMAC
• So we have been going forward
└GMAC
Introduction
6.
• Programmability issues in CUDA systems
• GMAC 2
• Use cases
• Conclusions
Outline
7.
• New memory addressing capabilities
• Private memories
• Copies between memories through PCIe
• Host-mapped memory (Compute Capability 1.1)
• e.g. Output buffers in systems with discrete GPUs
• Avoid unnecessary copies in shared-memory systems (e.g. ION)
• Peer-memory access (Compute Capability 2.0)
• Avoid inter-GPU copies
• Only for GPUs on the same PCIe bus
└A universe of capabilities
[Diagram: a CPU with DDR memory and discrete GPUs with GDDR memory, interconnected over PCIe]
Programmability issues in CUDA systems
8.
• New memory addressing capabilities (performance)
└A universe of capabilities
Programmability issues in CUDA systems
[Charts: execution time (ms, log scale) vs. problem size — a) Matrix multiplication (local in, remote out): CopyOutput vs. PeerOutput; b) Matrix multiplication (remote in, local out): CopyInput vs. PeerInput; c) GPU 3D stencil computation (25-point): Copy vs. Peer]
9.
• Execution concurrency
• Concurrency between GPU execution and memory transfers (Compute Capability 1.1)
• Concurrent HostToDevice/DeviceToHost memory transfers (Compute Capability 2.0)
• Concurrent kernel execution (Compute Capability 2.0)
• Limited to back-to-back execution
• Limited to a single CUDA context
• Complete concurrent kernel execution (Compute Capability 3.5)
└A universe of capabilities
Programmability issues in CUDA systems
10.
• Describing memory accessibility
• Private memory: cudaMalloc/malloc
• GPU → Host shared memory: cudaHostRegister/cudaHostAlloc
• Host memory pinned and mapped in the GPUs' page tables
• GPU ↔ GPU shared memory: cudaDeviceEnablePeerAccess
• GPU page tables synchronized
• Context-grain sharing (and only works for contexts on different GPUs)
• Describing dependencies among operations
• CUDA stream: in-order queue to execute implicitly dependent operations
• CUDA event: marker in a stream that allows finer-grain synchronization
└Access to hardware capabilities
Programmability issues in CUDA systems
11.
• Programmability issues in CUDA systems
• GMAC 2
• Use cases
• Conclusions
Outline
12.
• A process provides a single logical address space
• Defines the memory objects that can be accessed in a program
• Implemented on top of multiple virtual address spaces
• An address space defines which memory objects can be accessed from a device
• In contrast to the unified virtual address space offered by CUDA
└Memory model
[Diagram: one process spanning the CPU, GPU 1, and GPU 2]
GMAC 2
13.
• Memory objects can be made accessible to any device
• Single allocation + remote access
• Replication + software-based coherence is used when no hardware access is available
• Each virtual address space (AS) can have different views of an object
• A view is created by mapping an object on a virtual address space
• Properties of the view define the behavior of the coherence protocol (DSM)
└Memory model
[Diagram: one object (Obj) mapped as R/W, R/W, and R views into address spaces AS 1, AS 2, and AS 3]
GMAC 2
14.
└Memory model
[Diagram: logical, virtual, and physical address layers across CPU, GPU 1, and GPU 2 — a) Replication with software-based coherence; b) Host-mapped memory; peer memory access]
GMAC 2
15.
• Automatic optimizations
• Pinned memory management for memory transfers
• Eager memory transfers
• Features available in the hardware are used to optimize some scenarios
• Implementation details: restrictions
• CUDA UVAS makes replicated objects have different virtual addresses
• Stored pointers cannot be used on the device
• Mappings work at page granularity
└Memory model
GMAC 2
16.
• A process contains several execution contexts
• The execution context (OS thread) is the basic unit of concurrency
• Each execution context is assigned a default GPU
• Inherited on creation
• Like the CUDA 4 "current" device (cudaSetDevice)
• CUDA code is executed on…
• Implicitly: the default GPU
• Explicitly: a specific GPU (using the stream parameter of the call)
└Execution model
GMAC 2
17.
• Data accessible on a GPU is implicitly acquired/released at kernel call boundaries
• Provides implicit synchronization on data dependencies
• Dependencies between kernel calls are defined by the programmer using the standard inter-thread synchronization mechanisms
• Explicit consistency is also available for fine tuning
└Execution model
GMAC 2
18.
kernel<<<grid, block>>>(o, p);   // -> releases ownership of o and p
for (size_t i = 0; i < SIZE_C; ++i) {
    q[i]++;                      // CPU keeps working on q meanwhile
}
gmacThreadSynchronize();         // -> acquires ownership of o and p back
└Explicit consistency
GMAC 2
19.
• Hides the specific CUDA abstractions needed for execution concurrency (streams, events…)
• Allows concurrent code execution and memory transfers, and back-to-back kernel execution
• When triggered from different execution contexts (host threads)
└Execution model
GMAC 2
20.
• C/C++
• More flexible and composable API than GMAC
• Allows specifying memory access rights
• Allows explicit consistency management
• POSIX-like semantics
• GMAC primitives are still provided
• Easily implemented on top of the new API
• Single interface for CUDA/OpenCL programs
└GMAC 2 API
GMAC 2
21.
void *gmac::map(void *ptr, size_t count, int prot, int flags, device_id id)
void *gmac::map(const std::string &path, off_t offset, size_t count, int prot, int flags, device_id id)
• ptr == NULL creates a new logical object, and creates a view of the object for device id
• ptr != NULL creates a view of the specified object for device id
• prot declares the kind of access to the object: RO, W, RW
• flags allows specifying special properties for the view
• e.g. gmac::MAP_FIXED: use the address ptr for the new view
└Memory management API
GMAC 2
22.
error gmac::set_gpu(device_id id)
• Sets id as the default GPU to be used in implicit operations on the current execution context
device_id gmac::get_gpu()
• Gets the id of the default GPU assigned to the current execution context
└Execution model API
GMAC 2
23.
error func<<<grid, block, shared, device_id>>>(...)
• Executes the kernel func on the GPU with identifier device_id. If no device is specified, the default one is used
• Before executing, transfers the ownership of all the GPU-accessible objects to the GPU…
• … and acquires the ownership for the CPU again after kernel execution
• Alternative syntax for OpenCL based on C++11 variadic templates:
kernel::launch(config global, config block, config offset, device_id)(...)
└Execution model API
GMAC 2
24.
error gmac::acquire(void *ptr, size_t count, int prot, device_id id)
• Ensures that:
1) Device id sees an updated copy of the given data object
2) Device id has the ownership of the object
• prot specifies a weaker or equal access level than in map
error gmac::release(void *ptr, size_t count)
• Releases the ownership of the data object
└Memory coherence/consistency API
GMAC 2
25.
• Previous calls are still available
• gmacMalloc, gmacFree, gmacGlobalMalloc
• Implemented on top of the new calls
• You can mix both (if you know what you are doing)
└Backward compatibility
GMAC 2
26.
• Programmability issues in CUDA systems
• GMAC 2
• Use cases
• Conclusions
Outline
27.
gmacError_t gmacMalloc(void **a, size_t count)
{
    // Map on host memory
    *a = gmac::map(NULL, count, gmac::PROT_READWRITE, 0, gmac::cpu_id);
    // Map at the same address on the thread's current GPU
    gmac::map(*a, count, gmac::PROT_READWRITE, gmac::MAP_FIXED, gmac::get_gpu());
    return gmacSuccess;
}
└Simple shared allocation
GMAC 2: Use cases
28.
• Share input data across GPUs (aka gmacGlobalMalloc)
float * a = gmac::map(NULL, count,
gmac::PROT_READWRITE,
0,
gmac::cpu_id);
// Map on GPU1
gmac::map(a, count,
gmac::PROT_READ,
gmac::MAP_FIXED, gpu1_id);
// Map on GPU2
gmac::map(a, count,
gmac::PROT_READ,
gmac::MAP_FIXED, gpu2_id);
└Multi-GPU data replication
GMAC 2: Use cases
29.
• Partition data structures among GPUs
// Map on host memory
float * a = gmac::map(NULL, count,
gmac::PROT_READWRITE,
0,
gmac::cpu_id);
// Map on GPU1 memory
gmac::map(a, count/2,
gmac::PROT_READWRITE,
gmac::MAP_FIXED, gpu1_id);
// Map on GPU2 memory
gmac::map(a + count/2, count/2,
gmac::PROT_READWRITE,
gmac::MAP_FIXED, gpu2_id);
└Data partitioning
GMAC 2: Use cases
31.
• Input from file
a = gmac::map("input.data", 0, count,
gmac::PROT_READ,
0, gpu1_id);
• Optimize writing output to file (host-mapped memory)
a = gmac::map("output.data", 0, count,
gmac::PROT_WRITE,
0, gpu1_id);
└File I/O
GMAC 2: Use cases
32.
• Programmability issues in CUDA systems
• GMAC 2
• Use cases
• Conclusions
Outline
33.
• Eases the programmability on CUDA-based systems
• Offers a flexible API that allows developers to declare complex data schemes
• Transparently exploits the capabilities of the underlying hardware
• Solves the shortcomings of the previous GMAC version
• Backward compatibility
• We need support in the CUDA driver to fully implement replicated objects
(while keeping the same memory address)
└GMAC 2…
Conclusions
34.
• HAL (Hardware Abstraction Layer)
• Generic hardware abstractions
• DSM (Distributed Shared Memory)
• Coherence among address spaces using
release consistency
• ULAS (Unified Logical Address Space)
• Provides a single flat logical space for all memories
• System-level programming model
└Design
HAL
Programming model
DSM ULAS
GMAC 2
35.
• Questions?
Thanks for your attention
37.
Programmability issues in CUDA systems
• GPUs are still I/O devices in the OS
• No enforcement of system-level policies
• No standard memory management routines
• No shared memory across processes
• No GPU memory swapping
• Copies to/from I/O devices (GPU Direct is an ad-hoc solution for InfiniBand)
• No standard code execution routines
• No scheduling
└Exposing hardware capabilities
39.
Backup slides
• Physical description layer: description of the components in the system
• Processing units (e.g., GPUs, CPUs)
• Physical Memories (e.g., CPU memory, GPU memory)
• Physical address spaces
• Aggregation of memories that are directly accessible from a processing unit
• CUDA: GPU Direct 2, Host-mapped memory
• OpenCL: clEnqueueMapBuffer
└GMAC 2: Hardware Abstraction Layer
40.
GMAC 2
• System abstractions
• Virtual address space
• struct mm_struct
• Object: physical memory
• Collection of struct page
• View: mapping of an object in a virtual address space
• struct vm_area_struct
• Virtual processing unit: execution context to be executed on a processing unit
• struct task_struct
└Hardware Abstraction Layer
42.
GMAC 2
• Links memory areas:
dsm::link(hal::ptr dst, hal::ptr src, size_t count, int flags);
• Acquire/release consistency
dsm::acquire(hal::ptr p, size_t count, int flags);
dsm::release(hal::ptr p, size_t count);
• Depending on the conditions, optimizations are enabled
• Eager transfer
└Distributed Shared Memory
43.
GMAC 2
• Available since the introduction of CUDA 4.0, but with limitations:
• Not available on 32-bit machines
• Data structures must reside in pinned memory
└Unified Virtual Address Space