The document provides an introduction to GPU programming using CUDA. It outlines GPU and CPU architectures, the CUDA programming model involving threads, blocks and grids, and CUDA C language extensions. It also discusses compilation with NVCC, memory hierarchies, profiling code with Valgrind/Callgrind, and Amdahl's law in the context of parallelization. A simple CUDA program example is provided to demonstrate basic concepts like kernel launches and data transfers between host and device memory.
1. HPC Essentials
Part IV : Introduction to GPU programming using CUDA
Bill Brouwer
Research Computing and Cyberinfrastructure
(RCC), PSU
wjb19@psu.edu
2. Outline
●Introduction
●Nvidia GPUs
●CUDA
● Threads/blocks/grid
● C language extensions
● nvcc/cuda-gdb
● Simple Example
●Profiling Code
●Parallelization Case Study
● LU decomposition
●Optimization/Performance
● Tools
● Tips
wjb19@psu.edu
3. Introduction
●A modern CPU handles a number of 'heavy' computational threads
concurrently; Intel MIC devices might have as many as 32 cores running
4 threads each in the near future (Knights Corner)
●As of writing it's not uncommon to find 16-24 cores on a CPU motherboard; the smallest compute unit on recent Nvidia series devices handles 32 threads concurrently (a warp)
●As many as 1536 threads run actively on a streaming multiprocessor
(SM), and some devices contain as many as 16 multiprocessors,
permitting upwards of 24k threads to run on a device at a given time
●GPU threads are 'lightweight', cheap to run; no context switches in the
traditional sense, scheduler runs all and any available warps, no
swapping of state, registers etc
●GPU (device) is separated from CPU (host) by a PCIe bus
wjb19@psu.edu
4. Typical Compute Node
[block diagram of a typical compute node: CPU with RAM on the memory bus; GPU and other PCI-e cards attached over PCI-express; QuickPath Interconnect between CPU and IOH; Direct Media Interface to the ICH, which provides ethernet/NETWORK, SATA/USB, BIOS, and volatile/non-volatile storage]
wjb19@psu.edu
5. Introduction
●Host and device have physically distinct memory regions*; support was recently introduced for Unified Virtual Addressing (UVA), which makes development easier; host/device can share same address space
●GPUs are suited to data-parallel computation; same program on different data elements, benefits from data coherency
●This computational model maps data elements to parallel processing threads,
also benefits from running as many threads as possible
●SM receives groups of threads, split into warps and scheduled with two units
(Fermi); instruction from each warp goes to a group of sixteen cores, sixteen
load/store units, or four SFUs.
●Warps execute one common instruction at a time (each thread), may diverge
based on conditional branches, in which case execution becomes serial or
disabled until the branch is complete
*Occasionally they may share memory eg., Mac
wjb19@psu.edu
6. Fermi Streaming Multiprocessor
[diagram, adapted from the Nvidia Fermi whitepaper: two warp schedulers, each with a dispatch unit; 32768 x 32-bit registers; 4 Special Function Units; 16 Load/Store Units; 2 x 16 cores; each CUDA core has a dispatch port, operand collector, FPU and integer unit, and result queue; interconnect to 64kB shared memory/L1 cache]
wjb19@psu.edu
8. Outline
●Introduction
●Nvidia GPUs
●CUDA
● Threads/blocks/grid
● C language extensions
● nvcc/cuda-gdb
● Simple Example
●Profiling Code
●Parallelization Case Study
● LU decomposition
●Optimization/Performance
● Tools
● Tips
wjb19@psu.edu
9. CUDA threads
●CUDA or Compute Unified Device Architecture refers to both the hardware and software model of the Nvidia series GPUs; several language bindings are available eg., C
●GPUs execute kernels, special functions which when called are executed N times in parallel by N different CUDA threads
●Defined by use of the __global__ declaration specifier in source code
●Number of threads required is given in <<<...>>>, more on C language extensions shortly
●Fermi and onward allow for the execution of multiple kernels simultaneously
wjb19@psu.edu
10. CUDA threads
●threadIdx is a 3-component vector used to reference threads with a 1D, 2D or 3D index within a kernel
●A first approach porting serial code to CUDA may simply require
flattening loops and apportioning work to individual threads
     Thread Index    ID                        Size
1D   x               x                         Dx
2D   (x,y)           (x + y*Dx)                (Dx,Dy)
3D   (x,y,z)         (x + y*Dx + z*Dx*Dy)      (Dx,Dy,Dz)
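●As an illustration of flattening a loop nest onto threads (a minimal sketch, not from the slides; the kernel name, array names and the single-block launch are hypothetical):

//serial version:
//  for (int y=0; y<Dy; y++)
//    for (int x=0; x<Dx; x++)
//      out[x + y*Dx] = 2.0f*in[x + y*Dx];
__global__ void scaleKernel(const float* in, float* out){
    int x = threadIdx.x;
    int y = threadIdx.y;
    int idx = x + y*blockDim.x;   //2D thread index flattened as in the table above
    out[idx] = 2.0f*in[idx];      //one thread per element replaces the loop nest
}
//launched with one block of Dx x Dy threads, eg. dim3 threads(Dx,Dy); scaleKernel<<<1,threads>>>(devIn,devOut);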
●Threads can cooperate via shared memory, a small region
(~48kB) with much lower latency than global memory
wjb19@psu.edu
11. CUDA blocks
●A kernel is executed by multiple equally shaped thread blocks, with multiple blocks organized into a grid
●Each block in grid can be identified via index in kernel function, using
blockIdx variable; blocks have dimension blockDim
●Thread blocks execute independently; any order, parallel or series
●Allows for scalable, multi-core/processor code; threads within a warp
execute SIMD style
●Execution must be synchronized via barriers in the kernel using
__syncthreads() where communication between threads in a block
is required, although this introduces overhead and should be used
sparingly
●No global synchronization available; use multiple kernel invocations
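●A minimal sketch of block-level cooperation through shared memory and __syncthreads() (hypothetical kernel, not from the slides; assumes blockDim.x == 256 and an input length that is a multiple of 256):

__global__ void reverseTile(const float* in, float* out){
    __shared__ float tile[256];                       //per-block staging area
    int gid = blockDim.x*blockIdx.x + threadIdx.x;
    tile[threadIdx.x] = in[gid];                      //each thread stages one element
    __syncthreads();                                  //barrier: the whole tile is now visible to the block
    out[gid] = tile[blockDim.x - 1 - threadIdx.x];    //read an element written by another thread
}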
wjb19@psu.edu
12. CUDA grid
● Threads have multiple memory options
  ● Per thread: Registers → fastest
  ● Per block: Shared → fast & limited
  ● Global DRAM → slowest & largest
●Two additional memory resources; constant (64k) and texture, the
latter allows for faster random access, also permits cheap
interpolation on load, more in the intermediate seminar
●Initial step in heterogeneous program execution is to transfer data
from host to device global memory across PCIe bus; pinned host
memory is fastest but limited
●Global memory lasts for application, shared/constant/texture only
kernel lifetime
wjb19@psu.edu
13. CUDA grid
[diagram, adapted from the Nvidia Fermi whitepaper: per-thread registers; per-block shared memory in each Block(0,0)..Block(n,0); per-application global memory shared by Grid0 and Grid1]
wjb19@psu.edu
14. Outline
●Introduction
●Nvidia GPUs
●CUDA
● Threads/blocks/grid
● C language extensions
● nvcc/cuda-gdb
● Simple Example
●Profiling Code
●Parallelization Case Study
● LU decomposition
●Optimization/Performance
● Tools
● Tips
wjb19@psu.edu
15. CUDA C language extensions
●CUDA provides extensions to C, targets sections of code to run on
device using specifiers
●Also comprises a runtime library split into separate host
component (on CPU) and device components, as well as a
component common to both
● Uses function and variable type specifiers :
__device__ fn executed and called from device
__global__ fn executed on device, called from host
__host__ fn executed and called from host
__constant__ constant memory
__shared__ thread block memory
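●A minimal sketch with the specifiers together (names are illustrative, not from the slides):

__constant__ float coeff;                              //constant memory, set from the host
__device__ float square(float x){ return x*x; }        //executed and called from device
__global__ void applyKernel(float* data){              //executed on device, called from host
    __shared__ float tmp[128];                         //per thread block memory
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    tmp[threadIdx.x] = coeff*square(data[i]);
    __syncthreads();
    data[i] = tmp[threadIdx.x];
}
__host__ void run(float* devData, int n){              //ordinary host function
    applyKernel<<<n/128,128>>>(devData);
}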
wjb19@psu.edu
16. Heterogeneous Execution
[diagram, adapted from the CUDA programming guide: serial __host__ code on the host alternates with parallel phases on the device; Kernel0 launches Grid0 and Kernel1 launches Grid1, each grid consisting of Block(0,0)..Block(n,0) running __global__/__device__ code with __shared__ and __constant__ memory]
wjb19@psu.edu
17. CUDA C Language Extensions
●Call to __global__ must specify execution configuration for call
●Define dimension of the grid and blocks that will be used to execute the
function on the device, and associated stream(s)
●Specified by inserting <<<Dg,Db,Ns,s>>> between function name and args
●Dg is of type dim3, specifies the size and dimension of the grid
wjb19@psu.edu
18. CUDA C Language Extensions
●Db is of type dim3, specifies the size and dimension of each block, such
that Db.x*Db.y*Db.z is number of threads / block
●(optional, default=0) Ns is of type size_t, no. of bytes in shared memory that is dynamically allocated / block for this call, in addition to statically allocated mem.
●(optional, default=0) s is of type cudaStream_t, specifies associated
stream
●Examples; function declared as:
__global__ void Func(float* parameter);
must be called as:
Func<<<Dg,Db>>>(parameter);
at a minimum
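●To make all four configuration arguments concrete, a minimal sketch (grid/block sizes, the device pointer and the stream are hypothetical):

dim3 Dg(64,64);                        //64 x 64 blocks in the grid
dim3 Db(16,16);                        //16 x 16 = 256 threads per block
size_t Ns = 256*sizeof(float);         //dynamically allocated shared memory per block
cudaStream_t s;
cudaStreamCreate(&s);
Func<<<Dg,Db,Ns,s>>>(devParameter);    //full form of the execution configuration
cudaStreamDestroy(s);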
wjb19@psu.edu
19. Compilation with NVCC
●nvcc separates device & host code, compiles device code into
assembly or binary
● One must compile with compute capability suited to device
● Generated host code output as C or object code via host compiler
●#pragma unroll → by default compiler unrolls small loops,
control via #pragma unroll x ← times to unroll
●Use cuda-gdb to debug or dump output using printf in kernels
(compute capability 2.x)
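●A minimal sketch of the unroll pragma (hypothetical kernel; the -arch value in the comment is only an example and should match the device's compute capability):

//compile with eg. nvcc -arch=sm_20 -o sum8 sum8.cu
__global__ void sum8(const float* in, float* out){
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    float acc = 0.0f;
    #pragma unroll 8                  //unroll this fixed-length loop 8 times
    for (int k=0;k<8;k++)
        acc += in[8*i + k];
    out[i] = acc;
}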
wjb19@psu.edu
20. Practicalities
● CUDA is available as a module on lion-GA; once loaded all the tools and
libraries will be in your PATH and LD_LIBRARY_PATH respectively
●When getting started it's always a good idea to report any CUDA errors in your
code to stdout/stderr using the various methods eg., a little wrapper around
cudaGetLastError():
void reportCUDAError(const char * kern)
{
    fprintf(stderr,"fired kernel %s\n",kern);
    fprintf(stderr,"kernel result : %s\n",
        cudaGetErrorString(cudaGetLastError()));
}
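●For example, after a launch the wrapper might be used as follows (a sketch; the kernel is hypothetical, and cudaDeviceSynchronize() requires CUDA 4.0 or later):

myKernel<<<blocks, threads>>>(devInput, devOutput);
cudaDeviceSynchronize();              //wait so asynchronous execution errors are visible
reportCUDAError("myKernel");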
●The nvidia-smi utility gives useful information too with regards to device(s)
and any running applications eg., running processes, memory used etc
wjb19@psu.edu
21. Practicalities
●Remember that there are various devices available per node, you can select
them with cudaSetDevice()
●At the time of writing lion-GA has three per node ie., in a shared or distributed memory implementation you have the liberty to use all three if available :
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
//TESTME
//use devices with best bandwidth
int myDevice = (rank % 3);
cudaSetDevice(myDevice);
●Set and possibly force the topology you require in your PBS script,
devices are set to exclusive, one process per GPU
wjb19@psu.edu
22. Simple CUDA program I
#include <cuda.h>
#include <stdio.h>
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

//unimaginative demo for accuracy and scaling test
__device__ float trivialFunction(float theta){
    return M_PI * (cos(theta)*cos(theta) + sin(theta)*sin(theta));
}

__global__ void trivialKernel(float* input, float* output){
    //my index
    const int myIndex = blockDim.x * blockIdx.x + threadIdx.x;
    //load data
    float theta = M_PI * input[myIndex];
    //do some work and write out
    output[myIndex] = trivialFunction(theta);
} //end kernel
wjb19@psu.edu
23. Simple CUDA program II
extern "C"{
__host__ void trivialKernelWrapper(float* input, float* output, int size){
    //device pointers
    float *devInput, *devOutput;
    cudaMalloc((void**) &devInput, size * sizeof(float));
    cudaMalloc((void**) &devOutput, size * sizeof(float));
    // Load device memory
    cudaMemcpy(devInput, input, size*sizeof(float), cudaMemcpyHostToDevice);
    //setup the grid
    dim3 threads,blocks;
    threads.x=512;
    blocks.x= size / 512;
    //run
    trivialKernel<<<blocks, threads>>>(devInput, devOutput);
    //transfer back to host
    cudaMemcpy(output, devOutput, size * sizeof(float), cudaMemcpyDeviceToHost);
    //cleanup
    cudaFree(devInput);
    cudaFree(devOutput);
} //end host function
} //end extern
wjb19@psu.edu
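●A minimal host-side driver for the wrapper above (a sketch, not part of the original deck; assumes the usual stdio/stdlib includes and a size that is a multiple of 512, as the grid setup requires):

int main(void){
    const int size = 2048;                            //must be a multiple of 512 here
    float *input  = (float*) malloc(size*sizeof(float));
    float *output = (float*) malloc(size*sizeof(float));
    for (int i=0;i<size;i++)
        input[i] = (float) i / size;                  //arbitrary test angles
    trivialKernelWrapper(input, output, size);
    printf("output[0] = %f (expect ~pi)\n", output[0]);   //cos^2 + sin^2 == 1, scaled by pi
    free(input);
    free(output);
    return 0;
}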
24. Outline
●Introduction
●Nvidia GPUs
●CUDA
● Threads/blocks/grid
● C language extensions
● nvcc/cuda-gdb
● Simple Example
●Profiling Code
●Parallelization Case Study
● LU decomposition
●Optimization/Performance
● Tools
● Tips
wjb19@psu.edu
25. Amdahl's Law revisited
●No less relevant for GPU; in the limit as processors N → ∞ we find the
maximum performance improvement :
1/(1-P)
●It is helpful to see the 3dB points for this limit ie., the number of processing
elements N1/2 required to achieve (1/√2)*max = 1/(√2*(1-P)); equating with
Amdahl's law & after some algebra :
N1/2 = 1/((1-P)*(√2-1))
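●Filling in the algebra: with Amdahl's law $S(N) = \frac{1}{(1-P) + P/N}$, setting $S(N_{1/2}) = \frac{1}{\sqrt{2}\,(1-P)}$ gives

$$(1-P) + \frac{P}{N_{1/2}} = \sqrt{2}\,(1-P) \quad\Rightarrow\quad N_{1/2} = \frac{P}{(\sqrt{2}-1)(1-P)} \approx \frac{1}{(\sqrt{2}-1)(1-P)}$$

where the last step drops the factor of P in the numerator, a good approximation for P near 1 (the regime of interest below).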
[plot: N1/2 (0 to 300) versus parallel code fraction P (0.90 to 0.99)]
wjb19@psu.edu
26. Amdahl's law : implications
●Points to note from the graph :
● P ~ 0.90, we can benefit from ~ 20 elements
● P ~ 0.99, we can benefit from ~ 256 elements
● P → 1, we approach the “embarrassingly parallel” limit
● P ~ 1, performance improvement directly proportional to elements
● P ~ 1 implies independent or batch processes
●Want at least P ~ 0.9 to justify work of porting code to accelerator
●For many problems in sciences, as characteristic size increases, P
approaches 1 and scaling benefits outweigh cost of porting code (cf
weak vs strong scaling)
●Regardless, doesn't remove the need to profile and clearly target
suitable sections of serial code
wjb19@psu.edu
28. Outline
●Introduction
●Nvidia GPUs
●CUDA
● Threads/blocks/grid
● C language extensions
● nvcc/cuda-gdb
● Simple Example
●Profiling Code
●Parallelization Case Study
● LU decomposition
●Optimization/Performance
● Tools
● Tips
wjb19@psu.edu
29. LU Algorithm
●Used for linear equations solution, method also for finding matrix
determinant; application in this case is to very large number of small
matrix determinants ie., product of diagonal elements after LU
●Based on Crout's method; if we factor the matrix as lower/upper (eg., 3x3, reconstructed below), then we are left with a set of overdetermined equations to solve; the diagonal of alpha is set to one, and writes of alpha/beta are done in place within A
●In solving overdetermined set, numerical stability requires finding
maximum pivot, a dividing element in the equation's solution that may
entail a row permutation
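●The 3x3 factorization referred to above (reconstructed here in the Numerical Recipes convention the notation suggests, with unit diagonal on the lower factor):

$$\begin{pmatrix} 1 & 0 & 0 \\ \alpha_{21} & 1 & 0 \\ \alpha_{31} & \alpha_{32} & 1 \end{pmatrix}
\begin{pmatrix} \beta_{11} & \beta_{12} & \beta_{13} \\ 0 & \beta_{22} & \beta_{23} \\ 0 & 0 & \beta_{33} \end{pmatrix}
= \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$$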
wjb19@psu.edu
30. Porting LU Decomposition to GPU
●Matrix size N*N ~ 12*12 elements; block size (x,y) in threads ?
● As many threads as matrix elements??
→ Too many dependencies in algorithm and implied serialization
● One thread per matrix??
→ Completely uncoalesced read/writes from global memory
→ ~ 800 clock cycles each, too expensive
●What about shared memory ~ 1-32 cycles per access??
● Should make block size x 32 (warp size), promotes coalescing
and efficiency, memory accesses ~ hidden behind computation
● Thus we need at least 32*(4*144) = 18 kB / block
→ only have 48k per SM which has 8+ cores, not enough to go round
● Start with ~ N==matrixSide threads per matrix
wjb19@psu.edu
31. Digression : Coalesced Memory Access
●Highest priority for performant code; refers to load/stores from/to global memory organized into smallest number of transactions possible
●On devices with compute capability 2.x, number of transactions required
to load data by threads of a warp is equal to the number of cache lines
required to service the warp (L1 has 128 byte cache lines eg., 32 floats
per line)
●Aligned data or offset/stride a multiple of 32 is best; non-unit stride
degrades memory bandwidth as more cache lines are required; worst
case scenario 32 cache lines required for 32 threads
[diagram: byte addresses 0, 32, 64, ..., 384 spanning a warp's accesses]
●Coalesced access: all threads load/store from same line
wjb19@psu.edu
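●A minimal sketch of the contrast (hypothetical kernels; on compute 2.x the strided pattern touches many more 128-byte cache lines per warp):

__global__ void coalesced(const float* in, float* out){
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    out[i] = in[i];                //consecutive threads read consecutive words: few cache lines per warp
}

__global__ void strided(const float* in, float* out, int stride){
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    out[i] = in[i*stride];         //stride > 1 spreads a warp's loads over many cache lines
}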
32. Digression : Shared Memory
●Is on chip, lower latency/higher bandwidth than global memory;
organized into banks that may be processed simultaneously
●Note that unless data will be re-used there is no point loading into shared memory
●Shared memory accesses that map to the same bank must be serialized → bank conflicts
●Successive 32 bit words are assigned to successive banks (32 banks
for Fermi); a request without conflicts can be serviced in as little as 1
clock cycle
[diagram: shared memory banks 0..31; example 2-way bank conflict]
wjb19@psu.edu
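●A minimal sketch of conflict-free versus 2-way conflicting access (hypothetical kernel; assumes one warp per block, ie. blockDim.x == 32, and the 32 banks of Fermi):

__global__ void bankDemo(float* out){
    __shared__ float s[64];
    s[threadIdx.x]      = (float) threadIdx.x;    //stride-1: each thread its own bank, no conflict
    s[threadIdx.x + 32] = 0.0f;
    __syncthreads();
    float a = s[threadIdx.x];                     //conflict-free read
    float b = s[2*threadIdx.x];                   //stride-2: threads t and t+16 share a bank, 2-way conflict
    out[blockDim.x*blockIdx.x + threadIdx.x] = a + b;
}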
33. Shared Memory : Broadcast
●Looks like a bank conflict, but when all threads in a half warp access the
same shared memory location, results in an efficient broadcast operation
eg., nearest neighbor type access in an MD code :
scratchA[threadIdx.x] = tex1Dfetch(texR_NBL,K+0*nAtoms_NBL);
scratchB[threadIdx.x] = tex1Dfetch(texR_NBL,K+1*nAtoms_NBL);
scratchC[threadIdx.x] = tex1Dfetch(texR_NBL,K+2*nAtoms_NBL);
scratchD[threadIdx.x] = K;
__syncthreads();
//Loop over all particles belonging to an adjacent cell
//one thread per particle in considered cell
for(int j = 0;j<256;j++){
pNx = scratchA[j];
pNy = scratchB[j];
pNz = scratchC[j];
K = scratchD[j];
●Multi-cast is possible in devices with compute arch 2.x; but I digress,
back to example...
wjb19@psu.edu
34. First Step : Scaling Information
for (int i=0;i<matrixSide;i++) {
    big=0.0;
    for (int j=0;j<matrixSide;j++)
        if ((temp=fabs(inputMatrix[i*matrixSide+j])) > big)
            big=temp;
    if (big < TINY)
        return 0.0; //Singular Matrix/zero entries
    scale[i]=1.0/big;
}
●Threads can work independently, scanning rows for the largest
element; map outer loop to thread index
int i = threadIdx.x;
big=0.0;
for (int j=0;j<matrixSide;j++)
    if ((temp=fabs(inputMatrix[i*matrixSide+j])) > big)
        big=temp;
if (big < TINY)
    return 0.0; //Singular Matrix/zero entries
scale[i]=1.0/big;
wjb19@psu.edu
36. Second Step : First Equations
for (int j=0;j<matrixSide;j++) {
    for (int i=0;i<j;i++) {
        sum=inputMatrix[i][j];
        for (int k=0;k<i;k++)
            sum -= inputMatrix[i][k]*inputMatrix[k][j];
        inputMatrix[i][j]=sum;
    }
●Threads can't do columns independently due to pivoting/possible row
swaps later in algorithm; only threads above diagonal work
for (int j=0;j<matrixSide;j++) {
    __syncthreads(); //ensure accurate data from lower columns/j
    if (i<j) {
        sum=inputMatrix[i][j];
        for (int k=0;k<i;k++)
            sum -= inputMatrix[i][k]*inputMatrix[k][j];
        inputMatrix[i][j]=sum;
    }
wjb19@psu.edu
37. Second Step : Memory Access
●Simple Octave scripts make your life easier when visualizing memory
access/load balancing etc
for j=0:9
    disp('---');
    disp(['loop j = ',num2str(j)]);
    disp(' ');
    for i=0:j-1
        disp(['read : ',num2str(i),",",num2str(j)]);
        for k=0:i-1
            disp(['read : ',num2str(i),",",num2str(k),"  ",num2str(k),",",num2str(j)]);
        end
        disp(['write : ',num2str(i),",",num2str(j)]);
    end
end
wjb19@psu.edu
38. Second Step : Memory Access
●Data from previous script is readily visualized using opencalc; the load
balancing in this case is fairly poor; on average ~ 50% of threads work,
and the amount of work performed increases with threadIdx.x
●Potential conflict/race??
wjb19@psu.edu
39. Second Step : Memory Access
● Take j=2 for example; thread 0 writes the location (0,2) thread 1 expects to read
●However, within a warp, recall that threads execute SIMD style (single
instruction-multiple data); regardless of thread execution order, a memory
location is written before it is required by a thread in this example
[diagram: an N x N matrix with row/column indices 0, 1, 2, ..., N-1]
● Also recall that blocks which are composed of warps are scheduled and executed in any order; we are lucky in this case that the threads working on the same matrix will be a warp or less
wjb19@psu.edu
40. Third Step : Second Equations
big = 0.0;
if ((i > j) && (i < matrixSide)) {
    sum=inputMatrix[i][j];
    for (int k=0;k<j;k++)
        sum -= inputMatrix[i][k]*inputMatrix[k][j];
    inputMatrix[i][j]=sum;
    ...
}
● Now we work from diagonal down; memory access is similar to before, just
'flipped'
●This time increasingly fewer threads are used; note also we ignore i==j since this just reads and writes, no -= performed
wjb19@psu.edu
41. Fourth Step : Pivot Search
big=0.0;
if ((i > j) && (i < matrixSide)) {
    ...
    if ((dum=holder[i]*fabs(sum)) >= big) {
        big=dum;
        imax=i;
    }
}
●Here we search the valid rows for largest value and corresponding index
●We will use shared memory to allow thread cooperation, specifically
through parallel reduction
__shared__ float big[matrixSide];
__shared__ int ind[matrixSide];
__shared__ float bigs;
__shared__ int inds;
wjb19@psu.edu
42. Fourth Step : Pivot Search
big[threadIdx.x]=0.0;
ind[threadIdx.x]=threadIdx.x;
if ((i > j) && (i < matrixSide)) {
    ...
    big[threadIdx.x] = scale[i]*fabs(sum);
} //note we close this scope, need lowest thread(s) active for
  //reduction
__syncthreads();
//perform parallel reduction
...
//lowest thread writes out to scalars
if (threadIdx.x==0){
    bigs = big[0];
    imax = ind[0];
}
__syncthreads();
wjb19@psu.edu
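●The reduction itself is elided above; one way it could look (an illustrative sketch only, not the author's code; assumes matrixSide is a power of two and that every thread of the block reaches this point):

for (int stride = matrixSide/2; stride > 0; stride /= 2){
    if (threadIdx.x < stride){
        if (big[threadIdx.x + stride] > big[threadIdx.x]){
            big[threadIdx.x] = big[threadIdx.x + stride];   //keep the larger candidate pivot
            ind[threadIdx.x] = ind[threadIdx.x + stride];   //and the row index that produced it
        }
    }
    __syncthreads();
}
//big[0]/ind[0] now hold the maximum and its row, read by thread 0 above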
44. Fifth Step : Row Swap
if (j != imax) {
    for (int k=0;k<matrixSide;k++) {
        dum=inputMatrix[imax][k];
        inputMatrix[imax][k]=inputMatrix[j][k];
        inputMatrix[j][k]=dum;
    }
    //should also track permutation of row order
    scale[imax]=scale[j];
}
●Each thread finds the same conditional result, no warp divergence (a very good
thing); if they swap rows, use same index i as before, one element each :
if (j != imax) {
    dum=inputMatrix[imax][i];
    inputMatrix[imax][i]=inputMatrix[j][i];
    inputMatrix[j][i]=dum;
    if (threadIdx.x==0) scale[imax]=scale[j];
}
wjb19@psu.edu
45. Sixth Step : Final Scaling
if (j != matrixSide-1){
    dum=1.0/inputMatrix[j*matrixSide+j];
    for (int i=j+1;i<matrixSide;i++)
        inputMatrix[i*matrixSide+j] *= dum;
}
●Use a range of threads again:
if (j != matrixSide-1){
    dum=1.0/inputMatrix[j*matrixSide+j];
    if ((i>=j+1) && (i<matrixSide))
        inputMatrix[i*matrixSide+j] *= dum;
}
__syncthreads();
●Next steps → compile; get it working and producing the right answer, then optimize... (a consolidated sketch of the kernel body follows below)
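●Putting the pieces together, a rough sketch of how the steps above might be arranged in one kernel (a sketch only, not the author's exact code: the kernel name is assumed, matrixSide is assumed to be a compile-time constant, and the step bodies shown on the earlier slides are elided as comments):
__global__ void ludKernel(float *inputMatrix, float *scale)
{
    int i = threadIdx.x; //one thread per matrix row, as in the snippets above
    __shared__ float big[matrixSide];
    __shared__ int ind[matrixSide];
    __shared__ float bigs;
    __shared__ int imax;

    for (int j=0; j<matrixSide; j++) {
        __syncthreads();
        //second step: rows above the diagonal (i < j) update column j
        //third step: rows below the diagonal (i > j) update column j
        //fourth step: the same threads write pivot candidates into big[]/ind[]
        __syncthreads();
        //parallel reduction over big[]/ind[]; thread 0 writes bigs and imax
        __syncthreads();
        //fifth step: row swap when j != imax, one element per thread
        //sixth step: scale column j below the diagonal
        __syncthreads();
    }
}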
46. Outline
●Introduction
●Nvidia GPU's
●CUDA
● Threads/blocks/grid
● C language extensions
● nvcc/cuda-gdb
● Simple Example
●Profiling Code
●Parallelization Case Study
● LU decomposition
●Optimization/Performance
● Tools
● Tips
47. Occupancy Calculator
●Once you've compiled and have things working, you can enter the threads/block, registers/thread and shared memory per block reported by the compiler into the occupancy calculator spreadsheet
●Use this tool to increase occupancy, which is the ratio of active warps to the total available (a rough worked example follows)
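●As a rough illustration (numbers assumed, not taken from this case study): an sm_20 multiprocessor can hold at most 48 resident warps; if register or shared-memory usage limits a kernel to 32 resident warps per multiprocessor, occupancy = 32/48 ≈ 67%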
48. Timing
●Time and understand your kernel; consider viewing operations in a spreadsheet as before and load balancing, e.g. giving more work to shorter threads by placing extra loop limits in constant memory:
cudaEvent_t start,stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);
myKernel<<<blocks, threads>>>(...);
cudaEventRecord(stop,0);
cudaEventSynchronize(stop); //wait until the stop event has actually been recorded
cudaEventElapsedTime(&time,start,stop); //elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);
printf("%f\n",time);
49. CUDA profilers
●Use the profilers to understand your code and improve performance by
refactoring, changing memory access and layout, block size etc
nvvp
computeprof
50. Precision
●Obvious performance differences between single and double precision; it is much easier to generate floating point exceptions in the former
●Solution → mixed precision, algorithms for correction & error checking, removing the need for oft-times painful debugging:
#ifndef _FPE_DEBUG_INFO_
#define _FPE_DEBUG_INFO_ printf("fpe detected in %s around %d, dump follows:\n",__FILE__,__LINE__);
#endif

#define fpe(x) (isnan(x) || isinf(x))

#ifdef _VERBOSE_DEBUG_
if (fpe(FTx)){
    _FPE_DEBUG_INFO_
    printf("FTx : %f threadIdx %i\n",FTx,threadIdx.x);
    asm("trap;"); //halt the kernel so the condition can be inspected
}
#endif
51. Pinned Memory
cudaError_t cudaMallocHost(void ** ptr, size_t size)
●Higher bandwidth memory transfers between host and device; from the manual (a usage sketch follows the excerpt):
Allocates size bytes of host memory that is page-locked and accessible to the device. The
driver tracks the virtual memory ranges allocated with this function and automatically
accelerates calls to functions such as cudaMemcpy*(). Since the memory can be
accessed directly by the device, it can be read or written with much higher bandwidth than
pageable memory obtained with functions such as malloc(). Allocating excessive amounts
of memory with cudaMallocHost() may degrade system performance, since it reduces the
amount of memory available to the system for paging. As a result, this function is best
used sparingly to allocate staging areas for data exchange between host and device
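●A minimal sketch of staging a transfer through pinned host memory (buffer names and size are assumed):
size_t bytes = 1 << 26; //64 MB staging area (size assumed)
float *hostBuf, *devBuf;
cudaMallocHost((void**)&hostBuf, bytes); //page-locked host buffer
cudaMalloc((void**)&devBuf, bytes);
//... fill hostBuf on the host ...
cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice); //higher bandwidth than from pageable memory
//... kernel work ...
cudaFreeHost(hostBuf);
cudaFree(devBuf);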
52. Driver & Kernel
●The driver is (for most of us) a black box; signals can elicit undesirable/non-performant responses:
[wjb19@lionga1 src]$ strace -p 16576
Process 16576 attached - interrupt to quit
wait4(-1, 0x7fff1bfbed6c, 0, NULL) = ? ERESTARTSYS (To be
restarted)
--- SIGHUP (Hangup) @ 0 (0) ---
rt_sigaction(SIGHUP, {0x4fac80, [HUP], SA_RESTORER|SA_RESTART,
0x3d4d232980}, {0x4fac80, [HUP], SA_RESTORER|SA_RESTART,
0x3d4d232980}, 8) = 0
●Driver/kernel bugs and conflicts; make sure driver, kernel and indeed device are well matched:
[wjb19@tesla1 bin]$ more /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 285.05.09 Fri Sep
23 17:31:57 PDT 2011
GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-51)
[wjb19@tesla1 bin]$ uname -r
2.6.18-274.7.1.el5
53. Compiler
●Compiler options (indeed compiler choice: --nvvm, --open64) can make huge differences:
● Fast math (--use_fast_math)
● Study the output of the ptx optimizing assembler, e.g. for register usage, and avoid spillage (--ptxas-options=-v); an example invocation follows the listing below
ptxas info : Compiling entry function '_Z7NBL_GPUPiS_S_S_S_' for 'sm_20'
ptxas info : Function properties for _Z7NBL_GPUPiS_S_S_S_
    768 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z9make_int4iiii
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __int_as_float
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for fabsf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for sqrtf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 49 registers, 3072+0 bytes smem, 72 bytes cmem[0], 48 bytes cmem[14]
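●For reference, an invocation along these lines (source and output file names assumed) enables fast math and has ptxas report per-kernel resource usage like the listing above:
nvcc -arch=sm_20 --use_fast_math --ptxas-options=-v myKernel.cu -o myApp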
54. Communication : Infiniband/MPI
●GPUDirect; initial results are encouraging for performing network transfers of simple payloads between devices (a sketch of the core operations follows the run output below):
[wjb19@lionga scratch]$ mpicc -I/usr/global/cuda/4.1/cuda/include
-L/usr/global/cuda/4.1/cuda/lib64 mpi_pinned.c -lcudart
[wjb19@lionga scratch]$ qsub test_mpi.sh
2134.lionga.rcc.psu.edu
[wjb19@lionga scratch]$ more test_mpi.sh.o2134
Process 3 is on lionga7.hpc.rcc.psu.edu
Process 0 is on lionga8.hpc.rcc.psu.edu
Process 1 is on lionga8.hpc.rcc.psu.edu
Process 2 is on lionga8.hpc.rcc.psu.edu
Host->device bandwidth for process 3: 2369.949046 MB/sec
Host->device bandwidth for process 1: 1847.745750 MB/sec
Host->device bandwidth for process 2: 1688.932426 MB/sec
Host->device bandwidth for process 0: 1613.475749 MB/sec
MPI send/recv bandwidth: 4866.416857 MB/sec
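●The core of such a test might look roughly like this (a sketch only, not the contents of mpi_pinned.c; rank, count and bytes are assumed to be set up elsewhere):
float *hostBuf, *devBuf;
cudaMallocHost((void**)&hostBuf, bytes); //pinned staging buffer, timed for host->device bandwidth
cudaMalloc((void**)&devBuf, bytes);
cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
if (rank == 0)
    MPI_Send(hostBuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD); //timed for MPI send/recv bandwidth
else if (rank == 1)
    MPI_Recv(hostBuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);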
●A single IOH supports 36 PCIe lanes, and NVIDIA recommend 16 per device; however the IOH doesn't currently support P2P between GPU devices on different chipsets
55. Communication : P2P
●For this reason it is difficult to achieve peak transfer rates between GPUs separated by QPI, e.g.:
>cudaMemcpy between GPU0 and GPU1: 316.54 MB/s
>cudaMemcpy between GPU0 and GPU2: 316.55 MB/s
>cudaMemcpy between GPU1 and GPU0: 316.86 MB/s
>cudaMemcpy between GPU2 and GPU0: 316.93 MB/s
>cudaMemcpy between GPU1 and GPU2: 3699.74 MB/s
>cudaMemcpy between GPU2 and GPU1: 3669.23 MB/s
[wjb19@lionga1 ~]$ lspci -tvv
-+-[0000:12]-+-01.0-[1d]--
| +-02.0-[1e]--
| +-03.0-[17-1b]----00.0-[18-1b]--+-04.0-[19]--
| | +-10.0-[1a]--
| | -14.0-[1b]--+-00.0 nVidia Corporation Device 1091
| | -00.1 nVidia Corporation GF110 High...
| +-04.0-[1f]--
| +-05.0-[20]--
| +-06.0-[21]--
| +-07.0-[13-16]----00.0-[14-16]--+-04.0-[15]--+-00.0 nVidia Corporation Device 1091
| | | -00.1 nVidia Corporation GF110 High...
...
...
+-05.0-[05]----00.0 Mellanox Technologies MT26438 [ConnectX VPI PCIe 2.0 5GT/s ...
+-06.0-[0e]--
+-07.0-[06-0a]----00.0-[07-0a]--+-04.0-[08]--+-00.0 nVidia Corporation Device 1091
| | -00.1 nVidia Corporation GF110 High...
| +-10.0-[09]--
| -14.0-[0a]--
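●Checking and enabling peer access is straightforward with the runtime API (a sketch; device ordinals assumed):
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1); //can device 0 address device 1's memory?
if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0); //flags argument must be 0
    //cudaMemcpyPeer / cudaMemcpy between the two devices can now take the direct P2P path
}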
56. Summary
●CUDA and Nvidia GPUs in general provide an exciting and accessible platform for introducing acceleration, particularly to scientific applications
●Writing effective kernels relies on mapping the problem to the memory and thread hierarchy, ensuring coalesced loads and high occupancy, to name a couple of important optimizations
●Nvidia refers to the overall porting process as APOD → assess, parallelize, optimize, deploy
●To accomplish this cycle, one must balance many factors:
● Profiling & readying serial code for porting
● Optimal tiling & communication
● P2P, network, host/device transfers
● Compiler and other optimizations
● Application/kernel/driver interactions
● Precision, etc.
57. References
●CUDA C Best Practices Guide (January 2012)
●NVIDIA’s Next Generation CUDA Compute Architecture: Fermi (2009)
●NVIDIA CUDA C Programming Guide 4.0
●NVIDIA GPUDirect™ Technology – Accelerating GPU-based Systems (Mellanox)
●CUDA Toolkit 4.0 Overview
●Numerical Recipes in C
Acknowledgements
●Sreejith Jaya/physics
●Pierre-Yves Taunay/aero
●Michael Fenn/rcc
●Jason Holmes/rcc
wjb19@psu.edu