More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins) and NVIDIA.
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA) - npinto
The document discusses parallel computing using GPUs and CUDA. It introduces CUDA as a parallel programming model that allows writing parallel code in a C/C++-like language that can execute efficiently on NVIDIA GPUs. It describes key CUDA abstractions like a hierarchy of threads organized into blocks, different memory spaces, and synchronization methods. It provides an example of implementing parallel reduction and discusses strategies for mapping algorithms to GPU architectures. The overall message is that CUDA makes massively parallel computing accessible using a familiar programming approach.
The document provides an introduction to GPU programming using CUDA. It outlines GPU and CPU architectures, the CUDA programming model involving threads, blocks and grids, and CUDA C language extensions. It also discusses compilation with NVCC, memory hierarchies, profiling code with Valgrind/Callgrind, and Amdahl's law in the context of parallelization. A simple CUDA program example is provided to demonstrate basic concepts like kernel launches and data transfers between host and device memory.
The document provides an overview of introductory GPGPU programming with CUDA. It discusses why GPUs are useful for parallel computing applications due to their high FLOPS and memory bandwidth capabilities. It then outlines the CUDA programming model, including launching kernels on the GPU with grids and blocks of threads, and memory management between CPU and GPU. As an example, it walks through a simple matrix multiplication problem implemented on the CPU and GPU to illustrate CUDA programming concepts.
This document discusses GPU computing with CUDA and NVIDIA Tesla hardware. It provides an overview of GPU computing and how it differs from CPU computing in being optimized for data-parallel throughput rather than low latency. It also describes the key specifications of the NVIDIA Tesla C1060 GPU and Tesla streaming multiprocessor. Finally, it outlines the CUDA parallel computing architecture and programming model, including how applications use the GPU as a coprocessor through kernels launched from the CPU.
This document provides an overview of CUDA (Compute Unified Device Architecture), NVIDIA's parallel computing platform and programming model that allows software developers to leverage the parallel compute engines in NVIDIA GPUs. The document discusses key aspects of CUDA including: GPU hardware architecture with many scalar processors and concurrent threads; the CUDA programming model with host CPU code calling parallel kernels that execute across multiple GPU threads; memory hierarchies and data transfers between host and device memory; and programming basics like compiling with nvcc, allocating and copying data between host and device memory.
CUDA is a parallel computing platform and programming model developed by Nvidia that allows software developers and researchers to utilize GPUs for general purpose processing. CUDA allows developers to achieve up to 100x performance gains over CPU-only applications. CUDA works by having the CPU copy input data to GPU memory, executing a kernel program on the GPU that runs in parallel across many threads, and copying the results back to CPU memory. Key GPU memories that can be used in CUDA programs include shared memory for thread cooperation, textures for cached reads, and constants for read-only data.
This document provides an overview of CUDA (Compute Unified Device Architecture) and GPU programming. It begins with definitions of CUDA and GPU hardware architecture. The history of GPU development from basic graphics cards to modern programmable GPUs is discussed. The document then covers the CUDA programming model including the device model with multiprocessors and threads, and the execution model with grids, blocks and threads. It includes a code example to calculate squares on the GPU. Performance results are shown for different GPUs on a radix sort algorithm. The document concludes that GPU computing is powerful and will continue growing in importance for applications.
The document discusses Compute Unified Device Architecture (CUDA), which is a parallel computing platform and programming model created by Nvidia that allows software developers to use GPUs for general-purpose processing. It provides an overview of CUDA, including its execution model, implementation details, applications, and advantages/drawbacks. The document also covers CUDA programming, compiling CUDA code, CUDA architectures, and concludes that CUDA has brought significant innovations to high performance computing.
Kato Mivule: An Overview of CUDA for High Performance Computing - Kato Mivule
This document provides an overview of CUDA (Compute Unified Device Architecture), a parallel computing platform developed by NVIDIA that allows programming of GPUs for general-purpose processing. It outlines CUDA's process flow of copying data to the GPU, running a kernel program on the GPU, and copying results back to CPU memory. It then demonstrates CUDA concepts like kernel and thread structure, memory management, and provides a code example of vector addition to illustrate CUDA programming.
The document provides an overview of GPU computing and CUDA programming. It discusses how GPUs enable massively parallel and affordable computing through their manycore architecture. The CUDA programming model allows developers to accelerate applications by launching parallel kernels on the GPU from their existing C/C++ code. Kernels contain many concurrent threads that execute the same code on different data. CUDA features a memory hierarchy and runtime for managing GPU memory and launching kernels. Overall, the document introduces GPU and CUDA concepts for general-purpose parallel programming on NVIDIA GPUs.
A beginner’s guide to programming GPUs with CUDA - Piyush Mittal
This document provides an overview of GPU programming with CUDA. It defines what a GPU is, that it has many compute cores for graphics processing. It explains that CUDA extends C to access GPU capabilities, allowing for parallel execution across GPU threads. It provides examples of CUDA code structure and keywords to specify where code runs and launch kernels. Performance considerations include data storage, shared memory, and efficient thread scheduling.
1. CUDA provides a programming environment and APIs that allow developers to leverage GPUs for general purpose computing. The CUDA C API offers both a high-level runtime API and a lower-level driver API.
2. CUDA programs define kernels that execute many parallel threads on the GPU. Threads are organized into blocks that can cooperate through shared memory, and blocks are organized into grids (see the kernel sketch after this list).
3. The CUDA memory model includes a hierarchy from fast per-thread registers to slower shared, global, and host memories. This hierarchy allows threads within blocks to communicate efficiently through shared memory.
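To make the kernel/block/grid vocabulary above concrete, here is a minimal CUDA sketch; it is our illustration, not taken from the summarized document, and the kernel name scale and the launch configuration are invented:

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n)
        x[i] *= a;  // every thread runs this same line on a different element
}

// Launch with 256 threads per block and enough blocks to cover n elements:
// scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);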
1) The document provides an introduction to GPGPU programming with CUDA, outlining goals of providing an overview and vision for using GPUs to improve applications.
2) Key aspects of GPU programming are discussed, including the large number of cores devoted to data processing, example applications that are well-suited to parallelization, and the CUDA tooling in Visual Studio.
3) A hands-on example of matrix multiplication is presented to demonstrate basic CUDA programming concepts like memory management between host and device, kernel invocation across a grid of blocks, and using thread IDs to parallelize work.
This document provides an overview of CUDA architecture and programming. It discusses key CUDA concepts like the host/device model, CUDA C extensions, GPU memory management, and parallel programming using CUDA threads and blocks. CUDA allows developers to speed up applications by offloading work to the GPU. It provides a scalable parallel programming model that maps threads to GPU threads to express data-level parallelism across thousands of lightweight threads for applications like high-bandwidth computing and visual computing.
This document provides an overview of MXF and AAF file formats. It discusses:
1. Why these formats were developed, which was to allow for content-centric workflows with metadata handling, random access to material, and open standardized compression-independent formats.
2. What the formats are, with MXF being a wrapper format for interchange of finished audiovisual material and metadata, and AAF being a more complex wrapper of metadata and essence for post-production interchange.
3. Some key concepts around the formats, including the source reference chain that allows tracking material origins and derivations, and operational patterns that control complexity.
The document introduces SGC Ruby CUDA, a Ruby library that provides an object-oriented API for CUDA programming to bridge Ruby and CUDA C/C++. It allows performing operations like memory allocation and transfer as well as kernel launching from Ruby. The library aims to make CUDA programming accessible from Ruby while hiding complexity of the low-level CUDA driver and runtime APIs.
This document provides an introduction to the CUDA parallel computing platform from NVIDIA. It discusses the CUDA hardware capabilities including GPUDirect, Dynamic Parallelism, and HyperQ. It then outlines three main programming approaches for CUDA: using libraries, OpenACC directives, and programming languages. It provides examples of libraries like cuBLAS and cuRAND. For OpenACC, it shows how to add directives to existing Fortran/C code to parallelize loops. And for languages, it lists supports like CUDA C/C++, CUDA Fortran, Python with PyCUDA etc. The document aims to provide developers with maximum flexibility in choosing the best approach to accelerate their applications using CUDA and GPUs.
The document discusses the ext2 file system used in early Linux distributions, including its use of inodes to store file metadata and pointers to data blocks, and how it uses direct, single indirect, double indirect, and triple indirect blocks to address files larger than can fit in the inode's direct block pointers. It also explains how unclean unmounts could corrupt the file system and led to the journaling in ext3 to prevent data loss.
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput... - npinto
This document summarizes a presentation about using CUDA (Compute Unified Device Architecture) to accelerate lattice quantum chromodynamics (QCD) calculations. CUDA is used to parallelize computations across many GPU threads. Each thread processes one lattice site, with neighboring sites and links accessed sequentially. Initially, each thread required 1.4KB of local storage, limiting occupancy. Occupancy was improved by storing data in registers instead of shared memory, expanding loops explicitly. This achieved up to 82 gigabytes per second on a GTX 280, 20 times faster than CPUs. Memory access patterns, float4 arrays, and textures were optimized to improve bandwidth utilization.
Java and the machine - Martijn Verburg and Kirk Pepperdine - JAX London
In Terminator 3 - Rise of the Machines, bare metal comes back to haunt humanity, ruthlessly crushing all resistance. This keynote is here to warn you that the same thing is happening to Java and the JVM! Java was designed in a world where there were a wide range of hardware platforms to support. Its premise of Write Once Run Anywhere (WORA) proved to be one of the compelling reasons behind Java's dominance (even if the reality didn't quite meet the marketing hype). However, this WORA property means that Java and the JVM struggled to utilise specialist hardware and operating system features that could make a massive difference in the performance of your application. This problem has recently gotten much, much worse. Due to the rise of multi-core processors, massive increases in main memory and enhancements to other major hardware components (e.g. SSD), the JVM is now distant from utilising that hardware, causing some major performance and scalability issues! Kirk Pepperdine and Martijn Verburg will take you through the complexities of where Java meets the machine and loses. They'll give up some of their hard-won insights on how to work around these issues so that you can plan to avoid termination, unlike some of the poor souls that ran into the T-800...
Shader Model 5.0 introduces several new features for vertex, hull, domain, geometry, and pixel shaders, including uniform indexing of resources, SV_Coverage system value, and double precision support. Compute shaders also gain features like raw and structured buffer views, atomic operations, and thread local storage. Compute shaders are well-suited for general purpose GPU tasks like post-processing and can perform Gaussian blur more efficiently than pixel shaders by reducing memory bandwidth usage through thread local storage.
This document discusses cache memory. It describes the location, capacity, unit of transfer, access methods, and physical types of caches. Common cache organizations include direct mapping, set associative mapping, and replacement algorithms like LRU. Write policies can be write-through or write-back. Example architectures discussed include Pentium 4 and PowerPC caches.
This document summarizes VPU and GPGPU technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU architecture including stream processing, graphics pipelines, shaders, and GPU clusters. It provides an example of using CUDA for GPU computing and discusses how GPUs are used for general purpose computing through frameworks like CUDA.
An Introduction to CUDA-OpenCL - University.pptx - AnirudhGarg35
This document provides an introduction to CUDA and OpenCL for graphics processors. It discusses how GPUs are optimized for throughput rather than latency via parallel processing. The CUDA programming model exposes thread-level parallelism through blocks of cooperative threads and SIMD parallelism. OpenCL is inspired by CUDA but is hardware-vendor neutral. Both support features like shared memory, synchronization, and memory copies between host and device. Efficient CUDA coding requires exposing abundant fine-grained parallelism and minimizing execution and memory divergence.
GPU computing provides a way to access the power of massively parallel graphics processing units (GPUs) for general purpose computing. GPUs contain over 100 processing cores and can achieve over 500 gigaflops of performance. The CUDA programming model allows programmers to leverage this parallelism by executing compute kernels on the GPU from their existing C/C++ applications. This approach democratizes parallel computing by making highly parallel systems accessible through inexpensive GPUs in personal computers and workstations. Researchers can now explore manycore architectures and parallel algorithms using GPUs as a platform.
The document outlines a course on GPU architecture and CUDA programming. It covers GPU architecture overview, CUDA tools, CUDA C introduction, parallel computing patterns, thread cooperation and synchronization, memory types, atomic operations, events and streams, and advanced CUDA scenarios. Programming GPUs requires a CUDA capable GPU.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU, PPU and GPGPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics and scientific research.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs provide massively parallel and multithreaded processing capabilities. GPUs are now commonly used for general purpose computing due to their ability to handle complex computational tasks faster than CPUs in some cases. The document then discusses GPU and PPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics, and scientific research.
The document discusses VPU and GPGPU computing. It explains that a VPU is a visual processing unit, also known as a GPU. GPUs are massively parallel and multithreaded processors that are better than CPUs for tasks like machine learning and graphics processing. The document then discusses GPU architecture, memory, and programming models like CUDA. It provides examples of GPU usage and concludes that GPGPU is used in fields like machine learning, robotics, and scientific computing.
The document summarizes upgrades made to the SVG supercomputer in 2012, including:
- Upgrading to Sandy Bridge processors with 192 cores and 1.5TB memory on thin nodes and 512GB memory on fat nodes.
- Installing an Infiniband FDR 56Gb/s network with 4Tb/s bandwidth and 1us MPI latency.
- Configuring queues to take advantage of the Infiniband network and turbo boost, allowing up to 112 cores and 1024GB memory per job.
- Benchmark results showed peak performance of 3788 GFlops on thin nodes and 563 GFlops on fat nodes.
IRJET - Performance Analysis of RSA Algorithm with CUDA Parallel Computing - IRJET Journal
This document summarizes a research paper that analyzes the performance of the RSA cryptographic algorithm when implemented in parallel using Nvidia's CUDA framework. It first describes the traditional RSA algorithm and CUDA architecture. It then discusses how RSA was designed for implementation in CUDA, with encryption and decryption operations parallelized across GPU threads and cores. The results show that GPU parallelization provides significant speedups compared to CPU-only implementation, with performance increasing based on the number of threads used.
This document provides a tutorial introduction to GPGPU computation using NVIDIA CUDA. It begins with a brief overview and warnings about the large numbers involved in GPGPU. The agenda then outlines topics to be covered including general purpose GPU computing using CUDA and optimization topics like memory bandwidth optimization. Key aspects of CUDA programming are introduced like the CUDA memory model, compute capabilities of GPUs, and profiling tools. Examples are provided of simple CUDA kernels and how to configure kernel launches for grids and blocks of threads. Optimization techniques like choosing block/grid sizes to maximize occupancy are also discussed.
[01][Languages, tools, and APIs for GPU computing] miller languages tools - laparuma
The document discusses GPU computing software and development tools from NVIDIA. It provides an overview of language and API options for GPU computing like CUDA, OpenCL, and DirectCompute. It also describes development tools like the CUDA toolkit, Parallel Nsight debugger, and Visual Profiler for optimizing GPU applications. Foundation libraries and common design patterns are also covered to help developers leverage GPUs effectively.
[05][CUDA and Fermi optimization techniques] hryu optimization - laparuma
The document discusses parallelizing a 1D heat equation simulation using CUDA. It begins with an overview of CUDA and the heat equation model. It then describes how to discretize and parallelize the heat equation using an explicit method. The key steps are:
1) Discretize the PDE into a finite difference equation using an explicit update rule between grid points over time.
2) Parallelize the computation by assigning each CUDA thread to update one grid point simultaneously, allowing the entire spatial domain to be updated in parallel at each time step (see the kernel sketch after this list).
3) Implement the algorithm by allocating memory on the GPU, launching kernels to perform the update in parallel, and copying data back to check results.
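A hedged sketch of the per-thread update described above; this is our illustration, not code from the summarized document, with the kernel name and the coefficient r (standing for alpha*dt/dx^2 in the explicit scheme) as assumptions:

// One thread updates one interior grid point per time step (explicit scheme).
__global__ void heat_step(const float *u_old, float *u_new, int n, float r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        u_new[i] = u_old[i] + r * (u_old[i-1] - 2.0f * u_old[i] + u_old[i+1]);
}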
CUDA is a parallel computing platform that allows developers to use GPUs for general purpose processing. It was developed by NVIDIA and is supported on their graphics cards. CUDA extends ANSI C, allowing programmers to write kernel functions that launch many threads to run simultaneously on the GPU. The GPU architecture includes streaming multiprocessors with scalar processors that can execute billions of threads per second. CUDA provides programmers with control over the GPU memory hierarchy and tools for optimizing parallel code performance. Future developments of CUDA and platforms like OpenCL will make GPU programming more accessible across different hardware and languages.
This document discusses GPU accelerated computing and programming with GPUs. It provides characteristics of GPUs from Nvidia, AMD, and Intel including number of cores, memory size and bandwidth, and power consumption. It also outlines the 7 steps for programming with GPUs which include building and loading a GPU kernel, allocating device memory, transferring data between host and device memory, setting kernel arguments, enqueueing kernel execution, transferring results back, and synchronizing the command queue. The goal is to achieve super parallel execution with GPUs.
Using GPUs to handle Big Data with Java by Adam Roberts - J On The Beach
Modern graphics processing units (GPUs) are efficient general-purpose stream processors. Learn how Java can exploit the power of GPUs to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
This document summarizes Nvidia's GPU technology conference (GTC16) including announcements about their Tesla P100 GPU and DGX-1 deep learning supercomputer. Key points include:
- The new Tesla P100 GPU delivers up to 21 teraflops of performance for deep learning and uses new technologies like NVLink, HBM2 memory, and a page migration engine.
- The Nvidia DGX-1 is a deep learning supercomputer powered by 8 Tesla P100 GPUs with over 170 teraflops of performance for training neural networks.
- CUDA 8 and unified memory improvements on the P100 enable simpler programming and larger datasets by allowing allocations beyond GPU memory size and
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com and http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Similar to IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT) (20)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
This document discusses preliminary work using machine learning techniques to help improve blockchain security. It outlines initial experiments using a Cosmos SDK simulator to generate test data and identify "bug correlates" that could help predict vulnerabilities. Several bugs were already found in the simulator itself. The goal is to focus compute resources on more interesting test runs likely to produce bugs. This is an encouraging first step in exploring how AI may augment blockchain security testing.
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...) - npinto
This document discusses using high-performance computing for machine learning tasks like analyzing large convolutional neural networks for visual object recognition. It proposes running hundreds of thousands of large neural network models in parallel on GPUs to more efficiently search the parameter space, beyond what is normally possible with a single graduate student and model. This high-throughput screening approach aims to identify better performing network architectures through exploring a vast number of possible combinations in the available parameter space.
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi... - npinto
The document discusses challenges with parallel programming on GPUs including tasks with statically known data dependences, SIMD divergence, lack of fine-grained synchronization and writeable coherent caches. It also presents performance results for sorting algorithms on different GPU and CPU architectures, with GPUs providing much higher sorting throughput than CPUs. Parallel prefix sum is proposed as a method for allocating work in parallel tasks that require dynamic scheduling or allocation.
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect... - npinto
The document discusses changes in computer architecture and Microsoft's role in the transition to parallel computing. It notes that computer cores are increasing rapidly and that Microsoft aims to make parallelism accessible to all developers through tools like Visual Studio. It also outlines Microsoft's involvement in GPU computing through technologies like DirectX and efforts to support GPU programming across its software stack.
The document discusses dynamic compilation for massively parallel processors. It describes how execution models provide an interface between programming languages and hardware architectures. Emerging execution models like bulk-synchronous parallel and PTX aim to abstract parallelism on heterogeneous multi-core and many-core processors. The document outlines how dynamic compilers can translate between execution models and target instructions to different core architectures through techniques like thread fusion, vectorization, and subkernel extraction. This bridging of models and architectures through just-in-time compilation helps program entire processors rather than individual cores.
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr... - npinto
The document describes the R-Stream high-level program transformation tool. It provides an overview of R-Stream, walks through the compilation process, and discusses performance results. R-Stream uses the polyhedral model to perform program transformations like loop transformations, fusion, distribution and tiling to optimize for parallelism and locality. It models the target machine and uses this to inform the mapping of operations to resources like GPUs.
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St... - npinto
The document discusses irregular parallelism on GPUs and presents several algorithms and data structures for handling irregular workloads efficiently in parallel. It covers sparse matrix-vector multiplication using different sparse matrix formats. It also discusses compositing of fragments in parallel and presents a nested data parallel approach. The document describes challenges with parallel hashing and presents a two-level hashing scheme. It analyzes parallel task queues and work stealing techniques for load balancing irregular work. Throughout, it focuses on managing communication in addition to computation for optimal parallel performance.
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli... - npinto
This document discusses performance optimization of GPU kernels. It outlines analyzing kernels to determine if they are limited by memory bandwidth, instruction throughput, or latency. The profiler can identify limiting factors by comparing memory transactions and instructions issued. Source code modifications for memory-only and math-only versions help analyze memory vs computation balance and latency hiding. The goal is to optimize kernels by addressing their most significant performance limiters.
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau... - npinto
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le... - npinto
This document summarizes a paper about using high-level programming languages for low-level systems programming. It discusses the needs of scientists and engineers for software that is reliable, high-performance, and customizable. The paper aims to address these needs by exploring features of high-level languages that could enable low-level programming tasks typically done in C/C++, like developing device drivers, operating systems, and embedded systems.
This document outlines Andreas Klockner's presentation on GPU programming in Python using PyOpenCL and PyCUDA. The presentation covers an introduction to OpenCL, programming with PyOpenCL, run-time code generation, and perspectives on GPU programming in Python. OpenCL provides a common programming framework for heterogeneous parallel programming across CPUs, GPUs, and other processors. PyOpenCL and PyCUDA allow GPU programming from Python.
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl... - npinto
Abstract:
Machine learning researchers and practitioners develop computer algorithms that "improve performance automatically through experience". At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects from online user reviews. Machine learning on big data, however, is challenging. Some "simple" machine learning algorithms with quadratic time complexity, while running fine with hundreds of records, are almost impractical to use on billions of records.

In this talk, I will describe lessons drawn from various Google projects on developing large scale machine learning systems. These systems build on top of Google's computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems and strategies for scaling and speeding up machine learning systems on web scale data.

Speaker biography:
Max Lin is a software engineer with Google Research in the New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research work on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He holds a PhD in Computer Science from Carnegie Mellon University.
Creating cluster 'mycluster' with the following settings:
- Master node: m1.small using ami-fce3c696
- Number of nodes: 1
- Node type: m1.small
- Node AMI: ami-fce3c696
- Storage: EBS volume of size 10 GB
- Security group: mycluster-sg allowing SSH from anywhere
Launching instances...
This may take a few minutes. You can check progress with 'starcluster list'.
When instances have started, SSH will be automatically configured.
You can now ssh to the master with:
starcluster ssh mycluster
Have fun and please let us know if you have
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) - npinto
This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
This document summarizes an MIT lecture on GPU cluster programming using MPI. It provides administrative details such as homework due dates and project information. It also announces various donations of computing resources for the class, including Amazon AWS credits and a Tesla graphics card for the best project. The lecture outline covers the problem of computations too large for a single CPU, an introduction to MPI, MPI basics, using MPI with CUDA, and other parallel programming approaches.
This document summarizes a lecture on CUDA Ninja Tricks given on March 1st, 2011. The lecture covered scripting GPUs with PyCUDA, meta-programming and RTCG, and a case study in brain-inspired AI. It included sections on why scripting is useful for GPUs, an introduction to GPU scripting with PyCUDA, and a hands-on example of a simple PyCUDA program that defines and runs a CUDA kernel to double the values in a GPU memory array.
[Harvard CS264] 05 - Advanced-level CUDA Programming - npinto
The document discusses optimizations for memory and communication in massively parallel computing. It recommends caching data in faster shared memory to reduce loads and stores to global device memory. This can improve performance by avoiding non-coalesced global memory accesses. The document provides an example of coalescing writes for a matrix transpose by first loading data into shared memory and then writing columns of the tile to global memory in contiguous addresses.
[Harvard CS264] 04 - Intermediate-level CUDA Programming - npinto
This document provides an overview and summary of key points from a lecture on massively parallel computing using CUDA. The lecture covers CUDA language and APIs, threading and execution models, memory and communication, tools, and libraries. It discusses the CUDA programming model including host and device code, threads and blocks, and memory allocation and transfers between the host and device. It also summarizes the CUDA runtime and driver APIs for launching kernels and managing devices at different levels of abstraction.
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics - npinto
1. GPUs have many more cores than CPUs and are very good at processing large blocks of data in parallel.
2. GPUs can provide a significant speedup over CPUs for applications that map well to a data-parallel programming model by harnessing the power of many cores.
3. The throughput-oriented nature of GPUs makes them well-suited for algorithms where the same operation can be performed on many data elements independently.
1. IAP09 CUDA@MIT 6.963
Supercomputing on your desktop:
Programming the next generation of cheap and massively parallel hardware using CUDA
Lecture 04: CUDA - Advanced #1
Nicolas Pinto (MIT)
2. During this course (adapted for 6.963), we'll try to “ ” and use existing material ;-)
25. [Library] CUDA libraries
CUDA includes 2 widely used libraries:
CUBLAS: BLAS implementation
CUFFT: FFT implementation
CUDPP (Data Parallel Primitives), available from http://www.gpgpu.org/developer/cudpp/, provides:
Reduction
Scan
Sort
26. [Library] Closely Coupled CPU-GPU
(Diagram: the CPU calls library functions; GPU Init and Alloc happen once, then Operations 1, 2, and 3 execute back-to-back on the GPU.)
Integrated programming model
High-speed data transfer (up to 5.5 GB/sec)
Asynchronous data transfer
Large GPU memory systems
27. [Library] CUBLAS
Implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA driver
Self-contained at the API level, no direct interaction with the CUDA driver
Basic model for use:
Create matrix and vector objects in GPU memory space
Fill objects with data
Call a sequence of CUBLAS functions
Retrieve data from the GPU
The CUBLAS library contains helper functions for:
Creating and destroying objects in GPU space
Writing data to and retrieving data from objects
28. [Library] Using CUBLAS
The interface to the CUBLAS library is in cublas.h
Function naming convention: cublas + BLAS name, e.g., cublasSgemm
Error handling:
CUBLAS core functions do not return errors; CUBLAS provides a function to retrieve the last error recorded
CUBLAS helper functions do return errors
Helper functions: memory allocation, data transfer
Implemented using the C-based CUDA tool chain; interfacing to C/C++ applications is trivial
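To make the call sequence concrete, here is a minimal sketch in C against the legacy cublas.h API described above. The matrix size n, the fill values, and the choice of SGEMM are illustrative, not from the slides:

/* Minimal CUBLAS call sequence sketch: create objects on the GPU,
   fill them, call a core function, check the last error, retrieve results. */
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main(void)
{
    const int n = 256;                      /* illustrative matrix size */
    float *A = malloc(n * n * sizeof(float));
    float *B = malloc(n * n * sizeof(float));
    float *C = malloc(n * n * sizeof(float));
    float *dA, *dB, *dC;
    int i;

    for (i = 0; i < n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    cublasInit();                           /* initialize CUBLAS */

    /* create matrix objects in GPU memory space */
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);

    /* fill objects with data */
    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

    /* call a CUBLAS core function: C = 1.0*A*B + 0.0*C */
    cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    /* core functions do not return errors; query the last one recorded */
    if (cublasGetError() != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasSgemm failed\n");

    /* retrieve data from the GPU */
    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);
    printf("C[0] = %f (expect %f)\n", C[0], 2.0f * n);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    free(A); free(B); free(C);
    return 0;
}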
32. [Library] Calling CUBLAS from FORTRAN
Two interfaces:
Thunking (define CUBLAS_USE_THUNKING when compiling fortran.c):
Allows interfacing to existing applications without any changes
During each call, the wrappers allocate GPU memory, copy source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy the results back to CPU memory space and deallocate the GPU memory
Intended for light testing due to call overhead
Non-Thunking (default):
Intended for production code
Substitute device pointers for vector and matrix arguments in all BLAS functions
Existing applications need to be modified slightly to allocate and deallocate data structures in GPU memory space (using CUBLAS_ALLOC and CUBLAS_FREE) and to copy data between GPU and CPU memory spaces (using CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX)
33. [Library] SGEMM example (thunking)

! Define 3 single precision matrices A, B, C
real, dimension(m1,m1) :: A, B, C
...
! Initialize
...
#ifdef CUBLAS
! Call SGEMM in the CUBLAS library using the thunking interface
! (the library takes care of memory allocation on the device and data movement)
call cublasSGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
#else
! Call SGEMM in the host BLAS library
call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
#endif

To use the host BLAS routine:
g95 -O3 code.f90 -L/usr/local/lib -lblas
To use the CUBLAS routine (fortran.c is provided by NVIDIA):
gcc -O3 -DCUBLAS_USE_THUNKING -I/usr/local/cuda/include -c fortran.c
g95 -O3 -DCUBLAS code.f90 fortran.o -L/usr/local/cuda/lib -lcublas
34. [Library] SGEMM example (non-thunking)

! Define 3 single precision matrices A, B, C
real, dimension(m1,m1) :: A, B, C
integer :: devPtrA, devPtrB, devPtrC, size_of_real = 4
...
! Initialize A, B, C
...
! Allocate matrices on the GPU
cublasAlloc(m1*m1, size_of_real, devPtrA)
cublasAlloc(m1*m1, size_of_real, devPtrB)
cublasAlloc(m1*m1, size_of_real, devPtrC)
! Copy data from CPU to GPU
cublasSetMatrix(m1, m1, size_of_real, A, m1, devPtrA, m1)
cublasSetMatrix(m1, m1, size_of_real, B, m1, devPtrB, m1)
cublasSetMatrix(m1, m1, size_of_real, C, m1, devPtrC, m1)
! Call SGEMM in the CUBLAS library using the non-thunking interface
! (the library expects data in GPU memory, so pass the device pointers)
call cublasSGEMM ('n','n',m1,m1,m1,alpha,devPtrA,m1,devPtrB,m1,beta,devPtrC,m1)
! Copy data from GPU to CPU
cublasGetMatrix(m1, m1, size_of_real, devPtrC, m1, C, m1)
! Free memory on the device
cublasFree(devPtrA)
...

g95 -O3 code.f90 -L/usr/local/cuda/lib -lcublas
39. [Library] CUFFT
The Fast Fourier Transform (FFT) is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets.
CUFFT is the CUDA FFT library:
Provides a simple interface for computing parallel FFTs on an NVIDIA GPU
Allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation
40. [Library] Supported Features
1D, 2D and 3D transforms of complex and real-valued data
Batched execution for doing multiple 1D transforms in parallel
1D transform sizes up to 8M elements
2D and 3D transform sizes in the range [2, 16384]
In-place and out-of-place transforms for real and complex data
41. [Library] Transform Types
The library supports real and complex transforms: CUFFT_C2C, CUFFT_C2R, CUFFT_R2C
Directions: CUFFT_FORWARD (-1) and CUFFT_INVERSE (1), according to the sign of the complex exponential term
Real and imaginary parts of complex input and output arrays are interleaved; the cufftComplex type is defined for this
For real-to-complex FFTs, the output array holds only the nonredundant coefficients:
N -> N/2+1
N0 x N1 x ... x Nn -> N0 x N1 x ... x (Nn/2+1)
For in-place transforms the input/output arrays need to be padded
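A minimal sketch of the N -> N/2+1 layout in code (the helper name, array names, and size are illustrative): an N-point real input produces N/2+1 interleaved cufftComplex coefficients.

/* Hypothetical helper: 1D real-to-complex transform of an N-point signal.
   The output holds only the N/2+1 nonredundant coefficients. */
#include <cuda_runtime.h>
#include <cufft.h>

void forward_r2c(const float *signal_h, cufftComplex *spectrum_h, int N)
{
    cufftReal    *in_d;
    cufftComplex *out_d;
    cufftHandle   plan;

    cudaMalloc((void **)&in_d,  sizeof(cufftReal) * N);
    cudaMalloc((void **)&out_d, sizeof(cufftComplex) * (N/2 + 1));  /* N -> N/2+1 */
    cudaMemcpy(in_d, signal_h, sizeof(cufftReal) * N, cudaMemcpyHostToDevice);

    cufftPlan1d(&plan, N, CUFFT_R2C, 1);    /* one transform (batch = 1) */
    cufftExecR2C(plan, in_d, out_d);

    cudaMemcpy(spectrum_h, out_d, sizeof(cufftComplex) * (N/2 + 1),
               cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(in_d);
    cudaFree(out_d);
}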
42. [Library] More on Transforms
For 2D and 3D transforms, CUFFT performs transforms in row-major (C) order
If calling from FORTRAN or MATLAB, remember to change the order of size parameters during plan creation
CUFFT performs un-normalized transforms: IFFT(FFT(A)) = length(A)*A
The CUFFT API is modeled after FFTW: it is based on plans that completely specify the optimal configuration to execute a particular size of FFT
Once a plan is created, the library stores whatever state is needed to execute the plan multiple times without recomputing the configuration
This works very well for CUFFT, because different kinds of FFTs require different thread configurations and GPU resources
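A minimal sketch of plan reuse and the un-normalized round trip (function name illustrative): after a forward plus inverse transform, the data comes back scaled by N.

#include <cufft.h>

/* Forward then inverse C2C transform, reusing one plan.
   CUFFT is un-normalized, so a_d ends up holding N * (original a_d). */
void roundtrip(cufftComplex *a_d, int N)
{
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);          /* plan stores the configuration once */
    cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);  /* exponent sign -1 */
    cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);  /* exponent sign +1, same plan reused */
    cufftDestroy(plan);
    /* to normalize, divide every element by N afterwards
       (e.g. with a small scaling kernel, or cublasCsscal from CUBLAS) */
}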
49. [Glue] Interfacing CUDA with other languages
CUDA kernels can be called from FORTRAN, and pinned memory can be allocated from FORTRAN
Calling CUDA from MATLAB with MEX files
Several packages (open source and commercial) interface CUDA with Python, IDL, .NET, and FORTRAN (Flagon). Browse CUDA Zone to find all the packages.
50. [Glue] Pinned memory from FORTRAN
Pinned memory provides fast PCI-e transfer speeds and enables the use of streams:
Allocation needs to be done with cudaMallocHost
Use the new Fortran 2003 features for interoperability with C

use iso_c_binding
! The allocation is performed by C function calls. Define the C pointer as type (C_PTR).
type(C_PTR) :: cptr_A, cptr_B, cptr_C
! Define the Fortran arrays as pointers.
real, dimension(:,:), pointer :: A, B, C
! Allocate memory with cudaMallocHost. The Fortran arrays, now defined as pointers,
! are then associated with the C pointers using the new interoperability defined in
! iso_c_binding. This is equivalent to allocate(A(m1,m1)).
res = cudaMallocHost ( cptr_A, m1*m1*sizeof(fp_kind) )
call c_f_pointer ( cptr_A, A, (/ m1, m1 /) )
! Use A as usual.
! See the example code for the cudaMallocHost interface code:
! http://www.nvidia.com/object/cuda_programming_tools.html
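The same mechanism is visible from plain C. A minimal sketch (size and names illustrative) of why pinning matters: the asynchronous, stream-based copy requires page-locked host memory.

#include <cuda_runtime.h>

int main(void)
{
    const int n = 1 << 22;             /* illustrative element count */
    float *a_h, *a_d;
    cudaStream_t stream;
    int i;

    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&a_h, n * sizeof(float)); /* pinned (page-locked) host memory */
    cudaMalloc((void **)&a_d, n * sizeof(float));

    for (i = 0; i < n; i++) a_h[i] = (float)i;

    /* asynchronous copy on a stream requires pinned host memory;
       the CPU is free to do other work until the synchronize */
    cudaMemcpyAsync(a_d, a_h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    /* ... overlap CPU work here ... */
    cudaStreamSynchronize(stream);

    cudaFree(a_d);
    cudaFreeHost(a_h);
    cudaStreamDestroy(stream);
    return 0;
}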
51. [Glue] Calling CUDA kernels from FORTRAN
From Fortran, call a C function that will call the CUDA kernel:
! Fortran -> C -> CUDA -> C -> Fortran
call cudafunction(c, c2, N)

/* NB: Fortran subroutine arguments are passed by reference. */
extern "C" void cudafunction_(cuComplex *a, cuComplex *b, int *Np)
{
  int N = *Np;
  cuComplex *a_d;
  cudaMalloc ((void **) &a_d, sizeof(cuComplex)*N);
  cudaMemcpy( a_d, a, sizeof(cuComplex)*N, cudaMemcpyHostToDevice);
  dim3 dimBlock(block_size);
  dim3 dimGrid (N/dimBlock.x);
  if( N % block_size != 0 ) dimGrid.x += 1;
  square_complex<<<dimGrid,dimBlock>>>(a_d, a_d, N);
  cudaMemcpy( b, a_d, sizeof(cuComplex)*N, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
}

complex_mul: main.f90 cuda_function.o
	$(FC) -o complex_mul main.f90 cuda_function.o -L/usr/local/cuda/lib -lcudart
cuda_function.o: cuda_function.cu
	nvcc -c -O3 cuda_function.cu
52. [Glue] CUDA & MATLAB
Even though MATLAB is built on many well-optimized libraries, some functions can perform better when written in a compiled language (e.g. C or Fortran).
MATLAB provides a convenient API for interfacing code written in C and FORTRAN to MATLAB functions with MEX files.
MEX files can be used to exploit multi-core processors with OpenMP or threaded codes, or, as in this case, to offload functions to the GPU.
53. [Glue] NVMEX
The native MATLAB mex script cannot parse CUDA code
A new MATLAB script, nvmex.m, compiles CUDA code (.cu) to create MATLAB function files
Syntax similar to the original mex script:
>> nvmex -f nvmexopts.bat filename.cu -IC:\cuda\include -LC:\cuda\lib -lcudart
Available for Windows and Linux from:
http://developer.nvidia.com/object/matlab_cuda.html
54. [Glue] Mex files for CUDA
A typical mex file will perform the following steps:
1. Convert from double to single precision
2. Rearrange the data layout for complex data
3. Allocate memory on the GPU
4. Transfer the data from the host to the GPU
5. Perform the computation on the GPU (library, custom code)
6. Transfer the results from the GPU to the host
7. Rearrange the data layout for complex data
8. Convert from single to double precision
9. Clean up memory and return results to MATLAB
Some of these steps will go away with new versions of the library (steps 2 and 7) and new hardware (steps 1 and 8)
55. [Glue] CUDA MEX example
Additional code in the MEX file to handle CUDA:

/* Parse input, convert to single precision and to interleaved complex format */
...
/* Allocate array on the GPU */
cufftComplex *rhs_complex_d;
cudaMalloc( (void **) &rhs_complex_d, sizeof(cufftComplex)*N*M);
/* Copy input array in interleaved format to the GPU */
cudaMemcpy( rhs_complex_d, input_single, sizeof(cufftComplex)*N*M, cudaMemcpyHostToDevice);
/* Create plan for CUDA FFT (NB: transposing dimensions) */
cufftPlan2d(&plan, N, M, CUFFT_C2C);
/* Execute FFT on GPU */
cufftExecC2C(plan, rhs_complex_d, rhs_complex_d, CUFFT_INVERSE);
/* Copy result back to host */
cudaMemcpy( input_single, rhs_complex_d, sizeof(cufftComplex)*N*M, cudaMemcpyDeviceToHost);
/* Clean up memory and plan on the GPU */
cufftDestroy(plan);
cudaFree(rhs_complex_d);
/* Convert back to double precision and to split complex format */
...
56. [Glue] Timing details
1024x1024 mesh, 400 RK4 steps on Windows, 2D isotropic turbulence

                                  Opteron 250          Opteron 2210
                                  Runtime   Speedup    Runtime   Speedup
PCI-e bandwidth                   1135 MB/s            1483 MB/s
Host to/from device               1003 MB/s            1223 MB/s
Standard MATLAB                   8098 s               9525 s
Overload FFT2 and IFFT2           4425 s    1.8x       4937 s    1.9x
Overload Szeta                    735 s     11.x       789 s     12.x
Overload Szeta, FFT2 and IFFT2    577 s     14.x       605 s     15.7x
79. [Perf] Coalescing:
Structures of size != 4, 8, or 16 bytes:
Use a Structure of Arrays (SoA) instead of an Array of Structures (AoS)
If SoA is not viable:
Force structure alignment: __align(X), where X = 4, 8, or 16
Use SMEM to achieve coalescing
(Diagram: a Point structure "x y z"; AoS lays the data out as "x y z x y z x y z", SoA as "x x x y y y z z z")
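A minimal sketch of the AoS-vs-SoA difference in kernel code (the struct and kernel names are illustrative):

/* AoS: a 12-byte struct, so consecutive threads read addresses
   12 bytes apart and the half-warp's loads cannot coalesce. */
struct Point { float x, y, z; };

__global__ void read_aos(const Point *p, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].x;
}

/* SoA: one array per field, so consecutive threads read
   consecutive floats and the loads coalesce. */
__global__ void read_soa(const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];
}

/* If SoA is not viable: pad the struct to a 16-byte boundary instead. */
struct __align__(16) Point4 { float x, y, z, pad; };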
81. [Perf] Parallel Memory Architecture
In a parallel machine, many threads access memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
(Diagram: shared memory divided into Bank 0 through Bank 15)
82. [Perf] Bank Addressing Examples
No bank conflicts: linear addressing, stride == 1
(threads 0 through 15 each read from their own bank, Bank 0 through Bank 15)
No bank conflicts: random 1:1 permutation
(threads 0 through 15 still map one-to-one onto banks 0 through 15, just in permuted order)
83. [Perf] Bank Addressing Examples
2-way bank conflicts: linear addressing, stride == 2
(pairs of threads land on the same bank, so only the even banks are used)
8-way bank conflicts: linear addressing, stride == 8
(threads land on only two banks, eight threads per bank; marked x8 in the diagram)
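A minimal sketch of how the stride shows up in code (the kernel, the shared-array size, and the launch shape are illustrative; 16 banks assumed, as on this hardware generation):

/* Launch with 256 threads per block; out must hold one float per thread. */
__global__ void stride_read(float *out, int stride)
{
    __shared__ float s[2048];
    int tid = threadIdx.x;

    /* fill shared memory cooperatively */
    for (int i = tid; i < 2048; i += blockDim.x)
        s[i] = (float)i;
    __syncthreads();

    /* A half-warp (16 threads) hits bank (tid * stride) % 16:
       stride 1 -> 16 distinct banks, conflict-free
       stride 2 -> even banks only, 2-way conflict
       stride 8 -> banks 0 and 8 only, 8-way conflict */
    out[blockIdx.x * blockDim.x + tid] = s[tid * stride];  /* keep stride <= 8 here */
}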
111. [Perf] Signals
Events are tracked with hardware counters on signals in the chip:
timestamp
gld_incoherent, gld_coherent: global memory loads are coalesced (coherent) or non-coalesced (incoherent)
gst_incoherent, gst_coherent: the same for global memory stores
local_load, local_store: local loads/stores
branch, divergent_branch: total branches and divergent branches taken by threads
instructions: instruction count
warp_serialize: thread warps that serialize on address conflicts to shared or constant memory
cta_launched: executed thread blocks
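These signals are what the command-line profiler records. A minimal sketch of its use, assuming this CUDA generation's CUDA_PROFILE / CUDA_PROFILE_CONFIG environment variables (the config file name and application name are illustrative); the config file lists the extra counters to collect, a few at a time:

cat > profile_config <<EOF
gld_incoherent
gld_coherent
divergent_branch
warp_serialize
EOF
CUDA_PROFILE=1 CUDA_PROFILE_CONFIG=profile_config ./myapp

The per-launch counter values then land in the profiler log (cuda_profile.log by default), one line per kernel launch.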