Highlighted notes of:
Chapter 8: Graphics Interoperability
Book:
CUDA by Example
An Introduction to General Purpose GPU Computing
Authors:
Jason Sanders
Edward Kandrot
“This book is required reading for anyone working with accelerator-based computing systems.”
–From the Foreword by Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory
CUDA is a computing architecture designed to facilitate the development of parallel programs. In conjunction with a comprehensive software platform, the CUDA Architecture enables programmers to draw on the immense power of graphics processing units (GPUs) when building high-performance applications. GPUs, of course, have long been available for demanding graphics and game applications. CUDA now brings this valuable resource to programmers working on applications in other domains, including science, engineering, and finance. No knowledge of graphics programming is required–just the ability to program in a modestly extended version of C.
CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. The authors introduce each area of CUDA development through working examples. After a concise introduction to the CUDA platform and architecture, as well as a quick-start guide to CUDA C, the book details the techniques and trade-offs associated with each key CUDA feature. You’ll discover when to use each CUDA C extension and how to write CUDA software that delivers truly outstanding performance.
Table of Contents
Why CUDA? Why Now?
Getting Started
Introduction to CUDA C
Parallel Programming in CUDA C
Thread Cooperation
Constant Memory and Events
Texture Memory
Graphics Interoperability
Atomics
Streams
CUDA C on Multiple GPUs
The Final Countdown
All the CUDA software tools you’ll need are freely available for download from NVIDIA.
Jason Sanders is a senior software engineer in NVIDIA’s CUDA Platform Group. He helped develop early releases of CUDA system software and contributed to the OpenCL 1.0 Specification, an industry standard for heterogeneous computing. He has held positions at ATI Technologies, Apple, and Novell.
Edward Kandrot is a senior software engineer on NVIDIA’s CUDA Algorithms team. He has more than twenty years of industry experience optimizing code performance for firms including Adobe, Microsoft, Google, and Autodesk.
CUDA by Example : Graphics Interoperability : Notes
Chapter 8: Graphics Interoperability
Since this book has focused on general-purpose computation, for the most part we’ve ignored that GPUs contain some special-purpose components as well. The GPU owes its success to its ability to perform complex rendering tasks in real time, freeing the rest of the system to concentrate on other work. This leads us to the obvious question: Can we use the GPU for both rendering and general-purpose computation in the same application? What if the images we want to render rely on the results of our computations? Or what if we want to take the frame we’ve rendered and perform some image-processing or statistics computations on it?
Fortunately, not only is this interaction between general-purpose computation and rendering modes possible, but it’s fairly easy to accomplish given what you already know. CUDA C applications can seamlessly interoperate with either of the two most popular real-time rendering APIs, OpenGL and DirectX. This chapter will look at the mechanics by which you can enable this functionality.
The examples in this chapter deviate somewhat from the precedents we’ve set in
previous chapters. In particular, this chapter assumes a significant amount about
your background with other technologies. Specifically, we have included a consid-
erable amount of OpenGL and GLUT code in these examples, almost none of
which we will explain in great depth. There are many superb resources for learning
graphics APIs, both online and in bookstores, but these topics are well beyond the
intended scope of this book.
Rather, this chapter focuses on CUDA C and the
facilities it offers for incorporating it into your graphics applications. If you are
unfamiliar with OpenGL or DirectX, you are unlikely to derive much benefit from
this chapter and may want to skip to the next.
Chapter Objectives
You will learn what graphics interoperability is and why you might use it.
You will learn how to set up a CUDA device for graphics interoperability.
You will learn how to share data between your CUDA C kernels and OpenGL
rendering.
Graphics Interoperation
To demonstrate the mechanics of interoperation between graphics and CUDA C,
we’ll write an application that works in two steps. The first step uses a CUDA C
kernel to generate image data. In the second step, the application passes this data
to the OpenGL driver to render. To accomplish this, we will use much of the CUDA
C we have seen in previous chapters along with some OpenGL and GLUT calls.
To start our application, we include the relevant GLUT and CUDA headers in order
to ensure the correct functions and enumerations are defined. We also define the
size of the window into which our application plans to render. At 512 x 512 pixels,
we will do relatively small drawings.
#define GL_GLEXT_PROTOTYPES
#include "GL/glut.h"
#include "cuda.h"
#include "cuda_gl_interop.h"
#include "../common/book.h"
#include "../common/cpu_bitmap.h"
#define DIM 512
Additionally, we declare two global variables that will store handles to the data we
intend to share between OpenGL and CUDA. We will see momentarily how we use
these two variables, but they will store different handles to the same buffer. We
need two separate variables because OpenGL and CUDA will both have different
“names” for the buffer. The variable bufferObj will be OpenGL’s name for the
data, and the variable resource will be the CUDA C name for it.
GLuint bufferObj;
cudaGraphicsResource *resource;
Now let’s take a look at the actual application. The first thing we do is select a
CUDA device on which to run our application. On many systems, this is not a
complicated process, since they will often contain only a single CUDA-enabled
GPU. However, an increasing number of systems contain more than one CUDA-
enabled GPU, so we need a method to choose one. Fortunately, the CUDA runtime
provides such a facility to us.
int main( int argc, char **argv ) {
cudaDeviceProp prop;
int dev;
memset( &prop, 0, sizeof( cudaDeviceProp ) );
prop.major = 1;
prop.minor = 0;
HANDLE_ERROR( cudaChooseDevice( &dev, &prop ) );
You may recall that we saw cudaChooseDevice() in Chapter 3, but since it was
something of an ancillary point, we’ll review it again now. Essentially, this code tells
the runtime to select any GPU that has a compute capability of version 1.0 or better.
It accomplishes this by first creating and clearing a cudaDeviceProp structure
and then by setting its major version to 1 and minor version to 0. It passes this
information to cudaChooseDevice(), which instructs the runtime to select a
GPU in the system that satisfies the constraints specified by the cudaDeviceProp
structure. In the next chapter, we will look more at what is meant by a GPU’s
compute capability, but for now it suffices to say that it roughly indicates the features
a GPU supports. All CUDA-capable GPUs have at least compute capability 1.0, so
the net effect of this call is that the runtime will select any CUDA-capable device
and return an identifier for this device in the variable dev. There is no guarantee
that this device is the best or fastest GPU, nor is there a guarantee that the device
will be the same GPU from version to version of the CUDA runtime.
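If you are curious which devices cudaChooseDevice() has to choose from on your
system, a small enumeration sketch like the following will list them; it assumes
only the standard runtime API and the chapter’s HANDLE_ERROR macro from book.h:
int count;
HANDLE_ERROR( cudaGetDeviceCount( &count ) );
for (int i=0; i<count; i++) {
cudaDeviceProp prop;
HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );
// print each candidate device's name and compute capability
printf( "Device %d: %s (compute %d.%d)\n",
i, prop.name, prop.major, prop.minor );
}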
If the result of device selection is so seemingly underwhelming, why do
we bother with all this effort to fill a cudaDeviceProp structure and call
cudaChooseDevice() to get a valid device ID? Furthermore, we never hassled
with this tomfoolery before, so why now? These are good questions. It turns out
that we need to know the CUDA device ID so that we can tell the CUDA runtime
that we intend to use the device for CUDA and OpenGL. We achieve this with a
call to cudaGLSetGLDevice(), passing the device ID dev we obtained from
cudaChooseDevice():
HANDLE_ERROR( cudaGLSetGLDevice( dev ) );
After the CUDA runtime initialization, we can proceed to initialize the OpenGL
driver by calling our GL Utility Toolkit (GLUT) setup functions. This sequence of
calls should look relatively familiar if you’ve used GLUT before:
// these GLUT calls need to be made before the other GL calls
glutInit( &argc, argv );
glutInitDisplayMode( GLUT_DOUBLE | GLUT_RGBA );
glutInitWindowSize( DIM, DIM );
glutCreateWindow( "bitmap" );
At this point in main(), we’ve prepared our CUDA runtime to play nicely with the
OpenGL driver by calling cudaGLSetGLDevice(). Then we initialized GLUT and
created a window named “bitmap” in which to draw our results. Now we can get
on to the actual OpenGL interoperation!
Shared data buffers are the key component to interoperation between CUDA C
kernels and OpenGL rendering. To pass data between OpenGL and CUDA, we will
first need to create a buffer that can be used with both APIs. We start this process
by creating a pixel buffer object in OpenGL and storing the handle in our global
variable GLuint bufferObj:
glGenBuffers( 1, &bufferObj );
glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, bufferObj );
glBufferData( GL_PIXEL_UNPACK_BUFFER_ARB, DIM * DIM * 4,
NULL, GL_DYNAMIC_DRAW_ARB );
If you have never used a pixel buffer object (PBO) in OpenGL, you will typi-
cally create one with these three steps: First, we generate a buffer handle
with glGenBuffers(). Then, we bind the handle to a pixel buffer with
glBindBuffer(). Finally, we request the OpenGL driver to allocate a buffer for
us with glBufferData(). In this example, we request a buffer to hold DIM x DIM
32-bit values and use the enumerant GL_DYNAMIC_DRAW_ARB to indicate that the
buffer will be modified repeatedly by the application. Since we have no data to preload
the buffer with, we pass NULL as the penultimate argument to glBufferData().
All that remains in our quest to set up graphics interoperability is notifying the
CUDA runtime that we intend to share the OpenGL buffer named bufferObj
with CUDA. We do this by registering bufferObj with the CUDA runtime as a
graphics resource.
HANDLE_ERROR(
cudaGraphicsGLRegisterBuffer( &resource,
bufferObj,
cudaGraphicsMapFlagsNone )
);
We specify to the CUDA runtime that we intend to use the
OpenGL PBO bufferObj with both OpenGL and CUDA by calling
cudaGraphicsGLRegisterBuffer(). The CUDA runtime returns a CUDA-
friendly handle to the buffer in the variable resource. This handle will be used to
refer to bufferObj in subsequent calls to the CUDA runtime.
The flag cudaGraphicsMapFlagsNone specifies that there is no particular
behavior of this buffer that we want to specify, although we have the option to
specify with cudaGraphicsMapFlagsReadOnly that the buffer will be read-
only. We could also use cudaGraphicsMapFlagsWriteDiscard to specify
that the previous contents will be discarded, making the buffer essentially
write-only. These flags allow the CUDA and OpenGL drivers to optimize the hard-
ware settings for buffers with restricted access patterns, although they are not
required to be set.
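For example, since the kernels in this chapter only ever write the buffer, the
registration could plausibly pass the write-discard flag instead; this is a sketch
of the option, not a change the book actually makes:
HANDLE_ERROR(
cudaGraphicsGLRegisterBuffer( &resource,
bufferObj,
cudaGraphicsMapFlagsWriteDiscard )
);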
Effectively, the call to glBufferData() requests the OpenGL driver to allocate a
buffer large enough to hold DIM x DIM 32-bit values. In subsequent OpenGL calls,
we’ll refer to this buffer with the handle bufferObj, while in CUDA runtime calls,
we’ll refer to this buffer with the pointer resource. Since we would like to read
from and write to this buffer from our CUDA C kernels, we will need more than just
a handle to the object. We will need an actual address in device memory that can be
passed to our kernel. We achieve this by instructing the CUDA runtime to map the
shared resource and then by requesting a pointer to the mapped resource.
uchar4* devPtr;
size_t size;
HANDLE_ERROR( cudaGraphicsMapResources( 1, &resource, NULL ) );
HANDLE_ERROR(
cudaGraphicsResourceGetMappedPointer( (void**)&devPtr,
&size,
resource )
);
We can then use devPtr as we would use any device pointer, except that the data
can also be used by OpenGL as a pixel source. After all these setup shenanigans,
the rest of main() proceeds as follows: First, we launch our kernel, passing it
the pointer to our shared buffer. This kernel, the code of which we have not seen
yet, generates image data to be rendered. Next, we unmap our shared resource.
This call is important to make prior to performing rendering tasks because it
provides synchronization between the CUDA and graphics portions of the applica-
tion. Specifically, it implies that all CUDA operations performed prior to the call
to cudaGraphicsUnmapResources() will complete before ensuing graphics
calls begin.
Lastly, we register our keyboard and display callback functions with GLUT
(key_func and draw_func), and we relinquish control to the GLUT rendering
loop with glutMainLoop().
dim3 grids(DIM/16,DIM/16);
dim3 threads(16,16);
kernel<<<grids,threads>>>( devPtr );
HANDLE_ERROR( cudaGraphicsUnmapResources( 1, &resource, NULL ) );
// set up GLUT and kick off main loop
glutKeyboardFunc( key_func );
glutDisplayFunc( draw_func );
glutMainLoop();
}
The remainder of the application consists of the three functions we just high-
lighted, kernel(), key_func(), and draw_func(). So, let’s take a look at
those.
The kernel function takes a device pointer and generates image data. In the
following example, we’re using a kernel inspired by the ripple example in
Chapter 5:
// based on ripple code, but uses uchar4, which is the
// type of data graphic interop uses
__global__ void kernel( uchar4 *ptr ) {
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
// now calculate the value at that position
float fx = x/(float)DIM - 0.5f;
float fy = y/(float)DIM - 0.5f;
unsigned char green = 128 + 127 *
sin( abs(fx*100) - abs(fy*100) );
// accessing uchar4 vs. unsigned char*
ptr[offset].x = 0;
ptr[offset].y = green;
ptr[offset].z = 0;
ptr[offset].w = 255;
}
Many familiar concepts are at work here. The method for turning thread and block
indices into x- and y-coordinates and a linear offset has been examined several
times. We then perform some reasonably arbitrary computations to determine the
color for the pixel at that (x,y) location, and we store those values to memory.
We’re again using CUDA C to procedurally generate an image on the GPU. The
important thing to realize is that this image will then be handed directly to OpenGL
for rendering without the CPU ever getting involved. On the other hand, in the
ripple example of Chapter 5, we generated image data on the GPU very much like
this, but our application then copied the buffer back to the CPU for display.
So, how do we draw the CUDA-generated buffer using OpenGL? Well, if you recall
the setup we performed in main(), you’ll remember the following:
glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, bufferObj );
This call bound the shared buffer as a pixel source for the OpenGL driver to
use in all subsequent calls to glDrawPixels(). Essentially, this means that
a call to glDrawPixels() is all that we need in order to render the image
data our CUDA C kernel generated. Consequently, the following is all that our
draw_func() needs to do:
static void draw_func( void ) {
glDrawPixels( DIM, DIM, GL_RGBA, GL_UNSIGNED_BYTE, 0 );
glutSwapBuffers();
}
It’s possible you’ve seen glDrawPixels() with a buffer pointer as the last
argument. The OpenGL driver will copy from this buffer if no buffer is bound as
a GL_PIXEL_UNPACK_BUFFER_ARB source. However, since our data is already on
the GPU and we have bound our shared buffer as the GL_PIXEL_UNPACK_BUFFER_ARB
source, this last parameter instead becomes an offset into the bound buffer.
Because we want to render the entire buffer, this offset is zero for our application.
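To illustrate the offset semantics with a hypothetical variation (not part of this
example), we could render just half of the buffer by passing the byte offset at
which the pixel data should start:
// hypothetical: render only DIM/2 rows of the bound PBO; the final
// argument is now a byte offset, so we skip DIM/2 rows of 4-byte pixels
glDrawPixels( DIM, DIM/2, GL_RGBA, GL_UNSIGNED_BYTE,
(GLvoid*)(DIM * (DIM/2) * 4) );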
The last component to this example seems somewhat anticlimactic, but we’ve
decided to give our users a method to exit the application. In this vein, our
key_func() callback responds only to the Esc key and uses this as a signal to
clean up and exit:
static void key_func( unsigned char key, int x, int y ) {
switch (key) {
case 27:
// clean up OpenGL and CUDA
HANDLE_ERROR( cudaGraphicsUnregisterResource( resource ) );
glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, 0 );
glDeleteBuffers( 1, &bufferObj );
exit(0);
}
}
Figure 8.1 A screenshot of the hypnotic graphics interoperation example
When run, this example draws a mesmerizing picture in “NVIDIA Green” and
black, shown in Figure 8.1. Try using it to hypnotize your friends (or enemies).
GPU Ripple with Graphics
Interoperability
In “Section 8.1: Graphics Interoperation,” we referred to Chapter 5’s GPU ripple
example a few times. If you recall, that application created a CPUAnimBitmap
and passed it a function to be called whenever a frame needed to be generated.
int main( void ) {
DataBlock data;
CPUAnimBitmap bitmap( DIM, DIM, &data );
data.bitmap = &bitmap;
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_bitmap,
bitmap.image_size() ) );
bitmap.anim_and_exit( (void (*)(void*,int))generate_frame,
(void (*)(void*))cleanup );
}
With the techniques we’ve learned in the previous section, we intend to create a
GPUAnimBitmap structure. This structure will serve the same purpose as the
CPUAnimBitmap, but in this improved version, the CUDA and OpenGL compo-
nents will cooperate without CPU intervention. When we’re done, the application
will use a GPUAnimBitmap so that main() will become simply as follows:
int main( void ) {
GPUAnimBitmap bitmap( DIM, DIM, NULL );
bitmap.anim_and_exit(
(void (*)(uchar4*,void*,int))generate_frame, NULL );
}
The GPUAnimBitmap structure uses the same calls we just examined in
Section 8.1: Graphics Interoperation. However, now these calls will be abstracted
away in a GPUAnimBitmap structure so that future examples (and potentially
your own applications) will be cleaner.
8.3.1 THE GPUANIMBITMAP STRUCTURE
Several of the data members for our GPUAnimBitmap will look familiar to you
from Section 8.1: Graphics Interoperation.
struct GPUAnimBitmap {
GLuint bufferObj;
cudaGraphicsResource *resource;
int width, height;
void *dataBlock;
void (*fAnim)(uchar4*,void*,int);
void (*animExit)(void*);
void (*clickDrag)(void*,int,int,int,int);
int dragStartX, dragStartY;
We know that OpenGL and the CUDA runtime will have different names for our
GPU buffer, and we know that we will need to refer to both of these names,
depending on whether we are making OpenGL or CUDA C calls. Therefore, our
structure will store both OpenGL’s bufferObj name and the CUDA runtime’s
resource name. Since we are dealing with a bitmap image that we intend to
display, we know that the image will have a width and height to it.
To allow users of our GPUAnimBitmap to register for certain callback events,
we will also store a void* pointer to arbitrary user data in dataBlock. Our
class will never look at this data but will simply pass it back to any registered
callback functions. The callbacks that a user may register are stored in fAnim,
animExit, and clickDrag. The function fAnim() gets called in every call to
glutIdleFunc(), and this function is responsible for producing the image data
that will be rendered in the animation. The function animExit() will be called
once, when the animation exits. This is where the user should implement cleanup
code that needs to be executed when the animation ends. Finally, clickDrag(),
an optional function, implements the user’s response to mouse click/drag events.
If the user registers this function, it gets called after every sequence of mouse
button press, drag, and release events. The location of the initial mouse click in
this sequence is stored in (dragStartX, dragStartY) so that the start and
endpoints of the click/drag event can be passed to the user when the mouse
button is released. This can be used to implement interactive animations that will
impress your friends.
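To make the callback contract concrete, a hypothetical clickDrag handler
matching the void (*)(void*,int,int,int,int) signature above might look like
this; the parameter meanings (user data, then start and end coordinates) are our
reading of the description:
// hypothetical handler: receives the dataBlock pointer plus the start
// and end points of a completed click/drag gesture
void my_click_drag( void *dataBlock, int startX, int startY,
int endX, int endY ) {
printf( "drag from (%d,%d) to (%d,%d)\n",
startX, startY, endX, endY );
}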
Initializing a GPUAnimBitmap follows the same sequence of code that we saw
in our previous example. After stashing away arguments in the appropriate
structure members, we start by querying the CUDA runtime for a suitable CUDA
device:
GPUAnimBitmap( int w, int h, void *d ) {
width = w;
height = h;
dataBlock = d;
clickDrag = NULL;
// first, find a CUDA device and set it to graphic interop
cudaDeviceProp prop;
int dev;
memset( &prop, 0, sizeof( cudaDeviceProp ) );
prop.major = 1;
prop.minor = 0;
HANDLE_ERROR( cudaChooseDevice( &dev, &prop ) );
After finding a compatible CUDA device, we make the important
cudaGLSetGLDevice() call to the CUDA runtime in order to notify it that we
intend to use dev as a device for interoperation with OpenGL:
cudaGLSetGLDevice( dev );
Since our framework uses GLUT to create a windowed rendering environment, we
need to initialize GLUT. This is unfortunately a bit awkward, since glutInit()
wants command-line arguments to pass to the windowing system. Since we have
none we want to pass, we would like to simply specify zero command-line
arguments. Unfortunately, some versions of GLUT have a bug that causes
applications to crash when zero arguments are given. So, we trick GLUT into
thinking that we’re passing an argument, and as a result, life is good.
int c=1;
char *foo = "name";
glutInit( &c, &foo );
We continue initializing GLUT exactly as we did in the previous example. We
create a window in which to render, specifying a title with the string “bitmap.” If
you’d like to name your window something more interesting, be our guest.
glutInitDisplayMode( GLUT_DOUBLE | GLUT_RGBA );
glutInitWindowSize( width, height );
glutCreateWindow( "bitmap" );
Next, we request for the OpenGL driver to allocate a buffer handle that we imme-
diately bind to the GL_PIXEL_UNPACK_BUFFER_ARB target to ensure that future
calls to glDrawPixels() will draw to our interop buffer:
glGenBuffers( 1, &bufferObj );
glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, bufferObj );
Last, but most certainly not least, we request that the OpenGL driver allocate a
region of GPU memory for us. Once this is done, we inform the CUDA runtime of
this buffer and request a CUDA C name for this buffer by registering bufferObj
with cudaGraphicsGLRegisterBuffer().
glBufferData( GL_PIXEL_UNPACK_BUFFER_ARB, width * height * 4,
NULL, GL_DYNAMIC_DRAW_ARB );
HANDLE_ERROR(
cudaGraphicsGLRegisterBuffer( &resource,
bufferObj,
cudaGraphicsMapFlagsNone ) );
}
With the GPUAnimBitmap set up, the only remaining concern is exactly how
we perform the rendering. The meat of the rendering is done in the function we
register with glutIdleFunc(). This function essentially does three things. First, it
maps our shared buffer and retrieves a GPU pointer for this buffer.
// static method used for GLUT callbacks
static void idle_func( void ) {
static int ticks = 1;
GPUAnimBitmap* bitmap = *(get_bitmap_ptr());
uchar4* devPtr;
size_t size;
HANDLE_ERROR(
cudaGraphicsMapResources( 1, &(bitmap->resource), NULL )
);
HANDLE_ERROR(
cudaGraphicsResourceGetMappedPointer( (void**)&devPtr,
&size,
bitmap->resource )
);
Second, it calls the user-specified function fAnim() that presumably will launch
a CUDA C kernel to fill the buffer at devPtr with image data.
bitmap->fAnim( devPtr, bitmap->dataBlock, ticks++ );
And lastly, it unmaps the GPU pointer, releasing the buffer for use by
the OpenGL driver in rendering. This rendering will be triggered by a call to
glutPostRedisplay().
HANDLE_ERROR(
cudaGraphicsUnmapResources( 1,
&(bitmap->resource),
NULL ) );
glutPostRedisplay();
}
The remainder of the GPUAnimBitmap structure consists of important but some-
what tangential infrastructure code. If you have an interest in it, you should by all
means examine it. But we feel that you’ll be able to proceed successfully, even if
you lack the time or interest to digest the rest of the code in GPUAnimBitmap.
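That said, one piece of that infrastructure is worth a quick sketch: the display
callback. Based on the rendering pattern from Section 8.1, GPUAnimBitmap’s draw
function presumably looks something like the following (a sketch under that
assumption, not the verbatim book code):
// sketch: with the PBO bound as the GL_PIXEL_UNPACK_BUFFER_ARB source,
// glDrawPixels() renders straight from the shared buffer (offset zero)
static void draw_func( void ) {
GPUAnimBitmap* bitmap = *(get_bitmap_ptr());
glDrawPixels( bitmap->width, bitmap->height,
GL_RGBA, GL_UNSIGNED_BYTE, 0 );
glutSwapBuffers();
}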
8.3.2 GPU RIPPLE REDUX
Now that we have a GPU version of CPUAnimBitmap, we can proceed to
retrofit our GPU ripple application to perform its animation entirely on the GPU.
To begin, we will include gpu_anim.h, the home of our implementation of
GPUAnimBitmap. We also include nearly the same kernel as we examined in
Chapter 5.
#include "../common/book.h"
#include "../common/gpu_anim.h"
#define DIM 1024
__global__ void kernel( uchar4 *ptr, int ticks ) {
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
// now calculate the value at that position
float fx = x - DIM/2;
float fy = y - DIM/2;
float d = sqrtf( fx * fx + fy * fy );
unsigned char grey = (unsigned char)(128.0f + 127.0f *
cos(d/10.0f -
ticks/7.0f) /
(d/10.0f + 1.0f));
ptr[offset].x = grey;
ptr[offset].y = grey;
ptr[offset].z = grey;
ptr[offset].w = 255;
}
The one and only change we’ve made is to the kernel’s signature: the output
buffer is now of type uchar4. The reason for this change is that OpenGL
interoperation requires our shared surfaces to be “graphics friendly.” Because
real-time rendering typically uses arrays of four-component (red/green/blue/alpha)
data elements, our target buffer is no longer simply an array of unsigned char
as it previously was. It’s now required to be an array of type uchar4. In reality,
we treated our buffer in Chapter 5 as a four-component buffer, so we always
indexed it with ptr[offset*4+k], where k indicates the component from 0 to 3.
But now, the four-component nature of the data is made explicit with the switch
to a uchar4 type.
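To see the indexing change side by side, here is a minimal illustration of our own
(not book code) that writes the same green component both ways:
// Chapter 5 style: flat unsigned char buffer, component 1 of 4 is green
__device__ void write_green_flat( unsigned char *ptr, int offset,
unsigned char green ) {
ptr[offset*4 + 1] = green;
}
// interop style: uchar4 buffer, .y is the green component
__device__ void write_green_vec( uchar4 *ptr, int offset,
unsigned char green ) {
ptr[offset].y = green;
}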
Since kernel() is a CUDA C function that generates image data, all that
remains is writing a host function that will be used as a callback in the
idle_func() member of GPUAnimBitmap. For our current application,
all this function does is launch the CUDA C kernel:
void generate_frame( uchar4 *pixels, void*, int ticks ) {
dim3 grids(DIM/16,DIM/16);
dim3 threads(16,16);
kernel<<<grids,threads>>>( pixels, ticks );
}
That’s basically everything we need, since all of the heavy lifting was
done in the GPUAnimBitmap structure. To get this party started, we just
create a GPUAnimBitmap and register our animation callback function,
generate_frame().
int main( void ) {
GPUAnimBitmap bitmap( DIM, DIM, NULL );
bitmap.anim_and_exit(
(void (*)(uchar4*,void*,int))generate_frame, NULL );
}
Heat Transfer with Graphics Interop
So, what has been the point of doing all of this? If you look at the internals of
CPUAnimBitmap, the structure we used for previous animation examples, you
will see that it works almost exactly like the rendering code in Section 8.1:
Graphics Interoperation.
Almost.
The key difference between the CPUAnimBitmap and the previous example is
buried in the call to glDrawPixels().
glDrawPixels( bitmap->x,
bitmap->y,
GL_RGBA,
GL_UNSIGNED_BYTE,
bitmap->pixels );
We remarked in the first example of this chapter that you may have previously
seen calls to glDrawPixels() with a buffer pointer as the last argument.
Well, if you hadn’t before, you have now. This call in the Draw() routine of
CPUAnimBitmap triggers a copy of the CPU buffer in bitmap->pixels to the
GPU for rendering. To do this, the CPU needs to stop what it’s doing and initiate
a copy onto the GPU for every frame. This requires synchronization between the
CPU and GPU and additional latency to initiate and complete a transfer over the
PCI Express bus. Since the call to glDrawPixels() expects a host pointer in
the last argument, this also means that after generating a frame of image data
with a CUDA C kernel, our Chapter 5 ripple application needed to copy the frame
from the GPU to the CPU with a cudaMemcpy().
void generate_frame( DataBlock *d, int ticks ) {
dim3 grids(DIM/16,DIM/16);
dim3 threads(16,16);
kernel<<<grids,threads>>>( d->dev_bitmap, ticks );
HANDLE_ERROR( cudaMemcpy( d->bitmap->get_ptr(),
d->dev_bitmap,
d->bitmap->image_size(),
cudaMemcpyDeviceToHost ) );
}
Taken together, these facts mean that our original GPU ripple application
was more than a little silly. We used CUDA C to compute image values for our
rendering in each frame, but after the computations were done, we copied the
buffer to the CPU, which then copied the buffer back to the GPU for display. This
means that we introduced unnecessary data transfers between the host and
the device that stood between us and maximum performance. Let’s revisit a
compute-intensive animation application that might see its performance improve
by migrating it to use graphics interoperation for its rendering.
If you recall the previous chapter’s heat simulation application, you will
remember that it also used CPUAnimBitmap in order to display the output of its
simulation computations. We will modify this application to use our newly imple-
mented GPUAnimBitmap structure and look at how the resulting performance
changes. As with the ripple example, our GPUAnimBitmap is almost a perfect
drop-in replacement for CPUAnimBitmap, with the exception of the unsigned
char to uchar4 change. So, the signature of our animation routine changes in
order to accommodate this shift in data types.
void anim_gpu( uchar4* outputBitmap, DataBlock *d, int ticks ) {
HANDLE_ERROR( cudaEventRecord( d->start, 0 ) );
dim3 blocks(DIM/16,DIM/16);
dim3 threads(16,16);
// since tex is global and bound, we have to use a flag to
// select which is in/out per iteration
volatile bool dstOut = true;
for (int i=0; i<90; i++) {
float *in, *out;
if (dstOut) {
in = d->dev_inSrc;
out = d->dev_outSrc;
} else {
out = d->dev_inSrc;
in = d->dev_outSrc;
}
copy_const_kernel<<<blocks,threads>>>( in );
blend_kernel<<<blocks,threads>>>( out, dstOut );
dstOut = !dstOut;
}
float_to_color<<<blocks,threads>>>( outputBitmap,
d->dev_inSrc );
HANDLE_ERROR( cudaEventRecord( d->stop, 0 ) );
HANDLE_ERROR( cudaEventSynchronize( d->stop ) );
float elapsedTime;
HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
d->start, d->stop ) );
d->totalTime += elapsedTime;
++d->frames;
printf( "Average Time per frame: %3.1f msn",
d->totalTime/d->frames );
}
Since the float_to_color() kernel is the only function that actually uses the
outputBitmap, it’s the only other function that needs modification as a result
of our shift to uchar4. This function was simply considered utility code in the
previous chapter, and we will continue to consider it utility code. However, we
have overloaded this function and included both unsigned char and uchar4
versions in book.h. You will notice that the differences between these func-
tions are identical to the differences between kernel() in the CPU-animated
and GPU-animated versions of GPU ripple. Most of the code for the float_to_
color() kernels has been omitted for clarity, but we encourage you to consult
book.h if you’re dying to see the details.
__global__ void float_to_color( unsigned char *optr,
const float *outSrc ) {
// convert floating-point value to 4-component color
optr[offset*4 + 0] = value( m1, m2, h+120 );
optr[offset*4 + 1] = value( m1, m2, h );
optr[offset*4 + 2] = value( m1, m2, h -120 );
optr[offset*4 + 3] = 255;
}
__global__ void float_to_color( uchar4 *optr,
const float *outSrc ) {
// convert floating-point value to 4-component color
optr[offset].x = value( m1, m2, h+120 );
optr[offset].y = value( m1, m2, h );
optr[offset].z = value( m1, m2, h -120 );
optr[offset].w = 255;
}
Outside of these changes, the only major difference is in the change from
CPUAnimBitmap to GPUAnimBitmap to perform animation.
int main( void ) {
DataBlock data;
GPUAnimBitmap bitmap( DIM, DIM, &data );
data.totalTime = 0;
data.frames = 0;
HANDLE_ERROR( cudaEventCreate( &data.start ) );
HANDLE_ERROR( cudaEventCreate( &data.stop ) );
int imageSize = bitmap.image_size();
// assume float == 4 chars in size (i.e., rgba)
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_inSrc,
imageSize ) );
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_outSrc,
imageSize ) );
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_constSrc,
imageSize ) );
HANDLE_ERROR( cudaBindTexture( NULL, texConstSrc,
data.dev_constSrc,
imageSize ) );
HANDLE_ERROR( cudaBindTexture( NULL, texIn,
data.dev_inSrc,
imageSize ) );
HANDLE_ERROR( cudaBindTexture( NULL, texOut,
data.dev_outSrc,
imageSize ) );
// initialize the constant data
float *temp = (float*)malloc( imageSize );
for (int i=0; i<DIM*DIM; i++) {
temp[i] = 0;
int x = i % DIM;
int y = i / DIM;
if ((x>300) && (x<600) && (y>310) && (y<601))
temp[i] = MAX_TEMP;
}
temp[DIM*100+100] = (MAX_TEMP + MIN_TEMP)/2;
temp[DIM*700+100] = MIN_TEMP;
temp[DIM*300+300] = MIN_TEMP;
temp[DIM*200+700] = MIN_TEMP;
for (int y=800; y<900; y++) {
for (int x=400; x<500; x++) {
temp[x+y*DIM] = MIN_TEMP;
}
}
HANDLE_ERROR( cudaMemcpy( data.dev_constSrc, temp,
imageSize,
cudaMemcpyHostToDevice ) );
// initialize the input data
for (int y=800; y<DIM; y++) {
for (int x=0; x<200; x++) {
temp[x+y*DIM] = MAX_TEMP;
}
}
HANDLE_ERROR( cudaMemcpy( data.dev_inSrc, temp,
imageSize,
cudaMemcpyHostToDevice ) );
free( temp );
bitmap.anim_and_exit( (void (*)(uchar4*,void*,int))anim_gpu,
(void (*)(void*))anim_exit );
}
Although it might be instructive to take a glance at the rest of this enhanced heat
simulation application, it is not sufficiently different from the previous chapter’s
version to warrant more description. The important component is answering the
question, how does performance change now that we’ve completely migrated the
application to the GPU? Without having to copy every frame back to the host for
display, the situation should be much happier than it was previously.
So, exactly how much better is it to use the graphics interoperability to perform
the rendering? Previously, the heat transfer example consumed about 25.3ms per
frame on our GeForce GTX 285–based test machine. After converting the appli-
cation to use graphics interoperability, this drops by 15 percent to 21.6ms per
frame. The net result is that our rendering loop is 15 percent faster and no longer
requires intervention from the host every time we want to display a frame. That’s
not bad for a day’s work!
DirectX Interoperability
Although we’ve looked only at examples that use interoperation with the OpenGL
rendering system, DirectX interoperation is nearly identical. You will still use a
cudaGraphicsResource to refer to buffers that you share between DirectX
and CUDA, and you will still use calls to cudaGraphicsMapResources() and
cudaGraphicsResourceGetMappedPointer() to retrieve CUDA-friendly
pointers to these shared resources.
For the most part, the calls that differ between OpenGL and DirectX interoperability
have embarrassingly simple translations to DirectX. For example, rather than
calling cudaGLSetGLDevice(), we call cudaD3D9SetDirect3DDevice()
to specify that a CUDA device should be enabled for Direct3D 9.0 interoperability.
Likewise, cudaD3D10SetDirect3DDevice() enables a device for Direct3D 10
interoperation and cudaD3D11SetDirect3DDevice() for Direct3D 11.
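As a rough sketch of that translation (assuming a Direct3D 9 device pD3DDevice
and a D3D resource d3dBuffer that you have already created through the usual
D3D9 calls), the setup might look like this:
cudaGraphicsResource *resource;
// enable the device for D3D9 interop, then register the shared resource
HANDLE_ERROR( cudaD3D9SetDirect3DDevice( pD3DDevice ) );
HANDLE_ERROR(
cudaGraphicsD3D9RegisterResource( &resource,
d3dBuffer,
cudaGraphicsRegisterFlagsNone )
);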
The details of DirectX interoperability probably will not surprise you if you’ve
worked through this chapter’s OpenGL examples. But if you want to use DirectX
interoperation and want a small project to get started, we suggest that you
migrate this chapter’s examples to use DirectX. To get started, we recom-
mend consulting the NVIDIA CUDA Programming Guide for a reference on the
API and taking a look at the GPU Computing SDK code samples on DirectX
interoperability.
Chapter Review
Although much of this book has been devoted to using the GPU for parallel,
general-purpose computing, we can’t forget the GPU’s successful day job as a
rendering engine. Many applications require or would benefit from the use of
standard computer graphics rendering. Since the GPU is master of the rendering
domain, all that stood between us and the exploitation of these resources was
a lack of understanding of the mechanics in convincing the CUDA runtime and
graphics drivers to cooperate. Now that we have seen how this is done, we
no longer need the host to intervene in displaying the graphical results of our
computations. This simultaneously accelerates the application’s rendering loop
and frees the host to perform other computations in the meantime. Even if there
are no other computations to be performed, it leaves our system more responsive
to other events and applications.
There are many other ways to use graphics interoperability that we left unex-
plored. We looked primarily at using a CUDA C kernel to write into a pixel buffer
object for display in a window. This image data can also be used as a texture that
can be applied to any surface in the scene. In addition to modifying pixel buffer
objects, you can also share vertex buffer objects between CUDA and the graphics
engine. Among other things, this allows you to write CUDA C kernels that perform
collision detection between objects or compute vertex displacement maps to be
used to render objects or surfaces that interact with the user or their surround-
ings. If you’re interested in computer graphics, CUDA C’s graphics interoperability
API enables a slew of new possibilities for your applications!