Highlighted notes of:
Chapter 3: Introduction to CUDA C
Book:
CUDA by Example
An Introduction to General Purpose GPU Computing
Authors:
Jason Sanders
Edward Kandrot
“This book is required reading for anyone working with accelerator-based computing systems.”
–From the Foreword by Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory
CUDA is a computing architecture designed to facilitate the development of parallel programs. In conjunction with a comprehensive software platform, the CUDA Architecture enables programmers to draw on the immense power of graphics processing units (GPUs) when building high-performance applications. GPUs, of course, have long been available for demanding graphics and game applications. CUDA now brings this valuable resource to programmers working on applications in other domains, including science, engineering, and finance. No knowledge of graphics programming is required–just the ability to program in a modestly extended version of C.
CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. The authors introduce each area of CUDA development through working examples. After a concise introduction to the CUDA platform and architecture, as well as a quick-start guide to CUDA C, the book details the techniques and trade-offs associated with each key CUDA feature. You’ll discover when to use each CUDA C extension and how to write CUDA software that delivers truly outstanding performance.
Table of Contents
Why CUDA? Why Now?
Getting Started
Introduction to CUDA C
Parallel Programming in CUDA C
Thread Cooperation
Constant Memory and Events
Texture Memory
Graphics Interoperability
Atomics
Streams
CUDA C on Multiple GPUs
The Final Countdown
All the CUDA software tools you’ll need are freely available for download from NVIDIA.
Jason Sanders is a senior software engineer in NVIDIA’s CUDA Platform Group. He helped develop early releases of CUDA system software and contributed to the OpenCL 1.0 Specification, an industry standard for heterogeneous computing. He has held positions at ATI Technologies, Apple, and Novell.
Edward Kandrot is a senior software engineer on NVIDIA’s CUDA Algorithms team. He has more than twenty years of industry experience optimizing code performance for firms including Adobe, Microsoft, Google, and Autodesk.
CUDA Tutorial 01 : Say Hello to CUDA : Notes (Subhajit Sahu)
Highlighted notes of:
CUDA Tutorial 01 : Say Hello to CUDA
Written by: Putt Sakdhnagool
https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/
CUDA Tutorial 02 : CUDA in Actions : Notes (Subhajit Sahu)
Highlighted notes of:
CUDA Tutorial 02 : CUDA in Actions
Written by: Putt Sakdhnagool
https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial02/
CUDA by Example : Thread Cooperation : Notes (Subhajit Sahu)
Highlighted notes of:
Chapter 5: Thread Cooperation
CUDA by Example : Graphics Interoperability : Notes (Subhajit Sahu)
Highlighted notes of:
Chapter 8: Graphics Interoperability
This is a simple quantum circuit simulator example in Qiskit, run using Anaconda and Jupyter Notebook. The example demonstrates the first level of quantum circuit design: writing the code and running it in a notebook.
C++ : How I learned to stop worrying and love metaprogramming (cppfrug)
This presentation walks through some direct applications of metaprogramming in C++ (11/14), with the goal of demonstrating its usefulness in an application setting.
Mistakes to avoid when designing DLLs, and thoughts about other platforms to deliver with your DLL.
Some compilers provide mechanisms to automatically export all functions and variables in a library that have external linkage. Avoid using any such mechanisms. Export exactly the interface that you need to export, and no more.
Highlighted notes of:
Introduction to CUDA C: NVIDIA
Author: Blaise Barney
From: GPU Clusters, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/linux_clusters/gpu/NVIDIA.Introduction_to_CUDA_C.1.pdf
Blaise Barney is a research scientist at Lawrence Livermore National Laboratory.
SIMD machines — machines capable of evaluating the same instruction on several elements of data in parallel — are nowadays commonplace and diverse, be it in supercomputers, desktop computers or even mobile ones. Numerous tools and libraries can make use of that technology to speed up their computations, yet it could be argued that there is no library that provides a satisfying minimalistic, high-level and platform-agnostic interface for the C++ developer.
Azul Virtual Machine Engineer Douglas Hawkins describes how decisions made by the JVM affect how your code is compiled and run. Learn how this affects application performance and what steps you can take to optimize how the JVM acts on your code.
Multithreading with modern C++ is hard: undefined variables, deadlocks, livelocks, race conditions, spurious wakeups, the double-checked locking pattern, and more. Underlying it all is the new memory model, which does not make life any easier. The list of things that can go wrong is very long. In this talk I give you a tour through the things that can go wrong and show how you can avoid them.
This presentation was held at the Stockholm Rust Meetup in September 2019.
This is a brief introduction to Rust that highlights some of the problems with C++ that it attempts to solve. It also contains a brief introduction to the ownership model and the borrow checker that Rust uses.
Highlighted notes of:
Chapter 10: Streams
CUDA by Example : CUDA C on Multiple GPUs : Notes (Subhajit Sahu)
Highlighted notes of:
Chapter 11: CUDA C on Multiple GPUs
Applying Anti-Reversing Techniques to Machine Code (Teodoro Cipresso)
CS266 Software Reverse Engineering (SRE): Applying Anti-Reversing Techniques to Machine Code
Teodoro (Ted) Cipresso, teodoro.cipresso@sjsu.edu
Department of Computer Science
San José State University
Spring 2015
Using GPUs to handle Big Data with Java, by Adam Roberts (J On The Beach)
Modern graphics processing units (GPUs) are efficient general-purpose stream processors. Learn how Java can exploit the power of GPUs to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
Consequences of using the Copy-Paste method in C++ programming and how to deal with it (Andrey Karpov)
I develop the PVS-Studio analyzer, which detects errors in the source code of C/C++/C++0x software, so I review a large amount of source code from various applications in which suspicious fragments were detected with the help of PVS-Studio. I have collected many examples demonstrating that an error occurred because a code fragment was copied and modified. Of course, it has long been known that using Copy-Paste in programming is a bad thing. But let's investigate this problem closely instead of limiting ourselves to just saying "do not copy the code".
Hybrid Model-Based Testing Tool Architecture for Exascale Computing Systems (CSCJournals)
Exascale computing refers to a computing system capable of at least one exaflop, expected within the next couple of years. Many new programming models, architectures, and algorithms have been introduced to reach exascale, with the primary objective of enhancing system performance. In modern supercomputers, GPUs are used to attain high computing performance, and the proposed technologies and programming models share the same goal of making the GPU more powerful. These technologies still face a number of challenges, including parallelism, scale, and complexity, that must be addressed to make computing systems more powerful and efficient. In this paper, we present a testing tool architecture for a parallel programming approach using two programming models, CUDA and OpenMP, both of which can be used to program shared memory and GPU cores. The objective of this architecture is to identify static errors that are introduced while writing the code and that cause an absence of parallelism. Our architecture encourages developers to write feasible code, avoiding essential errors so the program runs successfully.
Concisely describe the following terms 40 1. Source code 2. Object c.pdf (feelinggift)
Concisely describe the following terms (40%):
1. Source code
2. Object code
3. Compiler
4. Algorithm
5. Byte (used for what purpose)
6. Nanosecond or ns (used for what purpose)
7. System program (provide two examples)
8. Application program (provide two examples)
9. Preprocessing directive
10. ASCII
Solution
Answers:
1.
A programmer writes a program in a particular programming language (e.g., C, C++, Java). This form of the program is called the source code; it can be read and easily understood by a human being.
For example, the source code for a Hello World program is:
#include <stdio.h>
int main(void)
{
    printf("Hello World\n");
    return 0;
}
2.
An interpreter or a compiler translates source code into executable machine code. This machine
code is called object code.
A compiler translates source code into object code. Object code contains instructions to be
executed by the computer.
3.
A compiler is a program that translates a source program, or source code, written in a particular
programming language (such as C, C++, or Java) into machine language (machine code).
Example: the Turbo C compiler.
4.
The step-by-step instructions of a program are called an algorithm.
5.
1 byte = 8 bits.
A byte is a unit of measurement for data.
Each byte represents one character.
6.
A nanosecond (ns) is a unit of time representing 10^-9 second, i.e., one billionth of a second.
Computer memory speed is expressed in nanoseconds.
7.
A system program is a type of computer program that is designed to run a computer's hardware and application
programs.
Examples are :
OS (operating system)
BIOS
Boot program.
8.
Application program is a software program that runs on computer.
Examples are word processors, web browsers.
9.
Preprocessor directives are invoked by the complier to process some programs before
compilation.
Preprocessor directives are lines included in a program that begin with the character #.
10.
ASCII: American Standard Code for Information Interchange)
It is the most common format for text files in computers and on the internet.
Short Answers:
1.
Source code
|
Complier
| Assembly code
Assembler
Libraries | Object code
Link Editor
|
Executable code
2.
a)
Memory unit (storage)
ALU (Arithmetic and Logical unit)
CU (Control Unit)
b)
Storage:
All the data to be processed and the instruction required for processing.
Intermediate results of processing.
Final results of processing before these results are released to an output device.
ALU (Arithmetic and Logical unit):
It performs all the arithmetic operations (like addition, subtraction etc) and logical operations.
CU (Control Unit):
The control unit directs and controls the activities of the internal and external devices.
3.
Fetch the instruction
Decode the instruction
Read the effective address
Execute the instruction
In executing the instruction Arithmetic and Logical unit (ALU) performs mathematical or logical
operations .ALU sends a condition signal back to the control unit (CU). The result generated by
the operation is stored in the main memory or sent to o.
Checking the Open-Source Multi Theft Auto GameAndrey Karpov
We haven't used PVS-Studio to check games for a long time. So, this time we decided to return to this practice and picked out the MTA project. Multi Theft Auto (MTA) is a multiplayer modification for PC versions of the Grand Theft Auto: San Andreas game by Rockstar North that adds online multiplayer functionality. As Wikipedia tells us, the specific feature of the game is "well optimized code with fewest bugs possible". OK, let's ask our analyzer for opinion.
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...Subhajit Sahu
TrueTime is a service that enables the use of globally synchronized clocks, with bounded error. It returns a time interval that is guaranteed to contain the clock’s actual time for some time during the call’s execution. If two intervals do not overlap, then we know calls were definitely ordered in real time. In general, synchronized clocks can be used to avoid communication in a distributed system.
The underlying source of time is a combination of GPS receivers and atomic clocks. As there are “time masters” in every datacenter (redundantly), it is likely that both sides of a partition would continue to enjoy accurate time. Individual nodes however need network connectivity to the masters, and without it their clocks will drift. Thus, during a partition their intervals slowly grow wider over time, based on bounds on the rate of local clock drift. Operations depending on TrueTime, such as Paxos leader election or transaction commits, thus have to wait a little longer, but the operation still completes (assuming the 2PC and quorum communication are working).
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Adjusting Bitset for graph : SHORT REPORT / NOTESSubhajit Sahu
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is commonly used for efficient graph computations. Unfortunately, using CSR for dynamic graphs is impractical since addition/deletion of a single edge can require on average (N+M)/2 memory accesses, in order to update source-offsets and destination-indices. A common approach is therefore to store edge-lists/destination-indices as an array of arrays, where each edge-list is an array belonging to a vertex. While this is good enough for small graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck depends on whether the edge-lists are sorted or unsorted. If they are sorted, checking for an edge requires about log(E) memory accesses, but adding an edge on average requires E/2 accesses, where E is the number of edges of a given vertex. Note that both addition and deletion of edges in a dynamic graph require checking for an existing edge, before adding or deleting it. If edge lists are unsorted, checking for an edge requires around E/2 memory accesses, but adding an edge requires only 1 memory access.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Experiments with Primitive operations : SHORT REPORT / NOTESSubhajit Sahu
This includes:
- Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
- Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
- Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
- Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...Subhajit Sahu
Below are the important points I note from the 2020 paper by Martin Grohe:
- 1-WL distinguishes almost all graphs, in a probabilistic sense
- Classical WL is two dimensional Weisfeiler-Leman
- DeepWL is an unlimited version of WL graph that runs in polynomial time.
- Knowledge graphs are essentially graphs with vertex/edge attributes
ABSTRACT:
Vector representations of graphs and relational structures, whether handcrafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view.
Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research.
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESSubhajit Sahu
https://gist.github.com/wolfram77/54c4a14d9ea547183c6c7b3518bf9cd1
There exist a number of dynamic graph generators. Barbasi-Albert model iteratively attach new vertices to pre-exsiting vertices in the graph using preferential attachment (edges to high degree vertices are more likely - rich get richer - Pareto principle). However, graph size increases monotonically, and density of graph keeps increasing (sparsity decreasing).
Gorke's model uses a defined clustering to uniformly add vertices and edges. Purohit's model uses motifs (eg. triangles) to mimick properties of existing dynamic graphs, such as growth rate, structure, and degree distribution. Kronecker graph generators are used to increase size of a given graph, with power-law distribution.
To generate dynamic graphs, we must choose a metric to compare two graphs. Common metrics include diameter, clustering coefficient (modularity?), triangle counting (triangle density?), and degree distribution.
In this paper, the authors propose Dygraph, a dynamic graph generator that uses degree distribution as the only metric. The authors observe that many real-world graphs differ from the power-law distribution at the tail end. To address this issue, they propose binning, where the vertices beyond a certain degree (minDeg = min(deg) s.t. |V(deg)| < H, where H~10 is the number of vertices with a given degree below which are binned) are grouped into bins of degree-width binWidth, max-degree localMax, and number of degrees in bin with at least one vertex binSize (to keep track of sparsity). This helps the authors to generate graphs with a more realistic degree distribution.
The process of generating a dynamic graph is as follows. First the difference between the desired and the current degree distribution is calculated. The authors then create an edge-addition set where each vertex is present as many times as the number of additional incident edges it must recieve. Edges are then created by connecting two vertices randomly from this set, and removing both from the set once connected. Currently, authors reject self-loops and duplicate edges. Removal of edges is done in a similar fashion.
Authors observe that adding edges with power-law properties dominates the execution time, and consider parallelizing DyGraph as part of future work.
My notes on shared memory parallelism.
Shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory [REF].
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESSubhajit Sahu
**Community detection methods** can be *global* or *local*. **Global community detection methods** divide the entire graph into groups. Existing global algorithms include:
- Random walk methods
- Spectral partitioning
- Label propagation
- Greedy agglomerative and divisive algorithms
- Clique percolation
https://gist.github.com/wolfram77/b4316609265b5b9f88027bbc491f80b6
There is a growing body of work in *detecting overlapping communities*. **Seed set expansion** is a **local community detection method** where a relevant *seed vertices* of interest are picked and *expanded to form communities* surrounding them. The quality of each community is measured using a *fitness function*.
**Modularity** is a *fitness function* which compares the number of intra-community edges to the expected number in a random-null model. **Conductance** is another popular fitness score that measures the community cut or inter-community edges. Many *overlapping community detection* methods **use a modified ratio** of intra-community edges to all edges with atleast one endpoint in the community.
Andersen et al. use a **Spectral PageRank-Nibble method** which minimizes conductance and is formed by adding vertices in order of decreasing PageRank values. Andersen and Lang develop a **random walk approach** in which some vertices in the seed set may not be placed in the final community. Clauset gives a **greedy method** that *starts from a single vertex* and then iteratively adds neighboring vertices *maximizing the local modularity score*. Riedy et al. **expand multiple vertices** via maximizing modularity.
Several algorithms for **detecting global, overlapping communities** use a *greedy*, *agglomerative approach* and run *multiple separate seed set expansions*. Lancichinetti et al. run **greedy seed set expansions**, each with a *single seed vertex*. Overlapping communities are produced by a sequentially running expansions from a node not yet in a community. Lee et al. use **maximal cliques as seed sets**. Havemann et al. **greedily expand cliques**.
The authors of this paper discuss a dynamic approach for **community detection using seed set expansion**. Simply marking the neighbours of changed vertices is a **naive approach**, and has *severe shortcomings*. This is because *communities can split apart*. The simple updating method *may fail even when it outputs a valid community* in the graph.
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESSubhajit Sahu
A **community** (in a network) is a subset of nodes which are _strongly connected among themselves_, but _weakly connected to others_. Neither the number of output communities nor their size distribution is known a priori. Community detection methods can be divisive or agglomerative. **Divisive methods** use _betweeness centrality_ to **identify and remove bridges** between communities. **Agglomerative methods** greedily **merge two communities** that provide maximum gain in _modularity_. Newman and Girvan have introduced the **modularity metric**. The problem of community detection is then reduced to the problem of modularity maximization which is **NP-complete**. **Louvain method** is a variant of the _agglomerative strategy_, in that is a _multi-level heuristic_.
https://gist.github.com/wolfram77/917a1a4a429e89a0f2a1911cea56314d
In this paper, the authors discuss **four heuristics** for Community detection using the _Louvain algorithm_ implemented upon recently developed **Grappolo**, which is a parallel variant of the Louvain algorithm. They are:
- Vertex following and Minimum label
- Data caching
- Graph coloring
- Threshold scaling
With the **Vertex following** heuristic, the _input is preprocessed_ and all single-degree vertices are merged with their corresponding neighbours. This helps reduce the number of vertices considered in each iteration, and also help initial seeds of communities to be formed. With the **Minimum label heuristic**, when a vertex is making the decision to move to a community and multiple communities provided the same modularity gain, the community with the smallest id is chosen. This helps _minimize or prevent community swaps_. With the **Data caching** heuristic, community information is stored in a vector instead of a map, and is reused in each iteration, but with some additional cost. With the **Vertex ordering via Graph coloring** heuristic, _distance-k coloring_ of graphs is performed in order to group vertices into colors. Then, each set of vertices (by color) is processed _concurrently_, and synchronization is performed after that. This enables us to mimic the behaviour of the serial algorithm. Finally, with the **Threshold scaling** heuristic, _successively smaller values of modularity threshold_ are used as the algorithm progresses. This allows the algorithm to converge faster, and it has been observed a good modularity score as well.
From the results, it appears that _graph coloring_ and _threshold scaling_ heuristics do not always provide a speedup and this depends upon the nature of the graph. It would be interesting to compare the heuristics against baseline approaches. Future work can include _distributed memory implementations_, and _community detection on streaming graphs_.
Application Areas of Community Detection: A Review : NOTESSubhajit Sahu
This is a short review of Community detection methods (on graphs), and their applications. A **community** is a subset of a network whose members are *highly connected*, but *loosely connected* to others outside their community. Different community detection methods *can return differing communities* these algorithms are **heuristic-based**. **Dynamic community detection** involves tracking the *evolution of community structure* over time.
https://gist.github.com/wolfram77/09e64d6ba3ef080db5558feb2d32fdc0
Communities can be of the following **types**:
- Disjoint
- Overlapping
- Hierarchical
- Local.
The following **static** community detection **methods** exist:
- Spectral-based
- Statistical inference
- Optimization
- Dynamics-based
The following **dynamic** community detection **methods** exist:
- Independent community detection and matching
- Dependent community detection (evolutionary)
- Simultaneous community detection on all snapshots
- Dynamic community detection on temporal networks
**Applications** of community detection include:
- Criminal identification
- Fraud detection
- Criminal activities detection
- Bot detection
- Dynamics of epidemic spreading (dynamic)
- Cancer/tumor detection
- Tissue/organ detection
- Evolution of influence (dynamic)
- Astroturfing
- Customer segmentation
- Recommendation systems
- Social network analysis (both)
- Network summarization
- Privary, group segmentation
- Link prediction (both)
- Community evolution prediction (dynamic, hot field)
<br>
<br>
## References
- [Application Areas of Community Detection: A Review : PAPER](https://ieeexplore.ieee.org/document/8625349)
This paper discusses a GPU implementation of the Louvain community detection algorithm. Louvain algorithm obtains hierachical communities as a dendrogram through modularity optimization. Given an undirected weighted graph, all vertices are first considered to be their own communities. In the first phase, each vertex greedily decides to move to the community of one of its neighbours which gives greatest increase in modularity. If moving to no neighbour's community leads to an increase in modularity, the vertex chooses to stay with its own community. This is done sequentially for all the vertices. If the total change in modularity is more than a certain threshold, this phase is repeated. Once this local moving phase is complete, all vertices have formed their first hierarchy of communities. The next phase is called the aggregation phase, where all the vertices belonging to a community are collapsed into a single super-vertex, such that edges between communities are represented as edges between respective super-vertices (edge weights are combined), and edges within each community are represented as self-loops in respective super-vertices (again, edge weights are combined). Together, the local moving and the aggregation phases constitute a stage. This super-vertex graph is then used as input fof the next stage. This process continues until the increase in modularity is below a certain threshold. As a result from each stage, we have a hierarchy of community memberships for each vertex as a dendrogram.
Approaches to perform the Louvain algorithm can be divided into coarse-grained and fine-grained. Coarse-grained approaches process a set of vertices in parallel, while fine-grained approaches process all vertices in parallel. A coarse-grained hybrid-GPU algorithm using multi GPUs has be implemented by Cheong et al. which grabbed my attention. In addition, their algorithm does not use hashing for the local moving phase, but instead sorts each neighbour list based on the community id of each vertex.
https://gist.github.com/wolfram77/7e72c9b8c18c18ab908ae76262099329
Survey for extra-child-process package : NOTESSubhajit Sahu
Useful additions to inbuilt child_process module.
📦 Node.js, 📜 Files, 📰 Docs.
Please see attached PDF for literature survey.
https://gist.github.com/wolfram77/d936da570d7bf73f95d1513d4368573e
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERSubhajit Sahu
For the PhD forum an abstract submission is required by 10th May, and poster by 15th May. The event is on 30th May.
https://gist.github.com/wolfram77/692d263f463fd49be6eb5aa65dd4d0f9
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Subhajit Sahu
For the PhD forum an abstract submission is required by 10th May, and poster by 15th May. The event is on 30th May.
https://gist.github.com/wolfram77/1c1f730d20b51e0d2c6d477fd3713024
Fast Incremental Community Detection on Dynamic Graphs : NOTESSubhajit Sahu
In this paper, the authors describe two approaches for dynamic community detection using the CNM algorithm. CNM is a hierarchical, agglomerative algorithm that greedily maximizes modularity. They define two approaches: BasicDyn and FastDyn. BasicDyn backtracks merges of communities until each marked (changed) vertex is its own singleton community. FastDyn undoes a merge only if the quality of merge, as measured by the induced change in modularity, has significantly decreased compared to when the merge initially took place. FastDyn also allows more than two vertices to contract together if in the previous time step these vertices eventually ended up contracted in the same community. In the static case, merging several vertices together in one contraction phase could lead to deteriorating results. FastDyn is able to do this, however, because it uses information from the merges of the previous time step. Intuitively, merges that previously occurred are more likely to be acceptable later.
https://gist.github.com/wolfram77/1856b108334cc822cdddfdfa7334792a
Building a Raspberry Pi Robot with Dot NET 8, Blazor and SignalR - Slides Onl...Peter Gallagher
In this session delivered at Leeds IoT, I talk about how you can control a 3D printed Robot Arm with a Raspberry Pi, .NET 8, Blazor and SignalR.
I also show how you can use a Unity app on an Meta Quest 3 to control the arm VR too.
You can find the GitHub repo and workshop instructions here;
https://bit.ly/dotnetrobotgithub
Google Calendar is a versatile tool that allows users to manage their schedules and events effectively. With Google Calendar, you can create and organize calendars, set reminders for important events, and share your calendars with others. It also provides features like creating events, inviting attendees, and accessing your calendar from mobile devices. Additionally, Google Calendar allows you to embed calendars in websites or platforms like SlideShare, making it easier for others to view and interact with your schedules.
Chapter 3
Introduction to CUDA C
If you read Chapter 1, we hope we have convinced you of both the immense
computational power of graphics processors and that you are just the
programmer to harness it. And if you continued through Chapter 2, you should
have a functioning environment set up in order to compile and run the code
you’ll be writing in CUDA C. If you skipped the first chapters, perhaps you’re just
skimming for code samples, perhaps you randomly opened to this page while
browsing at a bookstore, or maybe you’re just dying to get started; that’s OK, too
(we won’t tell). Either way, you’re ready to get started with the first code examples, so let’s go.
Chapter Objectives
You will write your first lines of code in CUDA C.
You will learn the difference between code written for the host and code written for a device.
You will learn how to run device code from the host.
You will learn about the ways device memory can be used on CUDA-capable devices.
You will learn how to query your system for information on its CUDA-capable devices.
A First Program
#include "../common/book.h"

int main( void ) {
    printf( "Hello, World!\n" );
    return 0;
}
At this point, no doubt you’re wondering whether this book is a scam. Is this just C? Does CUDA C even exist? The answers to these questions are both in the affirmative; this book is not an elaborate ruse. This simple “Hello, World!” example is meant to illustrate that, at its most basic, there is no difference between CUDA C and the standard C to which you have grown accustomed.
This program has two notable additions to the original “Hello, World!” example:
• An empty function named kernel() qualified with __global__
• A call to the empty function, embellished with <<<1,1>>>
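Put together, a minimal version of this modified program (a sketch following the description above) looks like:

```cuda
#include <stdio.h>

// An empty function qualified with __global__, so the CUDA compiler
// builds it to run on the device rather than the host.
__global__ void kernel( void ) {
}

int main( void ) {
    // Launch the empty kernel; the <<<1,1>>> arguments are consumed by
    // the CUDA runtime, not passed to kernel() itself.
    kernel<<<1,1>>>();
    printf( "Hello, World!\n" );
    return 0;
}
```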
As we saw in the previous section, code is compiled by your system’s standard
C compiler by default. For example, GNU gcc might compile your host code
on Linux operating systems, while Microsoft Visual C compiles it on Windows
systems. The NVIDIA tools simply feed this host compiler your code, and everything behaves as it would in a world without CUDA.
Now we see that CUDA C adds the __global__ qualifier to standard C. This
mechanism alerts the compiler that a function should be compiled to run on
a device instead of the host. In this simple example, nvcc gives the function
kernel() to the compiler that handles device code, and it feeds main() to the
host compiler as it did in the previous example.
So, what is the mysterious call to kernel(), and why must we vandalize our
standard C with angle brackets and a numeric tuple? Brace yourself, because this
is where the magic happens.
We have seen that CUDA C needed a linguistic method for marking a function
as device code. There is nothing special about this; it is shorthand to send host
code to one compiler and device code to another compiler. The trick is actually in
calling the device code from the host code. One of the benefits of CUDA C is that
it provides this language integration so that device function calls look very much
like host function calls. Later we will discuss what actually happens behind the
scenes, but suffice to say that the CUDA compiler and runtime take care of the
messy business of invoking device code from the host.
So, the mysterious-looking call invokes device code, but why the angle brackets
and numbers? The angle brackets denote arguments we plan to pass to the
runtime system. These are not arguments to the device code but are parameters
that will influence how the runtime will launch our device code. We will learn
about these parameters to the runtime in the next chapter. Arguments to the
device code itself get passed within the parentheses, just like any other function
invocation.
Passing Parameters
We’ve promised the ability to pass parameters to our kernel, and the time has
come for us to make good on that promise. Consider the following enhancement
to our “Hello, World!” application:
#include <iostream>
#include "book.h"

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

int main( void ) {
    int c;
    int *dev_c;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, sizeof(int) ) );

    add<<<1,1>>>( 2, 7, dev_c );

    HANDLE_ERROR( cudaMemcpy( &c,
                              dev_c,
                              sizeof(int),
                              cudaMemcpyDeviceToHost ) );
    printf( "2 + 7 = %d\n", c );
    cudaFree( dev_c );
    return 0;
}
You will notice a handful of new lines here, but these changes introduce only two
concepts:
• We can pass parameters to a kernel as we would with any C function.
• We need to allocate memory to do anything useful on a device, such as return values to the host.
There is nothing special about passing parameters to a kernel. The angle-bracket
syntax notwithstanding, a kernel call looks and acts exactly like any function call
in standard C. The runtime system takes care of any complexity introduced by the
fact that these parameters need to get from the host to the device.
The more interesting addition is the allocation of memory using cudaMalloc().
This call behaves very similarly to the standard C call malloc(), but it tells
the CUDA runtime to allocate the memory on the device. The first argument
is a pointer to the pointer you want to hold the address of the newly allocated
memory, and the second parameter is the size of the allocation you want to make.
Besides that your allocated memory pointer is not the function’s return value,
this is identical behavior to malloc(), right down to the void* return type. The
HANDLE_ERROR() that surrounds these calls is a utility macro that we have
provided as part of this book’s support code. It simply detects that the call has
returned an error, prints the associated error message, and exits the application
with an EXIT_FAILURE code. Although you are free to use this code in your own
applications, it is highly likely that this error-handling code will be insufficient in
production code.
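As a rough idea of what such error handling can look like, here is a minimal sketch of a macro in the spirit of HANDLE_ERROR(); the book’s actual support code may differ in detail:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Check the result of a CUDA runtime call; on failure, print the error
// string along with the file and line of the call, then exit.
static void HandleError( cudaError_t err, const char *file, int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n",
                cudaGetErrorString( err ), file, line );
        exit( EXIT_FAILURE );
    }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))
```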
This raises a subtle but important point. Much of the simplicity and power of
CUDA C derives from the ability to blur the line between host and device code.
However, it is the responsibility of the programmer not to dereference the pointer
returned by cudaMalloc() from code that executes on the host. Host code may
pass this pointer around, perform arithmetic on it, or even cast it to a different
type. But you cannot use it to read or write from memory.
Unfortunately, the compiler cannot protect you from this mistake, either. It will
be perfectly happy to allow dereferences of device pointers in your host code
because it looks like any other pointer in the application. We can summarize the restrictions on the usage of device pointers as follows:

• You can pass pointers allocated with cudaMalloc() to functions that execute on the device.
• You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device.
• You can pass pointers allocated with cudaMalloc() to functions that execute on the host.
• You cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host.
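The four rules condense into a short pseudocode sketch; kernel() and host_helper() are hypothetical names, and the illegal case is commented out:

```
int *dev_p;
cudaMalloc( (void**)&dev_p, sizeof(int) );

kernel<<<1,1>>>( dev_p );   // OK: pass the device pointer to device code
host_helper( dev_p );       // OK: host code may pass it around or do arithmetic on it
// *dev_p = 7;              // WRONG: host code must never dereference it

cudaFree( dev_p );
```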
If you’ve been reading carefully, you might have anticipated the next lesson: We
can’t use standard C’s free() function to release memory we’ve allocated with
cudaMalloc(). To free memory we’ve allocated with cudaMalloc(), we need
to use a call to cudaFree(), which behaves exactly like free() does.
We’ve seen how to use the host to allocate and free memory on the device, but
we’ve also made it painfully clear that you cannot modify this memory from the
host. The remaining two lines of the sample program illustrate two of the most
common methods for accessing device memory—by using device pointers from
within device code and by using calls to cudaMemcpy().
We use pointers from within device code exactly the same way we use them in standard C code that runs on the host. The statement *c = a + b is as simple
as it looks. It adds the parameters a and b together and stores the result in the
memory pointed to by c. We hope this is almost too easy to even be interesting.
We listed the ways in which we can and cannot use device pointers from within
device and host code. These caveats translate exactly as one might imagine
when considering host pointers. Although we are free to pass host pointers
around in device code, we run into trouble when we attempt to use a host pointer
to access memory from within device code. To summarize, host pointers can
access memory from host code, and device pointers can access memory from
device code.
As promised, we can also access memory on a device through calls to
cudaMemcpy()from host code. These calls behave exactly like standard C
memcpy() with an additional parameter to specify which of the source and
destination pointers point to device memory. In the example, notice that the last
parameter to cudaMemcpy() is cudaMemcpyDeviceToHost, instructing the
runtime that the source pointer is a device pointer and the destination pointer is a
host pointer.
Unsurprisingly, cudaMemcpyHostToDevice would indicate the opposite situ-
ation, where the source data is on the host and the destination is an address on
the device. Finally, we can even specify that both pointers are on the device by
passing cudaMemcpyDeviceToDevice. If the source and destination pointers
are both on the host, we would simply use standard C’s memcpy() routine to copy
between them.
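Put side by side, the four combinations look like this (a pseudocode sketch; dev_p, dev_q, host_p, and host_q are hypothetical pointers of the appropriate kinds):

```
cudaMemcpy( dev_p,  host_p, size, cudaMemcpyHostToDevice );   // host   -> device
cudaMemcpy( host_p, dev_p,  size, cudaMemcpyDeviceToHost );   // device -> host
cudaMemcpy( dev_q,  dev_p,  size, cudaMemcpyDeviceToDevice ); // device -> device
memcpy( host_q, host_p, size );                               // host   -> host
```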
Querying Devices
Since we would like to be allocating memory and executing code on our device,
it would be useful if our program had a way of knowing how much memory and
what types of capabilities the device had. Furthermore, it is relatively common for
people to have more than one CUDA-capable device per computer. In situations
like this, we will definitely want a way to determine which processor is which.
For example, many motherboards ship with integrated NVIDIA graphics proces-
sors. When a manufacturer or user adds a discrete graphics processor to this
computer, it then possesses two CUDA-capable processors. Some NVIDIA prod-
ucts, like the GeForce GTX 295, ship with two GPUs on a single card. Computers
that contain products such as this will also show two CUDA-capable processors.
Before we get too deep into writing device code, we would love to have a
mechanism for determining which devices (if any) are present and what capa-
bilities each device supports. Fortunately, there is a very easy interface to
determine this information. First, we will want to know how many devices in the
system were built on the CUDA Architecture. These devices will be capable of
executing kernels written in CUDA C. To get the count of CUDA devices, we call
cudaGetDeviceCount(). Needless to say, we anticipate receiving an award
for Most Creative Function Name.
int count;
HANDLE_ERROR( cudaGetDeviceCount( &count ) );
After calling cudaGetDeviceCount(), we can then iterate through the devices
and query relevant information about each. The CUDA runtime returns us these
properties in a structure of type cudaDeviceProp. What kind of properties
can we retrieve? As of CUDA 3.0, the cudaDeviceProp structure contains the
following:
struct cudaDeviceProp {
    char name[256];
    size_t totalGlobalMem;
    size_t sharedMemPerBlock;
    int regsPerBlock;
    int warpSize;
    size_t memPitch;
    int maxThreadsPerBlock;
    int maxThreadsDim[3];
    int maxGridSize[3];
    size_t totalConstMem;
    int major;
    int minor;
    int clockRate;
    size_t textureAlignment;
    int deviceOverlap;
    int multiProcessorCount;
    int kernelExecTimeoutEnabled;
    int integrated;
    int canMapHostMemory;
    int computeMode;
    int maxTexture1D;
    int maxTexture2D[2];
    int maxTexture3D[3];
    int maxTexture2DArray[3];
    int concurrentKernels;
};
Some of these are self-explanatory; others bear some additional description (see
Table 3.1).
Table 3.1 CUDA Device Properties

DEVICE PROPERTY                   DESCRIPTION
char name[256]                    An ASCII string identifying the device (e.g., "GeForce GTX 280")
size_t totalGlobalMem             The amount of global memory on the device in bytes
size_t sharedMemPerBlock          The maximum amount of shared memory a single block may use in bytes
int regsPerBlock                  The number of 32-bit registers available per block
int warpSize                      The number of threads in a warp
size_t memPitch                   The maximum pitch allowed for memory copies in bytes
int maxThreadsPerBlock            The maximum number of threads that a block may contain
int maxThreadsDim[3]              The maximum number of threads allowed along each dimension of a block
int maxGridSize[3]                The number of blocks allowed along each dimension of a grid
size_t totalConstMem              The amount of available constant memory
int major                         The major revision of the device's compute capability
int minor                         The minor revision of the device's compute capability
int clockRate                     Clock frequency in kilohertz
size_t textureAlignment           The device's requirement for texture alignment
int deviceOverlap                 A boolean value representing whether the device can simultaneously perform a cudaMemcpy() and kernel execution
int multiProcessorCount           The number of multiprocessors on the device
int kernelExecTimeoutEnabled      A boolean value representing whether there is a runtime limit for kernels executed on this device
int integrated                    A boolean value representing whether the device is an integrated GPU (i.e., part of the chipset and not a discrete GPU)
int canMapHostMemory              A boolean value representing whether the device can map host memory into the CUDA device address space
int computeMode                   A value representing the device's computing mode: default, exclusive, or prohibited
int maxTexture1D                  The maximum size supported for 1D textures
int maxTexture2D[2]               The maximum dimensions supported for 2D textures
int maxTexture3D[3]               The maximum dimensions supported for 3D textures
int maxTexture2DArray[3]          The maximum dimensions supported for 2D texture arrays
int concurrentKernels             A boolean value representing whether the device supports executing multiple kernels within the same context simultaneously
We’d like to avoid going too far, too fast down our rabbit hole, so we will not
go into extensive detail about these properties now. In fact, the previous list is
missing some important details about some of these properties, so you will want
to consult the NVIDIA CUDA Programming Guide for more information. When you
move on to write your own applications, these properties will prove extremely
useful. However, for now we will simply show how to query each device and report
the properties of each. So far, our device query looks something like this:
#include "../common/book.h"

int main( void ) {
    cudaDeviceProp prop;

    int count;
    HANDLE_ERROR( cudaGetDeviceCount( &count ) );
    for (int i=0; i< count; i++) {
        HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );
        // Do something with our device's properties
    }
}
Now that we know each of the fields available to us, we can expand on the
ambiguous “Do something...” section and implement something marginally less
trivial:
#include "../common/book.h"

int main( void ) {
    cudaDeviceProp prop;

    int count;
    HANDLE_ERROR( cudaGetDeviceCount( &count ) );
    for (int i=0; i< count; i++) {
        HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );
        printf( "   --- General Information for device %d ---\n", i );
        printf( "Name:  %s\n", prop.name );
        printf( "Compute capability:  %d.%d\n", prop.major, prop.minor );
        printf( "Clock rate:  %d\n", prop.clockRate );
        printf( "Device copy overlap:  " );
        if (prop.deviceOverlap)
            printf( "Enabled\n" );
        else
            printf( "Disabled\n" );
        printf( "Kernel execution timeout :  " );
        if (prop.kernelExecTimeoutEnabled)
            printf( "Enabled\n" );
        else
            printf( "Disabled\n" );

        printf( "   --- Memory Information for device %d ---\n", i );
        printf( "Total global mem:  %ld\n", prop.totalGlobalMem );
        printf( "Total constant Mem:  %ld\n", prop.totalConstMem );
        printf( "Max mem pitch:  %ld\n", prop.memPitch );
        printf( "Texture Alignment:  %ld\n", prop.textureAlignment );

        printf( "   --- MP Information for device %d ---\n", i );
        printf( "Multiprocessor count:  %d\n",
                prop.multiProcessorCount );
        printf( "Shared mem per mp:  %ld\n", prop.sharedMemPerBlock );
        printf( "Registers per mp:  %d\n", prop.regsPerBlock );
        printf( "Threads in warp:  %d\n", prop.warpSize );
        printf( "Max threads per block:  %d\n",
                prop.maxThreadsPerBlock );
        printf( "Max thread dimensions:  (%d, %d, %d)\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1],
                prop.maxThreadsDim[2] );
        printf( "Max grid dimensions:  (%d, %d, %d)\n",
                prop.maxGridSize[0], prop.maxGridSize[1],
                prop.maxGridSize[2] );
        printf( "\n" );
    }
}
Using Device Properties
Other than writing an application that handily prints every detail of every CUDA-
capable card, why might we be interested in the properties of each device in our
system? Since we as software developers want everyone to think our software is
fast, we might be interested in choosing the GPU with the most multiprocessors
on which to run our code. Or if the kernel needs close interaction with the CPU,
we might be interested in running our code on the integrated GPU that shares
system memory with the CPU. These are both properties we can query with
cudaGetDeviceProperties().
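Selecting the GPU with the most multiprocessors is just a maximum scan over the queried properties. As a self-contained sketch, the selection logic can be written over plain counts; in real code each entry would be the multiProcessorCount field filled in by cudaGetDeviceProperties() for device i.

```c
/* Return the index of the device with the most multiprocessors.
   mp_counts[i] stands in for the multiProcessorCount of device i. */
static int best_device( const int *mp_counts, int count ) {
    int best = 0;
    for (int i = 1; i < count; i++) {
        if (mp_counts[i] > mp_counts[best])
            best = i;
    }
    return best;
}
```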
Suppose that we are writing an application that depends on having double-
precision floating-point support. After a quick consultation with Appendix A of the
NVIDIA CUDA Programming Guide, we know that cards that have compute capa-
bility 1.3 or higher support double-precision floating-point math. So to success-
fully run the double-precision application that we’ve written, we need to find at
least one device of compute capability 1.3 or higher.
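The capability test itself is a one-line predicate; here is a minimal sketch with major and minor as plain ints (in real code they come from the cudaDeviceProp fields of the same names):

```c
/* Return 1 if compute capability major.minor is at least 1.3, the
   minimum required for double-precision support; 0 otherwise. */
static int supports_double_precision( int major, int minor ) {
    return (major > 1) || (major == 1 && minor >= 3);
}
```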
Based on what we have seen with cudaGetDeviceCount() and
cudaGetDeviceProperties(), we could iterate through each device and look
for one that either has a major version greater than 1 or has a major version of
1 and minor version greater than or equal to 3. But since this relatively common
procedure is also relatively annoying to perform, the CUDA runtime offers us an
automated way to do this. We first fill a cudaDeviceProp structure with the
properties we need our device to have.
cudaDeviceProp prop;
memset( &prop, 0, sizeof( cudaDeviceProp ) );
prop.major = 1;
prop.minor = 3;
After filling a cudaDeviceProp structure, we pass it to
cudaChooseDevice() to have the CUDA runtime find a device that satisfies
this constraint. The call to cudaChooseDevice() returns a device ID that we
can then pass to cudaSetDevice(). From this point forward, all device opera-
tions will take place on the device we found in cudaChooseDevice().
#include "../common/book.h"

int main( void ) {
    cudaDeviceProp prop;
    int dev;

    HANDLE_ERROR( cudaGetDevice( &dev ) );
    printf( "ID of current CUDA device:  %d\n", dev );

    memset( &prop, 0, sizeof( cudaDeviceProp ) );
    prop.major = 1;
    prop.minor = 3;
    HANDLE_ERROR( cudaChooseDevice( &dev, &prop ) );
    printf( "ID of CUDA device closest to revision 1.3:  %d\n", dev );
    HANDLE_ERROR( cudaSetDevice( dev ) );
}
Systems with multiple GPUs are becoming more and more common. For
example, many of NVIDIA’s motherboard chipsets contain integrated, CUDA-
capable GPUs. When a discrete GPU is added to one of these systems, you
suddenly have a multi-GPU platform. Moreover, NVIDIA’s SLI technology allows
multiple discrete GPUs to be installed side by side. In either of these cases, your
application may have a preference of one GPU over another. If your application
depends on certain features of the GPU or depends on having the fastest GPU
in the system, you should familiarize yourself with this API because there is no
guarantee that the CUDA runtime will choose the best or most appropriate GPU
for your application.
Chapter Review
We’ve finally gotten our hands dirty writing CUDA C, and ideally it has been less
painful than you might have suspected. Fundamentally, CUDA C is standard C
with some ornamentation to allow us to specify which code should run on the
device and which should run on the host. By adding the keyword __global__
before a function, we indicated to the compiler that we intend to run the function
on the GPU. To use the GPU’s dedicated memory, we also learned a CUDA API
similar to C’s malloc(), memcpy(), and free() APIs. The CUDA versions of
these functions, cudaMalloc(), cudaMemcpy(), and cudaFree(), allow us
to allocate device memory, copy data between the device and host, and free the
device memory when we’ve finished with it.
As we progress through this book, we will see more interesting examples of
how we can effectively use the device as a massively parallel coprocessor. For
now, you should know how easy it is to get started with CUDA C, and in the next
chapter we will see how easy it is to execute parallel code on the GPU.